Miscellany: Long Run Behavior of Bayesian Methods; Bayesian Experimental Design (Lecture 4)


Tom Loredo, Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/bayes/

Bayesian use of sample space integrals

The Frequentist Outlook

Probabilities for hypotheses are meaningless because hypotheses are not random variables. Data are random, so only probabilities for data can appear in calculations, and these probabilities must be interpreted as long-run frequencies. One seeks to identify procedures that have good behavior in the long run: what is good for the long run is good for the case at hand.

The Bayesian Outlook

Quantify information about the case at hand as completely and consistently as possible, with no explicit regard given to long-run performance. But a result that claims to be optimal in each case should behave well in the long run. Is what is good for the case at hand also good for the long run?

Long Run Behavior of Bayesian Methods

Agenda:
- Bayesian calibration
- Consistency & convergence of Bayesian methods

Bayesian Calibration

A credible region $\Delta(D)$ with probability $P$:

$$P = \int_{\Delta(D)} d\theta\, \frac{p(\theta|I)\, p(D|\theta, I)}{p(D|I)}$$

What fraction of the time, $Q$, will the true $\theta$ be in $\Delta(D)$?

1. Draw $\theta$ from $p(\theta|I)$
2. Simulate data from $p(D|\theta, I)$
3. Calculate $\Delta(D)$ and see if $\theta \in \Delta(D)$

$$Q = \int d\theta\, p(\theta|I) \int dD\, p(D|\theta, I)\, [\theta \in \Delta(D)]$$

where $[\,\cdot\,]$ equals 1 if its argument is true and 0 otherwise.

$$Q = \int d\theta\, p(\theta|I) \int dD\, p(D|\theta, I)\, [\theta \in \Delta(D)]$$

Note the appearance of $p(\theta, D|I) = p(\theta|D, I)\, p(D|I)$:

$$Q = \int dD \int d\theta\, p(\theta|D, I)\, p(D|I)\, [\theta \in \Delta(D)]
= \int dD\, p(D|I) \int_{\Delta(D)} d\theta\, p(\theta|D, I)
= \int dD\, p(D|I)\, P = P$$

Bayesian inferences are calibrated. Always. Calibration is with respect to the choice of prior and likelihood.
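The three-step recipe above is easy to run end to end. Below is a minimal sketch in Python, assuming (my choice, not from the slides) a conjugate Gaussian model with a standard normal prior on $\theta$ and Gaussian noise, so the 95% credible region is available in closed form; the empirical hit rate $Q$ should come out close to $P$ for any such choice.

```python
# Monte Carlo check of Bayesian calibration: Q should match P.
import numpy as np

rng = np.random.default_rng(42)
P = 0.95              # credible probability of the reported region
sigma, n = 1.0, 5     # assumed noise scale and sample size
z = 1.959964          # standard normal quantile for a central 95% interval

hits, trials = 0, 20_000
for _ in range(trials):
    theta = rng.normal(0.0, 1.0)              # 1. draw theta from the prior
    D = rng.normal(theta, sigma, size=n)      # 2. simulate data from p(D|theta,I)
    # Conjugate Gaussian posterior: precisions add, means combine by weight
    post_prec = 1.0 + n / sigma**2
    post_mean = (D.sum() / sigma**2) / post_prec
    post_sd = post_prec**-0.5
    # 3. see if theta lies in the central 95% credible interval
    hits += abs(theta - post_mean) <= z * post_sd

print(f"Q = {hits / trials:.4f}  (P = {P})")
```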

Real-Life Confidence Regions

Theoretical (frequentist) confidence regions: a rule $\delta(D)$ gives a region with covering probability

$$C_\delta(\theta) = \int dD\, p(D|\theta, I)\, [\theta \in \delta(D)]$$

It's a confidence region iff $C_\delta(\theta) = P$, a constant. Such rules almost never exist in practice! The CR requirement is often relaxed: require $C_\delta(\theta) \geq P$ (conservative). The actual coverage of many standard regions thus fluctuates (even for coin flipping; Brown et al. 2000).

Average coverage: intuition suggests reporting some kind of average performance:

$$\int d\theta\, f(\theta)\, C_\delta(\theta)$$

Recall the Bayesian calibration condition:

$$P = \int d\theta\, p(\theta|I) \int dD\, p(D|\theta, I)\, [\theta \in \Delta(D)] = \int d\theta\, p(\theta|I)\, C_\delta(\theta)$$

provided we take $\delta(D) = \Delta(D)$. If $C_\delta(\theta) = P$, the credible region is a confidence region. Otherwise, the credible region's probability content accounts for a priori uncertainty in $\theta$; we need priors for this.

Calibration for Bayesian Model Comparison

Assign prior probabilities to $N_M$ different models. Choose as the "true" model the one with the highest posterior probability, but only if that probability exceeds $P_{\rm crit}$. Iterate via Monte Carlo (see the sketch after this list):

1. Choose a model by sampling from the model prior.
2. Choose parameters for that model by sampling from the parameter prior pdf.
3. Sample data from that model's sampling distribution, conditioned on the chosen parameters.
4. Calculate the posteriors for all the models; choose the most probable if its $P > P_{\rm crit}$.

We will be correct at least $100\, P_{\rm crit}\,\%$ of the time that we reach a conclusion in the Monte Carlo experiment.
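Here is a minimal sketch of that experiment for a pair of Gaussian models (a choice of mine, not from the slides): $M_1$ fixes the mean at zero, while $M_2$ gives it a $N(0, \tau^2)$ prior, so the marginal likelihoods and the Bayes factor are available in closed form.

```python
# Calibration of Bayesian model comparison, checked by Monte Carlo simulation.
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, n = 1.0, 3.0, 10     # assumed noise scale, prior scale, sample size
P_crit = 0.99
prior = (0.5, 0.5)               # model prior: p(M1|I), p(M2|I)

decided = correct = 0
for _ in range(50_000):
    m2_true = rng.random() < prior[1]               # 1. sample a model
    mu = rng.normal(0.0, tau) if m2_true else 0.0   # 2. sample its parameters
    d = rng.normal(mu, sigma, n)                    # 3. sample data
    # Closed-form log Bayes factor B21 for this conjugate pair
    a = n / sigma**2 + 1.0 / tau**2
    logB21 = d.sum()**2 / (2 * sigma**4 * a) - 0.5 * np.log(a * tau**2)
    p2 = 1.0 / (1.0 + (prior[0] / prior[1]) * np.exp(-logB21))
    if max(p2, 1.0 - p2) > P_crit:                  # 4. conclude only if confident
        decided += 1
        correct += ((p2 > 0.5) == m2_true)

print(f"accuracy among decided cases: {correct / decided:.4f}  (P_crit = {P_crit})")
```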

Robustness to the model prior: what if the model frequencies do not equal the model priors?

Choose between two models based on the Bayes factor $B$, but let them occur with unequal frequencies. Let

$$\gamma = \max\left[\frac{p(M_1|I)}{p(M_2|I)},\ \frac{p(M_2|I)}{p(M_1|I)}\right]$$

The fraction of the time a correct conclusion is made, if we require $B > B_{\rm crit}$ or $B < 1/B_{\rm crit}$, is

$$Q > \frac{1}{1 + \gamma/B_{\rm crit}}$$

E.g., if $B_{\rm crit} = 100$: correct 99% of the time if $\gamma = 1$; correct 91% of the time if $\gamma = 9$.

A Worry: Incorrect Models

What if none of the models is true?

Comfort from experience: rarely are statistical models precisely true, yet standard models have proved themselves adequate in applications.

Comfort from probabilists: studies of consistency in the framework of nonparametric Bayesian inference show good priors are those that are "approximately right for most densities; parametric priors [e.g., histograms] are often good enough" (Lavine 1994).

One should worry somewhat, but there is not yet any theory providing a consistent, quantitative "model failure alert."

Bayesian Consistency & Convergence

Parameter estimation:
- Estimates are consistent if the prior doesn't exclude the true value.
- Credible regions found with flat priors are typically confidence regions to $O(n^{-1/2})$.
- Using standard nonuniform "reference" priors can improve their performance to $O(n^{-1})$.
- For handling nuisance parameters, regions based on marginal likelihoods have superior long-run performance compared with regions found with conventional frequentist methods like profile likelihood.

Model comparison:
- Bayesian model comparison is asymptotically consistent. Popular frequentist procedures (e.g., the $\chi^2$ test, the asymptotic likelihood ratio ($\Delta\chi^2$), AIC) are not.
- For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.
- When selecting between more than 2 models, carrying out multiple frequentist significance tests can give misleading results. Bayes factors continue to function well.

Summary

Parametric Bayesian methods are typically excellent frequentist methods! This is not too surprising: methods that claim to be optimal for each individual case should be good in the long run, too.

Bayesian Adaptive Exploration

[Diagram: the adaptive exploration cycle. Observation combines prior information & data into combined information; Inference turns that into interim results and predictions; Design turns predictions into a strategy for new data. Underlying theory: decision theory and experimental design.]

- Proof of concept: exoplanets
  - Motivation: the SIM EPIcS Survey
  - Demonstration: a few BAE cycles
- Challenges

Bayesian Decision Theory

Decisions depend on consequences: one might bet on an improbable outcome provided the payoff is large if it occurs and the loss is small if it doesn't.

Utility and loss functions: compare consequences via a utility quantifying the benefits of a decision, or via a loss quantifying its costs.

Utility = $U(c, o)$, where $c$ is the choice of action (what we decide between) and $o$ is the outcome (what we are uncertain of).

Deciding amidst uncertainty

We are uncertain of what the outcome will be, so average over outcomes:

$$EU(c) = \sum_{o} P(o|I)\, U(c, o)$$

The best choice maximizes the expected utility:

$$\hat{c} = \arg\max_c EU(c)$$
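A minimal sketch of this rule on the betting example from the previous slide; the outcome probabilities and payoffs here are invented for illustration.

```python
# Expected-utility decision making: average U(c, o) over outcomes, then maximize.
P = {"win": 0.05, "lose": 0.95}                        # P(o|I)
U = {("bet", "win"): 100.0, ("bet", "lose"): -1.0,     # U(c, o)
     ("pass", "win"): 0.0, ("pass", "lose"): 0.0}

def EU(c):
    return sum(P[o] * U[(c, o)] for o in P)

best = max(("bet", "pass"), key=EU)
print({c: EU(c) for c in ("bet", "pass")}, "->", best)
# EU(bet) = 0.05 * 100 - 0.95 * 1 = 4.05 > 0: the improbable bet is rational
# because the payoff is large and the loss is small.
```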

Bayesian Experimental Design

Basic principles:
- Choices = $\{e\}$, the possible experiments (sample times, sample sizes, ...).
- Outcomes = $\{d\}$, the values of future data.
- The utility balances the value of $d$ for achieving the experiment's goals against the cost of the experiment.

Choose the experiment that maximizes

$$EU(e) = \sum_d p(d|e, I)\, U(e, d)$$

To predict $d$ we must know which of several hypothetical states of nature $H_i$ is true. Average over the $H_i$:

$$EU(e) = \sum_{H_i} p(H_i|I) \sum_d p(d|H_i, e, I)\, U(e, d)$$

Information as Utility

A common goal: discern among the $H_i$. Take the utility to be the information $I(e, d)$ in $p(H_i|d, e, I)$:

$$U(e, d) = \sum_{H_i} p(H_i|d, e, I) \log[p(H_i|d, e, I)] = -(\text{entropy of the posterior})$$

Design to maximize the expected information.

Measuring Information With Entropy

Entropy of a Gaussian:

$$p(x) \propto e^{-(x-\mu)^2/2\sigma^2} \quad\Rightarrow\quad \mathcal{I} = \log \sigma + \text{const}$$

$$p(\vec{x}) \propto \exp\left[-\tfrac{1}{2}\, \vec{x}^\top \mathbf{V}^{-1} \vec{x}\right] \quad\Rightarrow\quad \mathcal{I} = \tfrac{1}{2} \log(\det \mathbf{V}) + \text{const}$$

Entropy measures volume, not width.

[Figure: two distributions of different shapes with the same entropy, i.e., the same amount of information.]
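The volume-not-width point is quick to verify numerically. Here is a minimal sketch using the closed-form entropy of a multivariate Gaussian, $h = \tfrac{1}{2} \log \det(2\pi e \mathbf{V})$; the two covariance matrices are my own examples, chosen to have equal determinants.

```python
# Entropy of a multivariate Gaussian depends on det(V) (volume), not on the
# width of any single direction.
import numpy as np

def gaussian_entropy(V):
    k = V.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(V)[1])

V_round = np.diag([1.0, 1.0])       # isotropic
V_long  = np.diag([4.0, 0.25])      # elongated, but det V is also 1
print(gaussian_entropy(V_round), gaussian_entropy(V_long))   # identical
```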

Finding Exoplanets: The Space Interferometry Mission

SIM in 2009 (?)

[Figure: the Sun's astrometric wobble as seen from 10 pc.]

Tier 1. EPIcS: Extrasolar Planet Interferometric Survey

Goal: identify Earth-like planets in habitable regions around nearby Sun-like stars. This requires:
- 1 µas astrometry
- Long integration times
- Astrometrically stable reference stars

75 main-sequence stars within 10 pc, 70 epochs per target.

Tier 2

Goal: explore the nature and evolution of planetary systems in their full variety. Requires 4 µas astrometry and short integration times; 1000 targets, piggybacking on Tier 1.

Preparatory observing:
- High-precision radial velocity and adaptive optics observing
- Identify science targets
- Identify reference stars (K giants? eccentric binaries?)

Huge resource expenditures: we must optimize the use of resources.

Example: Orbit Estimation With Radial Velocity Observations

Data are Kepler velocity plus noise:

$$d_i = V(t_i; \tau, e, K) + e_i$$

The 3 remaining geometrical parameters $(t_0, \lambda, i)$ are fixed. The noise probability is Gaussian with known $\sigma = 8$ m s$^{-1}$. Simulate data with typical Jupiter-like exoplanet parameters:

$$\tau = 800 \text{ d}, \quad e = 0.5, \quad K = 50 \text{ m s}^{-1}$$

Goal: estimate the parameters $\tau$, $e$, and $K$.

Cycle 1: Observation

The prior setup stage specifies 10 equispaced observations.
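The simulation is straightforward to reproduce. Below is a minimal sketch of the data model $d_i = V(t_i) + e_i$ with the slide's fiducial parameters; the time baseline, $\omega = 0$, $t_0 = 0$, and zero systemic velocity are my assumptions, chosen so that $V = K[e + \cos\nu]$ matches the periodogram model quoted on a later slide.

```python
# Simulate Keplerian radial velocity data: d_i = V(t_i; tau, e, K) + noise.
import numpy as np

def kepler_E(M, e, tol=1e-10):
    """Solve Kepler's equation M = E - e sin E by Newton iteration."""
    E = M.copy()
    for _ in range(50):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def rv_curve(t, tau, e, K, t0=0.0, omega=0.0):
    """Radial velocity V = K [cos(nu + omega) + e cos(omega)]."""
    M = 2.0 * np.pi * (((t - t0) / tau) % 1.0)       # mean anomaly
    E = kepler_E(M, e)                               # eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1 + e) * np.sin(E / 2),
                          np.sqrt(1 - e) * np.cos(E / 2))   # true anomaly
    return K * (np.cos(nu + omega) + e * np.cos(omega))

rng = np.random.default_rng(0)
tau, e, K, sigma = 800.0, 0.5, 50.0, 8.0     # slide's fiducial values
t = np.linspace(0.0, 1600.0, 10)             # 10 equispaced epochs (cycle 1)
d = rv_curve(t, tau, e, K) + rng.normal(0.0, sigma, t.size)
```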

Cycle 1: Inference

Use flat priors:

$$p(\tau, e, K|D, I) \propto \exp[-Q(\tau, e, K)/2\sigma^2]$$

where $Q$ is the sum of squared residuals using best-fit amplitudes. Generate $\{\tau_j, e_j, K_j\}$ via posterior sampling.
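A minimal Metropolis sketch of the posterior sampling step, reusing rv_curve and the simulated t, d, sigma, and rng from the block above. One simplification to note: the slide's $Q$ marginalizes the linear amplitudes analytically (the Bretthorst algorithm), whereas here $Q$ is just the residual sum of squares at fixed amplitudes, and the chain is started near the true values for brevity.

```python
# Metropolis sampling of p(tau, e, K | D, I) ∝ exp(-Q / 2 sigma^2), flat priors.
def log_post(theta):
    tau_, e_, K_ = theta
    if not (tau_ > 0.0 and 0.0 < e_ < 1.0 and K_ > 0.0):
        return -np.inf                       # outside the flat prior's support
    resid = d - rv_curve(t, tau_, e_, K_)
    return -0.5 * np.sum(resid**2) / sigma**2

theta = np.array([800.0, 0.5, 50.0])
lp = log_post(theta)
step = np.array([5.0, 0.02, 2.0])            # proposal scales (tuned by hand)
samples = []
for _ in range(20_000):
    prop = theta + step * rng.normal(size=3)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
        theta, lp = prop, lp_prop
    samples.append(theta)
samples = np.array(samples)                  # posterior draws {tau_j, e_j, K_j}
```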

Aside: Kepler Periodograms

Keplerian radial velocity model:

$$V(t) = A_1 + A_2 [e + \cos \upsilon(t)] + A_3 \sin \upsilon(t)$$
$$\upsilon(t) = f(t; \tau, e, T) \text{ via Kepler's equation}$$

- Period $\tau$ and 2 other nonlinear parameters $(e, T)$
- 3 linear amplitudes (COM velocity, orbital velocity, $\lambda$)

Use the Bretthorst algorithm. For $e = 0$ this reduces to the Lomb-Scargle periodogram, the current standard tool, but the Bayesian generalization accounts for orbital eccentricity.

For astrometry, 2-D data require $x(t)$ and $y(t)$, with extra parameters: inclination, parallax, and proper motion.

Cycle 1: Design

Predict the value of a future datum at time $t$:

$$p(d|t, D, I) = \int d\tau\, de\, dK\, p(\tau, e, K|D, I)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{[d - v(t; \tau, e, K)]^2}{2\sigma^2}\right)$$

$$\approx \frac{1}{N} \sum_{\{\tau_j, e_j, K_j\}} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{[d - v(t; \tau_j, e_j, K_j)]^2}{2\sigma^2}\right)$$

Effect of a datum on inferences: the information if we sample at $t$ and get datum $d$ is

$$I(d, t) = \int d\tau\, de\, dK\, p(\tau, e, K|d, t, D, I) \log[p(\tau, e, K|d, t, D, I)]$$
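The Monte Carlo form of the predictive density is a short mixture average over the posterior draws. A minimal sketch, reusing samples, rv_curve, and sigma from the sketches above (the thinning factor is my own choice to keep it cheap):

```python
# Posterior predictive density p(d | t, D, I) as a mixture over posterior draws.
def predictive_pdf(d_new, t_new, thin=100):
    draws = samples[::thin]
    v = np.array([rv_curve(np.atleast_1d(t_new), *th)[0] for th in draws])
    norm = sigma * np.sqrt(2.0 * np.pi)
    return np.mean(np.exp(-0.5 * (d_new - v)**2 / sigma**2) / norm)
```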

Average over the unknown datum value. The expected information is

$$EI(t) = \int dd\, p(d|t, D, I)\, I(d, t)$$

The width of the noise distribution is independent of the value of the signal, so

$$EI(t) = -\int dd\, p(d|t, D, I) \log[p(d|t, D, I)] + \text{const}$$

This is "maximum entropy sampling" (Sebastiani & Wynn 1997, 2000). Evaluate it by Monte Carlo using posterior samples and data samples, and pick $t$ to maximize $EI(t)$.
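A minimal sketch of the Monte Carlo evaluation, again reusing samples, rv_curve, sigma, and rng from the sketches above; the candidate time grid is my own choice. For each candidate $t$ we simulate data from the predictive distribution, score them under the same mixture density, and average.

```python
# Maximum entropy sampling: pick the observing time whose predictive
# distribution has the largest entropy, EI(t) = -E[log p(d | t, D, I)].
def expected_info(t_new, n_mc=500, thin=40):
    draws = samples[::thin]
    v = np.array([rv_curve(np.atleast_1d(t_new), *th)[0] for th in draws])
    # Draw d ~ p(d | t, D, I): pick a posterior draw's prediction, add noise
    d_sim = rng.choice(v, n_mc) + rng.normal(0.0, sigma, n_mc)
    # Mixture predictive density evaluated at each simulated datum
    norm = sigma * np.sqrt(2.0 * np.pi)
    dens = np.mean(np.exp(-0.5 * (d_sim[:, None] - v[None, :])**2 / sigma**2),
                   axis=1) / norm
    return -np.mean(np.log(dens))

t_grid = np.linspace(1600.0, 2400.0, 81)    # candidate epochs after cycle 1
t_next = t_grid[np.argmax([expected_info(tt) for tt in t_grid])]
print("next observation at t =", t_next)
```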

Design Results: Predictions, Entropy

[Figure: the predictive distribution for future data and the expected information vs. candidate sampling time.]

Cycle 2: Observation

Cycle 2: Inference

Cycle 2: Design

Evolution of Inferences

[Figure panels: Cycle 1 (10 samples); Cycle 2 (11 samples).]

[Figure panels: Cycle 3 (12 samples); Cycle 4 (13 samples).]

Challenges

Evolving goals for inference: the goal may originally be detection (model comparison), and later estimation. How are these related? How and when should one switch?

Generalizing the utility function: the cost of a sample vs. time, or the costs of samples of different sizes, could enter the utility. How many bits is an observation worth?

Computational algorithms: are there MCMC algorithms uniquely suited to adaptive exploration? When is it smart to linearize?

Design for the setup cycle: what should the size of a setup sample be? Can the same algorithms be used for setup design? When is it worth the effort?

Key Ideas

Sample space integrals are useful in a Bayesian setting.

Long run behavior of Bayesian methods:
- Bayesian methods are calibrated.
- Parametric Bayesian methods have good frequentist behavior.
- Bayes may be the best way to be frequentist!

Bayesian adaptive exploration:
- Can provide dramatic benefits in nonlinear settings.
- Many challenges and open questions remain.