Why Try Bayesian Methods? (Lecture 5)


Why Try Bayesian Methods? (Lecture 5)
Tom Loredo, Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/bayes/

Today's Lecture

- Problems you avoid:
  - Ambiguity in what is random
  - Recognizable subsets
  - The nuisance of nuisance parameters
  - Misleading measures of significance
  - Irrelevance of long-run behavior/sample averages
- Advantages you gain
- Foundations

The Randomness of Randomness

Theory (H0): The number of A stars in a cluster should be 0.1 of the total.

Observations: 5 A stars found out of 96 total stars observed.

Theorist's analysis: Calculate χ² using n_A = 9.6 and n_X = 86.4. The significance level is p(>χ² | H0) = 0.12 (or 0.07 using the more rigorous binomial tail area). Theory is accepted.

Observer's analysis: The actual observing plan was to keep observing until 5 A stars were seen! The random quantity is N_tot, not n_A; it should follow the negative binomial distribution. Expect N_tot = 50 ± 21; p(>χ² | H0) = 0.03. Theory is rejected.

Telescope technician's analysis: A storm was coming in, so the observations would have ended whether 5 A stars had been seen or not. The proper ensemble should take into account p(storm)...

Bayesian analysis: The Bayes factor is the same for the binomial and negative binomial likelihoods, and slightly favors H0. Include p(storm) if you want; it will drop out!
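A minimal sketch of these numbers with SciPy (my code, not from the slides; the uniform prior on the A-star fraction f used for the Bayes factor is an illustrative choice, since the slide does not specify the alternative):

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

f0 = 0.1            # H0: expected fraction of A stars
n_A, N_tot = 5, 96  # observed counts

# Stopping at a fixed N_tot: binomial tail area
p_binom = stats.binom.cdf(n_A, N_tot, f0)

# Stopping after n_A A stars: negative binomial tail area
# (SciPy's nbinom counts the non-A stars seen before the n_A-th A star)
p_negbin = stats.nbinom.sf(N_tot - n_A - 1, n_A, f0)

# Bayes factor for H0 vs H1: f ~ Uniform(0, 1). Both likelihoods are
# proportional to f^5 (1 - f)^91, so the combinatorial prefactors cancel
# and B comes out identical under either stopping rule.
logB01 = n_A * np.log(f0) + (N_tot - n_A) * np.log(1 - f0) \
         - betaln(n_A + 1, N_tot - n_A + 1)

print(f"binomial p = {p_binom:.3f}, negative binomial p = {p_negbin:.3f}")
print(f"B(H0/H1) = {np.exp(logB01):.2f}")   # ~4: slightly favors H0
```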

Recognizable Subsets

[Figure: the event-time density p(t), rising abruptly at t0 and decaying]

Goal: Locate the start time, t0, of a burst of events with an abrupt start and a decay time τ = 1 s.

p(t | t0) = (1/τ) exp[−(t − t0)/τ]  for t ≥ t0

Frequentist method of moments: Note that ∫ t p(t | t0) dt = t0 + τ, so an unbiased estimator for t0 is

t̂0 = (1/N) Σ_{i=1}^{N} (t_i − τ)

Also, p(t̂0 | t0) is analytic → find confidence intervals.

Suppose {t_i} = {12, 14, 16}. Find t̂0 = 13; 12.15 < t0 < 13.83 (90%).

But t̂0 and the entire interval lie after the first event! One can show that the confidence region will not include the true value 100% of the time in the subset of samples that have t̂0 > t1 + 0.85, and we can tell from the data whether or not any particular sample lies in this subset.
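A Monte Carlo sketch of that claim (my code; the 90% interval offsets (−0.85, +0.83) are read off the worked example above, and the true t0 is set to 0):

```python
import numpy as np

rng = np.random.default_rng(1)
N, tau, t0_true = 3, 1.0, 0.0
t = t0_true + rng.exponential(tau, size=(100_000, N))   # event times

t0_hat = t.mean(axis=1) - tau        # unbiased moment estimator
t1 = t.min(axis=1)                   # earliest event in each sample
covered = (t0_hat - 0.85 < t0_true) & (t0_true < t0_hat + 0.83)
subset = t0_hat > t1 + 0.85          # recognizable from the data alone

print(f"overall coverage: {covered.mean():.3f}")   # ~0.90, as advertised
print(f"fraction in subset: {subset.mean():.3f}, "
      f"coverage there: {covered[subset].mean():.3f}")   # exactly 0
```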

Bayesian solution: The product of the event-time probabilities and a flat prior gives

p(t0 | {t_i}, I) = N exp[N(t0 − t1)]  for t0 ≤ t1
                 = 0                  for t0 > t1

where t1 is the earliest event time. Summaries: mode t̂0 = 12; mean ⟨t0⟩ = 11.66; the 90% HPD region is 11.23 < t0 < 12.0.
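These summaries follow in closed form from the exponential posterior above; a quick sketch (τ = 1):

```python
import numpy as np

t = np.array([12.0, 14.0, 16.0])
N, tau = len(t), 1.0
t1 = t.min()                            # posterior support ends at t1

mode = t1                               # density is maximal at t1
mean = t1 - tau / N                     # mean of the exponential posterior
hpd_lo = t1 + tau * np.log(0.10) / N    # 90% HPD runs from here up to t1

print(f"mode = {mode}, mean = {mean:.2f}, 90% HPD: ({hpd_lo:.2f}, {t1})")
# mode = 12.0, mean = 11.67, 90% HPD: (11.23, 12.0)
```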

General Problem: Marginals vs. Profiles

Measure several quantities with an uncalibrated instrument that adds Gaussian noise with unknown but constant σ. Learn about σ by pooling the data.

Example, pairs of measurements: Make 2 measurements (x_i, y_i) for each of N quantities µ_i.

L({µ_i}, σ) = ∏_i [1/(σ√(2π))] exp[−(x_i − µ_i)²/(2σ²)] · [1/(σ√(2π))] exp[−(y_i − µ_i)²/(2σ²)]

The frequentist approach uses the profile likelihood L_p(σ) = max_{µ_i} L({µ_i}, σ).

But p(σ | D) and L_p(σ) differ dramatically!
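A sketch of how dramatically (my code). From the likelihood above, with S = Σ_i (x_i − y_i)², profiling over the µ_i gives a peak at σ = √(S/4N), while marginalizing with flat priors on the µ_i gives a posterior peaking at σ = √(S/2N). Since E[S] = 2Nσ², only the marginal is consistent; the µ_i values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
N, sigma_true = 1000, 1.0
mu = rng.uniform(0.0, 10.0, N)             # the N unknown quantities
x = mu + rng.normal(0.0, sigma_true, N)    # first measurement of each
y = mu + rng.normal(0.0, sigma_true, N)    # second measurement of each

S = np.sum((x - y) ** 2)
print(f"profile likelihood peak: {np.sqrt(S / (4 * N)):.3f}")  # ~0.71, biased low
print(f"marginal posterior peak: {np.sqrt(S / (2 * N)):.3f}")  # ~1.00
```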

Joint & Marginal Results for σ = 1

[Figure: joint and marginal distributions for σ, true value σ = 1]

A Simple Significance Test

Model: x_i = µ + ε_i (i = 1 to n), with ε_i ~ N(0, σ²)

Null hypothesis H0: µ = µ0 = 0

Test statistic: t(x) = x̄ / (σ/√n)

The Significance of Significance

Collect the α values from a large number of tests in situations where the truth eventually became known, and determine how often H0 is true at various α levels.

- Suppose that, overall, H0 was true about half of the time.
- Focus on the subset with t ≈ 2 (say, t ∈ [1.95, 2.05], so α ∈ [0.04, 0.05]), so that H0 was rejected at the 0.05 level.
- Find out how many times in that subset H0 turned out to be true.
- Do the same for other significance levels.

A Monte Carlo experiment:

- Choose µ = 0 OR µ ~ N(0, 4σ²) with a fair coin flip
- Simulate n = 20 data points, x_i ~ N(µ, σ²)
- Calculate t_obs = x̄/(σ/√n) and α(t_obs) = P(t > t_obs | µ = 0)
- Bin α(t) separately for each hypothesis; repeat
- Compare how often the two hypotheses produce data with a 2σ or 3σ effect (see the sketch below).
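A sketch of this experiment (my implementation; I read α as the two-sided tail area, which matches the t ≈ 2 ⇒ α ∈ [0.04, 0.05] figures above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, trials = 20, 1.0, 200_000

null = rng.random(trials) < 0.5                    # fair coin: is H0 true?
mu = np.where(null, 0.0, rng.normal(0.0, 2 * sigma, trials))
xbar = rng.normal(mu, sigma / np.sqrt(n))          # sufficient statistic
t_obs = xbar / (sigma / np.sqrt(n))
alpha = 2 * stats.norm.sf(np.abs(t_obs))           # two-sided tail area

# Among "2 sigma" results (alpha in [0.04, 0.05]), how often is H0 true?
sel = (alpha > 0.04) & (alpha < 0.05)
print(f"P(H0 true | alpha in [0.04, 0.05]) ~ {null[sel].mean():.2f}")  # ~0.5!
# The ratio of the two binned alpha histograms is the Bayes factor that
# appears on the "A Bayesian Look at Significance" slide below.
```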

Significance Level Frequencies, n = 20

[Figure: binned α frequencies under the null hypothesis and the alternative]

Significance Level Frequencies, n = 200

[Figure: as above, for n = 200]

Significance Level Frequencies, n = 2000

[Figure: as above, for n = 2000]

What about another µ prior?

For data sets with H0 rejected at α ≈ 0.05, H0 will be true at least 23% of the time (and typically close to 50%). (Edwards et al. 1963; Berger & Sellke 1987)

At α ≈ 0.01, H0 will be true at least 7% of the time (and typically close to 15%).

What about a different true null frequency? If the null is initially true 90% of the time (as has been estimated in some disciplines), then for data producing α ≈ 0.05 the null is true at least 72% of the time, and typically over 90%.

In addition... At a fixed α, the proportion of rejections of H0 that are false grows with sample size (the posterior odds favoring H0 grow as √n). (Jeffreys 1939; Lindley 1957)

Similar results hold generically, e.g., for χ². (Delampady & Berger 1990)
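To put numbers on the √n effect (my sketch, using the alternative µ ~ N(0, 4σ²) from the Monte Carlo above; the constant depends on that choice): at a fixed "2σ" result, the Bayes factor favoring H0 has a closed form and grows like √n.

```python
import numpy as np

def bayes_factor_H0(t, n):
    """B(H0/H1) for xbar = t*sigma/sqrt(n), with H1: mu ~ N(0, 4 sigma^2).

    Follows from xbar | H0 ~ N(0, sigma^2/n) and
    xbar | H1 ~ N(0, 4 sigma^2 + sigma^2/n).
    """
    return np.sqrt(4 * n + 1) * np.exp(-0.5 * t**2 * 4 * n / (4 * n + 1))

for n in (20, 200, 2000):
    B = bayes_factor_H0(2.0, n)                 # a fixed "2 sigma" result
    print(f"n = {n:5d}: B(H0/H1) = {B:6.2f}, P(H0 | data) = {B / (1 + B):.2f}")
# A result that "rejects H0 at ~0.05" increasingly supports H0 as n grows.
```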

- Significance is not an easily interpretable measure of the weight of evidence against the null.
- Significance does not accurately measure how often the null will be wrongly rejected among similar data sets.
- The obvious (and recommended!) interpretation overestimates the evidence.
- For fixed significance, the weight of the evidence decreases with increasing sample size.

A History of Criticism

Significance tests have been under suspicion since their creation:

- "Some difficulties of interpretation encountered in the application of the chi-square test" (Berkson 1938)
- Jeffreys 1939ff: the √n effect (the Jeffreys-Lindley paradox)
- W. L. Thompson's online bibliography of works criticizing significance tests contains > 300 entries spanning a dozen disciplines, e.g.:
  - "Significance tests die hard: the amazing persistence of a probabilistic misconception"
  - "Needed: A ban on the significance test"
  - "The insignificance of statistical significance"

H. Jeffreys, addressing an audience of statisticians:

"For n from about 10 to 500 the usual result is that K = 1 when (a − a0)/s_a is about 2... not far from the rough rule long known to astronomers, i.e. that differences up to twice the standard error usually disappear when more or better observations become available...

I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the [observed] trial value was improbable; that is, that it has not predicted something that has not happened. As an argument the astronomer's experience is far better." (Jeffreys 1980)

A Bayesian Look at Significance

B ≡ p({x_i} | H1) / p({x_i} | H0) = p(α_obs | H1) / p(α_obs | H0)

B is just the ratio calculated in the Monte Carlo!

Why is significance a poor measure of the weight of evidence?

- We should be comparing hypotheses, not trying to identify rare events.
- The comparison should use the actual data, not merely membership of the data in some larger set.
- The significance level conditions on incomplete information.

The Weatherman

Joint frequencies of actual & predicted weather:

                    Actual
  Prediction     Rain    Sun
  Rain           1/4     1/2
  Sun            0       1/4

The forecaster is right only 50% of the time. An observer notes that a prediction of Sun every day would be right 75% of the time, and applies for the forecaster's job. Should he get the job?

Weatherman: "You'll never be in an unpredicted rain."
Observer: "You'll be in an unpredicted rain 1 day out of 4."

The value of an inference lies in its usefulness in the individual case. Long-run performance is not an adequate criterion for assessing the usefulness of inferences.

What you avoid

- Hidden subjectivity/arbitrariness
- Dependence on stopping rules
- Recognizable subsets
- Defining the number of independent trials in searches
- Inconsistency & incoherence (e.g., inadmissible estimators)
- Inconsistency with prior information
- Complexity of interpretation (e.g., significance vs. sample size)

What you get

- Probabilities for hypotheses:
  - Straightforward interpretation
  - Identify weak experiments
  - Crucial for global (hierarchical) analyses (e.g., population studies)
- Forces the analyst to be explicit about assumptions
- Handles nuisance parameters
- Valid for all sample sizes
- Handles multimodality
- A quantitative Occam's razor
- Model comparison for more than two alternatives; the models needn't be nested

And there's more...

- Use prior information / combine experiments
- Systematic error is treatable
- Straightforward experimental design
- Good frequentist properties:
  - Consistent
  - Calibrated: e.g., if you choose a model only when the odds exceed 100, you will be right 99% of the time (see the sketch below)
  - Coverage as good as or better than common methods
- Unity/simplicity
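A calibration sketch for the odds > 100 claim (my construction, reusing the earlier toy setup: H0: µ = 0 vs H1: µ ~ N(0, 4σ²), even prior odds, n = 20):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, sigma, trials = 20, 1.0, 500_000

null = rng.random(trials) < 0.5                 # fair coin: is H0 true?
mu = np.where(null, 0.0, rng.normal(0.0, 2 * sigma, trials))
xbar = rng.normal(mu, sigma / np.sqrt(n))

# Marginal likelihoods of xbar under each hypothesis
p0 = stats.norm.pdf(xbar, 0.0, sigma / np.sqrt(n))
p1 = stats.norm.pdf(xbar, 0.0, np.sqrt(4 * sigma**2 + sigma**2 / n))

odds0 = p0 / p1                                 # posterior odds = Bayes factor here
decided = np.maximum(odds0, 1 / odds0) > 100    # choose only at odds > 100
correct = np.where(odds0 > 1, null, ~null)      # did we pick the true hypothesis?

print(f"decided in {decided.mean():.1%} of trials; "
      f"accuracy when decided: {correct[decided].mean():.4f}")   # > 0.99
```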

Foundations: Many Ways To Bayes

- Consistency with logic + internal consistency → Bayesian inference (BI) (Cox; Jaynes; Garrett)
- "Coherence"/optimal betting → BI (Ramsey; de Finetti; Wald)
- Avoiding recognizable subsets → BI (Cornfield)
- Avoiding stopping-rule problems → the likelihood principle (Birnbaum; Berger & Wolpert)
- Algorithmic information theory → BI (Rissanen; Wallace & Freeman)
- Optimal information processing → BI (Good; Zellner)

There is probably something to all of this!

What the theorems mean

When reporting numbers ordering hypotheses, the values must be consistent with the calculus of probabilities for hypotheses. Many frequentist methods satisfy this requirement.

Role of priors: Priors are not fundamental! Priors are analogous to initial conditions for ODEs:
- Sometimes crucial
- Sometimes a nuisance

Key Ideas

To make inferences, calculate probabilities for hypotheses.

What's different about this approach:

- Must be explicit about alternatives (think about models instead of statistics)
- Sum/integrate in parameter space rather than sample space

This avoids a variety of frequentist difficulties. This provides many practical benefits. It's what one should do.