Bayesian inference: what it means and why we care


Bayesian inference: what it means and why we care
Robin J. Ryder
Centre de Recherche en Mathématiques de la Décision, Université Paris-Dauphine
Mathematical Coffees, 6 November 2017

The aim of Statistics
In Statistics, we generally care about inferring information about an unknown parameter θ. For instance, we observe X_1, ..., X_n ~ N(θ, 1) and wish to:
- Obtain a (point) estimate θ̂ of θ, e.g. θ̂ = 1.3.
- Measure the uncertainty of our estimator by obtaining an interval or region of plausible values, e.g. [0.9, 1.5] is a 95% confidence interval for θ.
- Perform model choice/hypothesis testing, e.g. decide between H_0: θ = 0 and H_1: θ ≠ 0, or between H_0: X_i ~ N(θ, 1) and H_1: X_i ~ Exp(θ).
- Use this inference in postprocessing: prediction, decision-making, input of another model...

Why be Bayesian?
Some application areas make heavy use of Bayesian inference, because:
- The models are complex
- Estimating uncertainty is paramount
- The output of one model is used as the input of another
- We are interested in complex functions of our parameters

Frequentist statistics
Statistical inference deals with estimating an unknown parameter θ given some data D.
In the frequentist view of statistics, θ has a true, fixed (deterministic) value. Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80; 120].
Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)?
L(θ | D) = P[D | θ]
θ̂ = argmax_θ L(θ | D)
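As a minimal illustration of this frequentist recipe (not part of the original slides), the sketch below computes the MLE and a 95% confidence interval for θ in the N(θ, 1) model; the data are simulated, so all numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.2                       # only used to simulate data; unknown in practice
x = rng.normal(theta_true, 1.0, size=100)

# In the N(theta, 1) model the MLE of theta is the sample mean
theta_hat = x.mean()

# 95% confidence interval: theta_hat +/- 1.96 / sqrt(n), since sigma = 1 is known
half_width = 1.96 / np.sqrt(len(x))
print(f"MLE: {theta_hat:.2f}, 95% CI: [{theta_hat - half_width:.2f}, {theta_hat + half_width:.2f}]")
```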

Bayes rule
Recall Bayes rule: for two events A and B, we have
P[A | B] = P[B | A] P[A] / P[B].
Alternatively, with marginal and conditional densities:
π(y | x) = π(x | y) π(y) / π(x).

Bayesian statistics
In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution.
Before I see any data, I have a prior distribution π(θ) on θ, usually uninformative. Once I take the data into account, I get a posterior distribution, which is hopefully more informative.
By Bayes rule,
π(θ | D) = π(D | θ) π(θ) / π(D).
By definition, π(D | θ) = L(θ | D). The quantity π(D) is a normalizing constant with respect to θ, so we usually do not include it and write instead
π(θ | D) ∝ π(θ) L(θ | D).
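To make the proportionality concrete, here is a small numerical sketch (not from the talk): it evaluates prior × likelihood on a grid of θ values and normalizes at the end, which is exactly the role played by π(D). The N(0, 10²) prior and the simulated data are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(1.2, 1.0, size=20)                 # data: X_i ~ N(theta, 1)

theta = np.linspace(-3.0, 5.0, 2001)              # grid of theta values
prior = norm.pdf(theta, loc=0.0, scale=10.0)      # vague N(0, 10^2) prior (illustrative)
log_lik = norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)

unnormalized = prior * np.exp(log_lik - log_lik.max())    # pi(theta) * L(theta | D), up to a constant
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))   # normalize numerically
```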

Bayesian statistics
π(θ | D) ∝ π(θ) L(θ | D)
Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little.
We are now allowed to make probability statements about θ, such as "there is a 95% probability that θ belongs to the interval [78; 119]" (credible interval).

Advantages and drawbacks of Bayesian statistics
- More intuitive interpretation of the results
- Easier to think about uncertainty
- In a hierarchical setting, it becomes easier to take into account all the sources of variability
- Prior specification: need to check that changing your prior does not change your result
- Computationally intensive

Example: Bernoulli
Take X_i ~ Bernoulli(θ), i.e.
P[X_i = 1] = θ,  P[X_i = 0] = 1 − θ.
Possible prior: θ ~ U([0, 1]), i.e. π(θ) = 1 for 0 ≤ θ ≤ 1.
Likelihood:
L(θ | X_i) = θ^{X_i} (1 − θ)^{1 − X_i}
L(θ | X_1, ..., X_n) = θ^{Σ X_i} (1 − θ)^{n − Σ X_i} = θ^{S_n} (1 − θ)^{n − S_n}, with S_n = Σ_{i=1}^n X_i.
Posterior:
π(θ | X_1, ..., X_n) ∝ 1 · θ^{S_n} (1 − θ)^{n − S_n}
We can compute the normalizing constant analytically:
π(θ | X_1, ..., X_n) = (n + 1)! / (S_n! (n − S_n)!) · θ^{S_n} (1 − θ)^{n − S_n}

Conjugate prior
Suppose we take the prior θ ~ Beta(α, β):
π(θ) = Γ(α + β) / (Γ(α) Γ(β)) · θ^{α − 1} (1 − θ)^{β − 1}.
Then the posterior verifies
π(θ | X_1, ..., X_n) ∝ θ^{α − 1} (1 − θ)^{β − 1} · θ^{S_n} (1 − θ)^{n − S_n},
hence
θ | X_1, ..., X_n ~ Beta(α + S_n, β + n − S_n).
Whatever the data, the posterior is in the same family as the prior: we say that the prior is conjugate for this model. This is very convenient mathematically.
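A minimal sketch of this conjugate update in code (not from the talk; the prior parameters and counts are illustrative):

```python
from scipy.stats import beta

alpha0, beta0 = 1.0, 1.0            # Beta(1, 1) = uniform prior; any positive values work
S_n, n = 72, 100                    # observed successes and number of trials

posterior = beta(alpha0 + S_n, beta0 + n - S_n)    # Beta(alpha + S_n, beta + n - S_n)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```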

Jeffreys prior
Another possible default prior is the Jeffreys prior, which is invariant under a change of variables.
Let l be the log-likelihood and I be Fisher's information:
I(θ) = E_{X ~ P_θ}[ (dl/dθ)² ] = − E_{X ~ P_θ}[ d²l(θ; X)/dθ² ].
The Jeffreys prior is defined by
π(θ) ∝ √I(θ).
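For instance (a worked example not spelled out on the slide), in the Bernoulli model l(θ; X) = X log θ + (1 − X) log(1 − θ), so d²l/dθ² = −X/θ² − (1 − X)/(1 − θ)², and taking expectations (E[X] = θ) gives I(θ) = 1/(θ(1 − θ)). The Jeffreys prior is therefore π(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}, i.e. a Beta(1/2, 1/2) distribution, which is also conjugate in this model.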

Invariance of the Jeffreys prior
Let φ be an alternative parameterization of the model. Then the prior induced on φ by the Jeffreys prior on θ is
π(φ) = π(θ) |dθ/dφ|
     ∝ √I(θ) · |dθ/dφ|
     = √( E[(dl/dθ)²] (dθ/dφ)² )
     = √( E[(dl/dθ · dθ/dφ)²] )
     = √( E[(dl/dφ)²] ) = √I(φ),
which is the Jeffreys prior on φ.

Effect of prior
Example: Bernoulli model (biased coin), with θ = probability of success. We observe S_n = 72 successes out of n = 100 trials.
Frequentist estimate: θ̂ = 0.72, with 95% confidence interval [0.63; 0.81].
Bayesian estimate: will depend on the prior.
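A sketch of this prior-sensitivity check in code (the three priors below are illustrative choices, not necessarily the ones plotted in the talk):

```python
from scipy.stats import beta

S_n, n = 72, 100
priors = {"Beta(1, 1) (uniform)": (1, 1),
          "Beta(10, 10) (centred on 0.5)": (10, 10),
          "Beta(2, 20) (sceptical of success)": (2, 20)}

for name, (a, b) in priors.items():
    post = beta(a + S_n, b + n - S_n)           # conjugate update
    lo, hi = post.interval(0.95)
    print(f"{name}: posterior mean {post.mean():.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```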

Effect of prior
Figure: prior (black), likelihood (green) and posterior (red) under three different priors, for S_n = 72, n = 100.

Effect of prior
Figure: prior (black), likelihood (green) and posterior (red) under three different priors, for S_n = 7, n = 10.

Effect of prior
Figure: prior (black), likelihood (green) and posterior (red) under three different priors, for S_n = 721, n = 1000.

Choosing the prior
The choice of the prior distribution can have a large impact, especially if the data are of small to moderate size. How do we choose the prior?
- Expert knowledge of the application
- A previous experiment
- A conjugate prior, i.e. one that is convenient mathematically, with moments chosen by expert knowledge
- A non-informative prior
- ...
In all cases, the best practice is to try several priors and to see whether the posteriors agree: would the data be enough to make experts who disagreed a priori come to an agreement?

Example: phylogenetic tree
Example from Ryder & Nicholls (2011). Given lexical data, we wish to infer the age of the Most Recent Common Ancestor of the Indo-European languages.
Two main hypotheses:
- Kurgan hypothesis: root age is 6000-6500 years Before Present (BP)
- Anatolian hypothesis: root age is 8000-9500 years BP

Example of a tree

Why be Bayesian in this setting?
- Our model is complex and the likelihood function is not pleasant
- We are interested in the marginal distribution of the root age
- Many nuisance parameters: tree topology, internal ages, evolution rates...
- We want to make sure that our inference procedure does not favour one of the two hypotheses a priori
- We will use the output as input of other models
For the root age, we choose a prior U([5000, 16000]). The prior for the other parameters is beyond the scope of this talk.

Model parameters
The parameter space is large:
- Root age R
- Tree topology and internal ages g (complex state space)
- Evolution parameters λ, µ, ρ, κ...
The posterior distribution is defined by
π(R, g, λ, µ, ρ, κ | D) ∝ π(R) π(g) π(λ, µ, ρ, κ) L(R, g, λ, µ, ρ, κ | D).
We are interested in the marginal distribution of R given the data D:
π(R | D) = ∫ π(R, g, λ, µ, ρ, κ | D) dg dλ dµ dρ dκ.

Computation
This distribution is not available analytically, nor can we sample from it directly. But we can build a Markov Chain Monte Carlo scheme (see Jalal's talk) to get a sample from the joint posterior distribution of (R, g, λ, µ, ρ, κ) given D. Keeping only the R component then gives us a sample from the marginal posterior distribution of R given D.
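The actual sampler over trees is involved; purely as a generic illustration, here is a random-walk Metropolis-Hastings sketch on a two-dimensional toy posterior, showing how keeping one coordinate of the joint sample yields a marginal posterior sample. The toy target and tuning constants are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_post(params):
    """Toy log-posterior over (R, lam); a stand-in for the intractable tree posterior."""
    R, lam = params
    return -0.5 * ((R - 8000.0) / 1000.0) ** 2 - 0.5 * ((lam - 1.0) / 0.2) ** 2

current = np.array([7000.0, 0.8])
step = np.array([400.0, 0.1])             # random-walk proposal scales
samples = []

for _ in range(50_000):
    proposal = current + step * rng.standard_normal(2)
    # Accept with probability min(1, pi(proposal | D) / pi(current | D))
    if np.log(rng.uniform()) < log_post(proposal) - log_post(current):
        current = proposal
    samples.append(current.copy())

R_samples = np.array(samples)[:, 0]       # keep only R: a sample from the marginal posterior of R
print("posterior mean of R:", round(R_samples.mean()))
```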

Root age: prior
Figure: prior density of the age of Proto-Indo-European (in years Before Present), with the ranges corresponding to the Kurgan and Anatolian hypotheses marked.

Root age: posterior
Figure: prior and posterior densities of the age of Proto-Indo-European (in years Before Present), with the ranges corresponding to the Kurgan and Anatolian hypotheses marked.

Phylogenetic tree of languages: conclusions
- Strong support for the Anatolian hypothesis; no support for the Kurgan hypothesis
- Measuring the uncertainty of the root age estimate is key
- We integrate out the uncertainty on the nuisance parameters
- This setting is much easier to handle in the Bayesian framework than in the frequentist one
- Computational aspects are complex

Air France Flight 447
This section and its figures are after Stone et al. (Statistical Science, 2014).
Air France Flight 447 disappeared over the Atlantic on 1 June 2009, en route from Rio de Janeiro to Paris; all 228 people on board were killed. The first three search parties did not succeed in retrieving the wreckage or the flight recorders. In 2011, a fourth party was launched, based on a Bayesian search.
Figure: Flight route. Picture by Mysid, Public Domain.

Why be Bayesian?
- Many sources of uncertainty
- Subjective probabilities
- The object of interest is a distribution
- The frequentist formalism does not apply (unique event)

Previous searches

Prior based on flight dynamics

Probabilities derived from drift

Posterior

Conclusions
- Once the posterior distribution was derived, the search was organized starting with the areas of highest posterior probability
- In fact there were several posteriors, because several models were considered
- The wreckage was located within one week

Point estimates
Although one of the main purposes of Bayesian inference is getting a distribution, we may also need to summarize the posterior with a point estimate. Common choices:
- Posterior mean: θ̂ = ∫ θ π(θ | D) dθ
- Maximum a posteriori (MAP): θ̂ = argmax_θ π(θ | D)
- Posterior median
- ...
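As a small sketch (reusing the Beta posterior from the Bernoulli example; the numbers are illustrative), these summaries can be computed directly:

```python
import numpy as np
from scipy.stats import beta

post = beta(1 + 72, 1 + 100 - 72)          # posterior from the Beta-Bernoulli example

post_mean = post.mean()                    # posterior mean
post_median = post.median()                # posterior median

# MAP: maximize the posterior density (on a grid here; a closed form also exists for the Beta)
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
post_map = grid[np.argmax(post.pdf(grid))]

print(f"mean {post_mean:.3f}, median {post_median:.3f}, MAP {post_map:.3f}")
```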

Optimality
From a frequentist point of view, the posterior expectation is optimal in a certain sense. Let θ be the true value of the parameter of interest, and θ̂(X) an estimator. Then the posterior mean minimizes the expectation of the squared error under the prior:
E_π[ ‖θ − θ̂(X)‖² ].
For this reason, the posterior mean is also called the minimum mean square error (MMSE) estimator. For other loss functions, other point estimates are optimal.
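A one-line justification (not spelled out on the slide): for fixed data D and any candidate value a, E[(θ − a)² | D] = Var(θ | D) + (E[θ | D] − a)², which is minimized by taking a = E[θ | D], i.e. the posterior mean; averaging over D then shows that the posterior mean minimizes the expected squared error above.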

2D Ising models
Figure: (a) original image; (b) focused region of the image.

2D Ising model
Higdon (JASA 1998). Target density: consider a 2D Ising model, with posterior density
π(x | y) ∝ exp( α Σ_i 1[y_i = x_i] + β Σ_{i ~ j} 1[x_i = x_j] )
with α = 1, β = 0.7.
The first term (likelihood) encourages states x which are similar to the original image y. The second term (prior), a sum over neighbouring pixels i ~ j, favours states x for which neighbouring pixels are equal, like a Potts model.
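A sketch of a single-site Metropolis sampler targeting this density (this is not the Wang-Landau algorithm discussed next, and the image here is random noise purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta_coef = 1.0, 0.7

y = rng.integers(0, 2, size=(20, 20))      # observed binary image (random noise for illustration)
x = y.copy()                               # current state of the sampler

def flip_log_ratio(x, y, i, j):
    """Change in log pi(x | y) if pixel (i, j) is flipped."""
    new = 1 - x[i, j]
    delta = alpha * (int(y[i, j] == new) - int(y[i, j] == x[i, j]))
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-nearest neighbours
        ni, nj = i + di, j + dj
        if 0 <= ni < x.shape[0] and 0 <= nj < x.shape[1]:
            delta += beta_coef * (int(x[ni, nj] == new) - int(x[ni, nj] == x[i, j]))
    return delta

for sweep in range(100):                   # Metropolis sweeps over all pixels
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if np.log(rng.uniform()) < flip_log_ratio(x, y, i, j):
                x[i, j] = 1 - x[i, j]      # accept the flip
```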

2D Ising models: posterior exploration
Figure: Spatial model example: states explored over 500,000 iterations for the Metropolis-Hastings (top) and Wang-Landau (bottom) algorithms. Figure from Bornn et al. (JCGS 2013). See also Jacob & Ryder (AAP 2014) for more on the algorithm.

2D Ising models: posterior mean
Figure: Spatial model example: average state explored with Wang-Landau after importance sampling. Figure from Bornn et al. (JCGS 2013). See also Jacob & Ryder (AAP 2014) for more on the algorithm.

Ising model: conclusions
- Problem-specific prior
- Even with a point estimate (posterior mean), we measure uncertainty
- Computational cost is very high

Several models
Suppose we have several models m_1, m_2, ..., m_k. Then the model index can be viewed as a parameter. Take a uniform (or other) prior:
P[M = m_j] = 1/k.
The posterior distribution then gives us the probability associated with each model given the data. We can use this for model choice (though there are other, more sophisticated techniques), but also for estimation/prediction while integrating out the uncertainty on the model.
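A minimal sketch of posterior model probabilities for coin-flip data (my own toy example, not from the talk): model m_1 says θ = 1/2 exactly, model m_2 puts a uniform prior on θ; the binomial coefficient is common to both marginal likelihoods and cancels.

```python
import numpy as np
from scipy.special import betaln

S_n, n = 72, 100

# Log marginal likelihoods (up to the common binomial coefficient)
log_m1 = n * np.log(0.5)                         # m1: theta = 1/2 exactly
log_m2 = betaln(S_n + 1, n - S_n + 1)            # m2: integral of theta^S (1-theta)^(n-S) dtheta

# Uniform prior over models, P[M = m_j] = 1/2, so posterior weights are
# proportional to the marginal likelihoods
logs = np.array([log_m1, log_m2])
weights = np.exp(logs - logs.max())
weights /= weights.sum()
print("P[m1 | D] =", round(weights[0], 4), " P[m2 | D] =", round(weights[1], 4))
```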

Example: variable selection for linear regression
A model is a choice of covariates to include in the regression. With p covariates, there are 2^p models.
Classical (frequentist) setting:
- Select variables, using your favourite penalty, thus selecting one model
- Perform estimation and prediction within that model
- If you want error bars, you can compute them, but only within that model
Bayesian setting:
- Explore the space of all models
- Get posterior probabilities
- Compute estimation and prediction for each model (or, in practice, for those with non-negligible probability)
- Weight these estimates/predictions by the posterior probability of each model
The uncertainty about the model is thus fully taken into account.
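A rough sketch of this enumeration for a small p (my own toy example; it approximates each model's marginal likelihood by exp(−BIC/2) under a uniform model prior, which is only a crude stand-in for a full Bayesian treatment with proper priors on the coefficients):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.standard_normal(n)   # covariates 0 and 2 matter

models = []
for k in range(p + 1):
    for subset in itertools.combinations(range(p), k):       # enumerate all 2^p models
        design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(((y - design @ coef) ** 2).sum())
        bic = n * np.log(rss / n) + design.shape[1] * np.log(n)
        models.append((subset, bic))

# Approximate posterior model probabilities under a uniform prior over models
bics = np.array([bic for _, bic in models])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()
for (subset, _), w in zip(models, weights):
    print(f"covariates {subset}: weight {w:.3f}")
```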

Conclusions
- Bayesian inference is a powerful tool to fully take into account all sources of uncertainty
- Difficulty of prior specification
- Computational issues are the main hurdle