Bayesian Statistics: hierarchical models. Sanjib Sharma (University of Sydney)


History of Bayesian statistics

Drop a ball on a table and predict its location. Drop another ball and ask whether it landed to the left or right of the first. Keep doing this; ultimately you can narrow the location down to a small area. Initial belief + new data → improved belief. Likelihood + prior → posterior. The posterior becomes the prior in the next round.

Laplace 1774: independently rediscovered it. In words rather than an equation: the probability of a cause given an event is proportional to the probability of the event given its cause, p(C|E) ∝ p(E|C). His friend Bouvard used the method to calculate the masses of Saturn and Jupiter. Laplace offered bets of 11,000 to 1 and 1,000,000 to 1 that the values were right to within 1% for Saturn and Jupiter respectively. Even now Laplace would have won both bets.

1900-1950: largely ignored after Laplace until about 1950. Theory of Probability (1939) by Harold Jeffreys was the main reference. In WWII it was used at Bletchley Park to decode the German Enigma cipher. There were conceptual difficulties: the role of the prior, and whether the data or the model parameters are treated as random.

1950 onwards: the tide had started to turn in favor of Bayesian methods. The lack of proper tools and computational power was the main hindrance. Frequentist methods were simpler, which made them popular.

Cox 1946 showed that Bayesian probability theory can be derived from two basic rules: p(θ|x) ∝ p(x|θ) p(θ).

1950 onwards Metropolis algorithm.

N interacting particles. A single configuration ω can be completely specified by giving the position and velocity of all the particles: a point in R^2N space. E(ω) is the total energy of the system. For a system in equilibrium, p(ω) ∝ exp(−E(ω)/kT). Computing any thermodynamic property (pressure, energy, etc.) requires integrals which are analytically intractable. Start with an arbitrary configuration of the N particles. Move each by a random walk and compute ΔE, the change in energy between the old and new configuration. If ΔE < 0, always accept; else accept stochastically with probability exp(−ΔE/kT). An immediate hit in statistical physics.
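A minimal Python sketch of the Metropolis rule described above, for a toy one-dimensional configuration; the energy function E(ω) = ω²/2, the step size and the value of kT are illustrative assumptions, not from the slides:

```python
import numpy as np

def metropolis_energy(E, w0, n_steps=10000, step=0.5, kT=1.0, rng=None):
    """Sample configurations w with p(w) ~ exp(-E(w)/kT) using the Metropolis rule."""
    rng = np.random.default_rng() if rng is None else rng
    w, samples = w0, []
    for _ in range(n_steps):
        w_new = w + step * rng.normal()          # random-walk move
        dE = E(w_new) - E(w)                     # change in energy
        if dE < 0 or rng.random() < np.exp(-dE / kT):
            w = w_new                            # accept; otherwise keep the old configuration
        samples.append(w)
    return np.array(samples)

# Toy example: E(w) = w^2/2 gives p(w) ~ exp(-w^2/(2 kT)), a Gaussian of variance kT.
samples = metropolis_energy(lambda w: 0.5 * w**2, w0=0.0, kT=1.0)
print(samples.mean(), samples.var())             # should be close to 0 and 1
```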

Hastings 1970: the same method can be used to sample an arbitrary pdf p(ω) by replacing E(ω)/kT with −ln p(ω). It had to wait until Hastings generalized the algorithm and derived the essential condition that a Markov chain ought to satisfy in order to sample the target distribution. The acceptance ratio is not uniquely specified; other forms exist. His student Peskun (1973) showed that the Metropolis choice gives the fastest mixing rate of the chain.
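With a proposal density q(y|x_t) and target f(x), the standard Metropolis-Hastings acceptance probability (one valid choice among the forms mentioned above) is
α(x_t, y) = min{ 1, [f(y) q(x_t|y)] / [f(x_t) q(y|x_t)] },
which reduces to min{1, f(y)/f(x_t)} for a symmetric proposal.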

[Figure: MH algorithm — proposal density q(y|x_t) around the current point x_t, target density f(x).]

1980s: simulated annealing, Kirkpatrick 1983. Solve combinatorial optimization problems with the MH algorithm, using ideas of annealing from solid-state physics. Useful when there are multiple local optima. Minimize an objective function C(ω) by sampling from exp(−C(ω)/T) with progressively decreasing temperature T, allowing selection of the globally optimum solution.
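A minimal simulated-annealing sketch along these lines; the toy cost function and the geometric cooling schedule are illustrative assumptions:

```python
import numpy as np

def simulated_annealing(C, w0, n_steps=20000, step=0.5, T0=5.0, Tfinal=1e-3, rng=None):
    """Minimize C(w) by Metropolis sampling from exp(-C(w)/T) while slowly lowering T."""
    rng = np.random.default_rng() if rng is None else rng
    w, best = w0, w0
    for i in range(n_steps):
        T = T0 * (Tfinal / T0) ** (i / (n_steps - 1))   # geometric cooling schedule
        w_new = w + step * rng.normal()
        dC = C(w_new) - C(w)
        if dC < 0 or rng.random() < np.exp(-dC / T):
            w = w_new
        if C(w) < C(best):
            best = w
    return best

# Toy cost with many local minima; the global minimum is at w = 0.
cost = lambda w: w**2 + 10 * np.sin(3 * w) ** 2
print(simulated_annealing(cost, w0=4.0))   # should end up near 0
```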

1984: the Expectation-Maximization (EM) algorithm, Dempster 1977, provided a way to deal with missing data and hidden variables, i.e. hierarchical Bayesian models, and vastly increased the range of problems that can be addressed by Bayesian methods. It is deterministic and sensitive to initial conditions, so stochastic versions were developed: data augmentation, Tanner and Wong 1987. Geman and Geman 1984 introduced Gibbs sampling in the context of image restoration: the first proper use of MCMC to solve a problem set up in a Bayesian framework.

Image: Ryan Adams

1990: Gelfand and Smith 1990 are largely credited with the revolution in statistics. They unified the ideas of Gibbs sampling, the data-augmentation algorithm and the EM algorithm, and firmly established that Gibbs sampling and MH-based MCMC algorithms can be used to solve a wide class of problems that fall in the category of hierarchical Bayesian models.

Citation history of Metropolis et al. 1953. Physics: well known from 1970-1990. Statistics: only 1990 onwards. Astronomy: 2002 onwards.

Astronomy's conversion: 2002

Astronomy 1990-2002: Saha & Williams 1994, galaxy kinematics from absorption-line spectra. Christensen & Meyer 1998, gravitational-wave radiation. Christensen et al. 2001 and Knox et al. 2001, cosmological parameter estimation using CMB data. Lewis & Bridle 2002 galvanized the astronomy community more than any other paper.

Lewis & Bridle 2002 laid out the Bayesian MCMC framework in detail, applied it to one of the most important data sets of the time, used it to address a significant scientific question (the fundamental parameters of the universe), and made the code publicly available, making it easier for new entrants.

Bayesian hierarchical models.
p(θ | {x_i}) ∝ p(θ) ∏_i p(x_i | θ)
p(θ, {x_i} | {y_i}) ∝ p(θ) ∏_i p(x_i | θ) p(y_i | x_i, σ_{y,i})
[Graphical model over N objects: Level-0, population θ; Level-1, individual object-intrinsic x_0, x_1, ..., x_n; Level-1, individual object-observable y_0, y_1, ..., y_n.]

Test scores of students from various schools. J schools, n_j students in the j-th school; y_ij is the score of the i-th student in the j-th school; Y = {y_ij : j = 1, ..., J, i = 1, ..., n_j}. The dispersion σ in score is known a priori. What we want to know: the average score of a school and its uncertainty, p(α_j), i.e. the mean of the students in a school (no pooling); and the overall average of schools and its dispersion, (μ, ω), i.e. the global mean without segregating into schools.

Average score of a school and its uncertainty, p(α_j): the mean of the students in a school (no pooling). If n_j is very small, say 10-20, the estimate is very uncertain. Overall average of schools and its dispersion, (μ, ω): the global mean without segregating into schools. If one school has too many students, the mean and spread will be dominated by it.

BHM: some schools have very few students, so the uncertainty is very high: p(α_j | y_.j) ∝ p(y_.j | α_j, σ) p(α_j, σ). There is more information in the data from the other schools: p(α_j | y_.j, μ, ω) ∝ p(y_.j | α_j, σ) p(α_j | μ, ω). But the population statistics depend on the schools; they are interrelated, so we get joint information about the population of schools as well as the individual schools: p(α, μ, ω | Y) ∝ p(μ, ω) ∏_j p(α_j | μ, ω) ∏_i p(y_ij | α_j, σ).

Simulation with global mean μ = 0, ω = 1, J = 40, σ = 1 and 2 < n_j < 20. Blue: standard group mean (based only on data from that group). Green: BHM-based group mean. The green points are systematically closer to the global mean than the blue points, and the shift is larger where the error bars are larger. The BHM estimates have smaller error bars, because the BHM also makes use of information from the other groups.
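A minimal sketch of the shrinkage effect described above, using the stated settings (μ = 0, ω = 1, J = 40, σ = 1, 2 < n_j < 20). For simplicity it conditions on the true (μ, ω) rather than inferring them, so it only illustrates the partial-pooling estimates p(α_j | y, μ, ω); the full BHM would sample μ and ω jointly, e.g. with MCMC:

```python
import numpy as np

rng = np.random.default_rng(42)
J, sigma, mu, omega = 40, 1.0, 0.0, 1.0

# Simulate schools: true means alpha_j ~ N(mu, omega^2), scores y_ij ~ N(alpha_j, sigma^2)
n = rng.integers(3, 20, size=J)                       # 2 < n_j < 20 students per school
alpha_true = rng.normal(mu, omega, size=J)
data = [rng.normal(a, sigma, size=nj) for a, nj in zip(alpha_true, n)]
ybar = np.array([y.mean() for y in data])

# No-pool estimate: group mean with standard error sigma/sqrt(n_j)
nopool_mean, nopool_err = ybar, sigma / np.sqrt(n)

# Partial pooling (conditional on mu, omega): precision-weighted combination of
# the group mean and the population mean -- this is the shrinkage toward mu.
prec = n / sigma**2 + 1.0 / omega**2
bhm_mean = (n * ybar / sigma**2 + mu / omega**2) / prec
bhm_err = 1.0 / np.sqrt(prec)

for j in range(5):   # schools with few students shrink most and have smaller error bars
    print(f"n={n[j]:2d}  no-pool {nopool_mean[j]:+.2f} ± {nopool_err[j]:.2f}   "
          f"BHM {bhm_mean[j]:+.2f} ± {bhm_err[j]:.2f}")
```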

Handling uncertainties.
p(θ, {x_i^t} | {x_i}, {σ_{x,i}}) ∝ p(θ) ∏_i p(x_i^t | θ) p(x_i | x_i^t, σ_{x,i}), with
p(x_i | x_i^t, σ_{x,i}) = Normal(x_i; x_i^t, σ_{x,i}).
[Graphical model over N objects: Level-0, population θ; Level-1, individual object-intrinsic x_0^t, x_1^t, ..., x_n^t; Level-1, individual object-observable x_0, x_1, ..., x_n.]

Kinematic modelling of the Milky Way: p(θ, v_l', v_b', v_r', s' | v_l, v_b, v_r, l, b, s) ∝ p(v_l', v_b', v_r' | θ, l, b, s') p(s | s', σ_s) p(v_r | v_r') p(v_l | v_l') p(v_b | v_b'). 6D data from Gaia.

Missing variables.
p(θ, {x_i^t} | {x_i}, {σ_{x,i}}) ∝ p(θ) ∏_i p(x_i^t | θ) p(x_i | x_i^t, σ_{x,i}), with
p(x_i | x_i^t, σ_{x,i}) = Normal(x_i; x_i^t, σ_{x,i}), where certain observables x_i (and their σ_{x,i}) are missing.
[Graphical model over N objects: Level-0, population θ; Level-1, individual object-intrinsic x_0^t, x_1^t, ..., x_n^t; Level-1, individual object-observable x_0, x_1, ..., x_n.]

Kinematic modelling of the Milky Way: p(θ, v_l', v_b', v_r', s' | v_r, l, b, s) ∝ p(v_l', v_b', v_r' | θ, l, b, s') p(s | s', σ_s) p(v_r | v_r'). Spectroscopic data: no proper motions.

Hidden variables.
p(θ, {x_i} | {y_i}, {σ_{y,i}}) ∝ p(θ) ∏_i p(x_i | θ) p(y_i | x_i, σ_{y,i}). A function y(x) exists mapping x → y, and
p(y_i | x_i, σ_{y,i}) = Normal(y_i; y(x_i), σ_{y,i}).
[Graphical model over N objects: Level-0, population θ; Level-1, individual object-intrinsic x_0, x_1, ..., x_n; Level-1, individual object-observable y_0, y_1, ..., y_n.]

Exoplanets

Mean velocity of the centre of mass v_0; semi-amplitude; time period T; eccentricity e; angle of pericentre from the ascending node; time of passage through the pericentre t.

[Figures from Hogg et al. 2010.]

3D extinction, E(B−V)(s). Pan-STARRS 1 and 2MASS. Green et al. 2015.

3D extinction maps.
Observables: y = (J, H, K, T_eff, log g, [M/H], l, b).
Intrinsic parameters: x = ([M/H], t, m, s, l, b, e).
Distance-extinction relationship: p(e | s, θ).
Posterior: p(x | y, σ_y, θ) ∝ p(y | x, σ_y) p(x | θ) p(θ).
Likelihood: p(y | x, σ_y) = Normal(y; y(x), σ_y).
Prior: p(x | θ) ∝ p(t) p(m) p([M/H]) p(s) p(e | s, θ) p(θ).

How to solve BHM models. Two-step (Hogg et al. 2010):
p(θ | {y_i}, σ_y) ∝ p(θ) ∏_i ∫ dx_i p(y_i | x_i, σ_{y,i}) p(x_i | θ)
∝ p(θ) ∏_i ∫ dx_i p(y_i | x_i, σ_{y,i}) p(x_i) [p(x_i | θ) / p(x_i)]
∝ p(θ) ∏_i Σ_k [p(x_ik | θ) / p(x_ik)],
where the x_ik are sampled from p(y_i | x_i, σ_{y,i}) p(x_i).
Metropolis-within-Gibbs (MWG): p(θ, {x_i} | {y_i}, {σ_{y,i}}) ∝ p(θ) ∏_i p(y_i | x_i, σ_{y,i}) p(x_i | θ).
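A minimal sketch of the two-step approach on a toy one-dimensional problem (the interim prior, noise level and population model below are illustrative assumptions, not the setup of Hogg et al. 2010): step 1 draws samples x_ik from each object's interim posterior, step 2 reweights them by p(x_ik | θ) / p(x_ik) to evaluate p(θ | {y_i}) on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup (illustrative): population p(x|theta) = N(theta, 1), measurements y_i = x_i + noise
theta_true, sigma_y, N, K = 1.5, 2.0, 200, 500
x_true = rng.normal(theta_true, 1.0, size=N)
y = x_true + rng.normal(0.0, sigma_y, size=N)

# Step 1: sample x_ik from the interim posterior p(y_i|x_i) p(x_i), with broad interim prior N(0, s0^2).
s0 = 10.0
post_var = 1.0 / (1.0 / sigma_y**2 + 1.0 / s0**2)
post_mean = post_var * (y / sigma_y**2)
x_samples = rng.normal(post_mean[:, None], np.sqrt(post_var), size=(N, K))   # shape (N, K)

# Step 2: for each theta, reweight by p(x_ik|theta)/p(x_ik) and average over k.
def log_posterior(theta):
    log_ratio = norm.logpdf(x_samples, theta, 1.0) - norm.logpdf(x_samples, 0.0, s0)
    per_object = np.logaddexp.reduce(log_ratio, axis=1) - np.log(K)   # log mean_k ratio, per object
    return per_object.sum()                                           # flat prior p(theta) assumed

grid = np.linspace(0.5, 2.5, 81)
logp = np.array([log_posterior(t) for t in grid])
print("posterior peak near theta =", grid[np.argmax(logp)])   # close to theta_true
```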

[Figure: MH algorithm — proposal density q(y|x_t) around the current point x_t, target density f(x).]

Image: Ryan Adams

Metropolis-within-Gibbs. A Gibbs sampler requires sampling from the conditional distributions; replace that with an MH step. Rather than updating all parameters at one time, one can update them one block at a time. The main strength is that a complicated distribution can be broken up into a sequence of smaller, easier samplings.
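A minimal Metropolis-within-Gibbs sketch for the measurement-error model from the earlier slides, p(θ, {x_i^t} | {x_i}) ∝ p(θ) ∏_i p(x_i^t | θ) p(x_i | x_i^t, σ_x); each latent x_i^t and the hyperparameter θ get their own MH update in turn (the population model N(θ, 1), the flat p(θ), the proposal widths and the toy data are assumptions for the example):

```python
import numpy as np

def log_norm(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

def mwg(x_obs, sigma_x, n_steps=4000, rng=None):
    """Metropolis-within-Gibbs: update each latent x_t[i], then theta, one block at a time."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(x_obs)
    x_t, theta = x_obs.copy(), x_obs.mean()       # initialize latents at the observations
    chain = []
    for _ in range(n_steps):
        # Conditional for x_t[i]: p(x_t[i] | theta, x_obs[i]) ~ p(x_t[i]|theta) p(x_obs[i]|x_t[i])
        prop = x_t + 0.5 * rng.normal(size=N)
        log_r = (log_norm(prop, theta, 1.0) + log_norm(x_obs, prop, sigma_x)
                 - log_norm(x_t, theta, 1.0) - log_norm(x_obs, x_t, sigma_x))
        accept = np.log(rng.random(N)) < log_r
        x_t = np.where(accept, prop, x_t)
        # Conditional for theta: p(theta | {x_t}) ~ prod_i p(x_t[i]|theta), flat prior on theta
        theta_prop = theta + 0.2 * rng.normal()
        log_r = log_norm(x_t, theta_prop, 1.0).sum() - log_norm(x_t, theta, 1.0).sum()
        if np.log(rng.random()) < log_r:
            theta = theta_prop
        chain.append(theta)
    return np.array(chain)

# Toy data: true theta = 2, latent x_t ~ N(theta, 1), observed with noise sigma_x = 1.5
rng = np.random.default_rng(1)
x_obs = rng.normal(2.0, 1.0, size=100) + rng.normal(0.0, 1.5, size=100)
print(mwg(x_obs, sigma_x=1.5)[1000:].mean())   # posterior mean of theta, near 2
```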

BMCMC, a Python package: pip install bmcmc, https://github.com/sanjibs/bmcmc. It can solve hierarchical Bayesian models. Documentation: http://bmcmc.readthedocs.io/en/latest/

Summary: hierarchical Bayesian models allow you to tackle a wide range of problems in astronomy. For more information on Monte Carlo based algorithms for solving Bayesian inference problems, see Sharma 2017, Annual Review of Astronomy and Astrophysics.