Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016

Contact information Please report any typos or errors to geoff.roeder@gmail.com

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Coin toss: point estimates for θ Probability model
Consider the experiment of tossing a coin n times. Each toss results in heads with probability θ and tails with probability 1 − θ. Let Y be a random variable denoting the number of observed heads in n coin tosses. Then we can model Y ∼ Bin(n, θ), with probability mass function

$p(Y = y \mid \theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y}$   (1)

We want to estimate the parameter θ.

Coin toss: point estimates for θ Maximum Likelihood
By interpreting p(Y = y | θ) as a function of θ rather than y, we get the likelihood function for θ. Let ℓ(θ | y) := log p(y | θ), the log-likelihood. Then

$\hat{\theta}_{ML} = \operatorname*{argmax}_{\theta} \ell(\theta \mid y) = \operatorname*{argmax}_{\theta} \; y \log\theta + (n - y)\log(1 - \theta)$   (2)

Since the log-likelihood is a concave function of θ, the maximizer is found by setting the derivative to zero:

$0 = \left.\frac{\partial \ell(\theta \mid y)}{\partial \theta}\right|_{\hat{\theta}_{ML}} = \frac{y}{\hat{\theta}_{ML}} - \frac{n - y}{1 - \hat{\theta}_{ML}} \;\;\Longrightarrow\;\; \hat{\theta}_{ML} = \frac{y}{n}$   (3)
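
As a quick numerical sanity check (not part of the original slides), the closed-form maximizer y/n can be compared against a brute-force maximization of the binomial log-likelihood; the values y = 7 and n = 10 below are made up for illustration.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# Hypothetical data: y heads observed in n tosses (illustrative values only)
n, y = 10, 7

# Negative binomial log-likelihood of theta (binomial coefficient handled by scipy)
def neg_log_lik(theta):
    return -binom.logpmf(y, n, theta)

# Closed-form MLE from equation (3): theta_hat = y / n
theta_closed_form = y / n

# Numerical maximization over (0, 1)
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print("closed form:", theta_closed_form)   # 0.7
print("numerical  :", res.x)               # ~0.7
```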

Coin toss: point estimates for θ Point estimate for θ: Maximum Likelihood
What if the sample size is small? The guarantee that $\hat{\theta}_{ML}$ approaches the true parameter is only an asymptotic result.

Coin toss: point estimates for θ Maximum A Posteriori
Alternative analysis: reverse the conditioning with Bayes' Theorem:

$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   (4)

This lets us encode our prior beliefs or knowledge about θ in a prior distribution for the parameter, p(θ). Recall that if p(y | θ) is in the exponential family, there exists a conjugate prior p(θ) such that if p(θ) ∈ F, then p(y | θ) p(θ) ∈ F. We saw last time that the binomial is in the exponential family, and θ ∼ Beta(α, β) is a conjugate prior:

$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$   (5)

Coin toss: point estimates for θ Maximum A Posteriori
Moreover, for any given realization y of Y, the marginal distribution $p(y) = \int p(y \mid \theta')\, p(\theta')\, d\theta'$ is a constant. Thus $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$, so that

$\hat{\theta}_{MAP} = \operatorname*{argmax}_{\theta}\, p(\theta \mid y) = \operatorname*{argmax}_{\theta}\, p(y \mid \theta)\, p(\theta) = \operatorname*{argmax}_{\theta}\, \log p(y \mid \theta)\, p(\theta)$

By taking the first partial derivative with respect to θ and setting it to zero at $\hat{\theta}_{MAP}$, we can derive

$\hat{\theta}_{MAP} = \frac{y + \alpha - 1}{n + \alpha + \beta - 2}$   (6)

Coin toss: point estimates for θ Point estimate for θ: Choosing α, β
The point estimate $\hat{\theta}_{MAP}$ shows that choosing α and β corresponds to having already seen prior data; we can encode the strength of our prior belief using these parameters. We can also choose an uninformative prior such as the Jeffreys prior, which for the beta-binomial model corresponds to (α, β) = (1/2, 1/2). Deriving an analytic form for the posterior is also possible when the prior is conjugate: we saw last week that for a single binomial experiment with a conjugate Beta prior,

$\theta \mid y \sim \mathrm{Beta}(\alpha + y,\; \beta + n - y)$
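
A minimal sketch (not from the slides) of this conjugate update, using the posterior Beta(α + y, β + n − y) and the MAP formula (6); the prior hyperparameters and data below are arbitrary illustrative choices.

```python
from scipy.stats import beta

# Illustrative data and prior hyperparameters (arbitrary choices)
n, y = 10, 7          # tosses and observed heads
a, b = 2.0, 2.0       # Beta(alpha, beta) prior

# Conjugate update: posterior is Beta(alpha + y, beta + n - y)
posterior = beta(a + y, b + n - y)

# MAP estimate from equation (6): (y + alpha - 1) / (n + alpha + beta - 2)
theta_map = (y + a - 1) / (n + a + b - 2)

print("posterior mean        :", posterior.mean())
print("posterior mode (MAP)  :", theta_map)
print("95% credible interval :", posterior.interval(0.95))
```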

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Hierarchical Models Introduction
Putting a prior on the parameter θ was pretty useful: we ended up with two parameters, α and β, that we could choose to formally encode our knowledge about the random process. Often, though, we want to go one step further and put a prior on the prior, rather than treating α and β as constants. Then θ is itself a sample from a population distribution.

Hierarchical Models Introduction
Example: now we have information available at different levels of the observational units. At each level, the observational units must be exchangeable. Informally, a joint probability distribution p(y_1, ..., y_n) is exchangeable if the indices on the y_i can be shuffled without changing the distribution. A hierarchical Bayesian model then introduces an additional prior distribution for each level of observational unit, allowing additional unobserved parameters to explain some dependencies in the model.

Hierarchical Models Introduction
Example: a clinical trial of a new cancer drug has been designed to compare the five-year survival probability in a population given the new drug to the five-year survival probability in a population under a standard treatment (Gelman et al. [2014]). Suppose the two drugs are administered in separate randomized experiments to patients in different cities. Within each city, the patients can be considered exchangeable, and the results from the different hospitals can also be considered exchangeable.

Hierarchical Models Introduction
Terminology note: with hierarchical Bayes, we have one set of parameters θ_j to model the survival probability of the patients y_ij in hospital j, and another set of parameters φ to model the random process governing the generation of the θ_j. Hence the θ_j are themselves given a probabilistic specification in terms of hyperparameters φ, through a hyperprior p(φ).

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Motivating example: Incidence of tumors in rodents Adapted from Gelman et al. (2014)
Let's develop a hierarchical model using the beta-binomial Bayesian approach seen so far.
Example: suppose we have the results of a clinical study of a drug in which rodents were exposed to either a dose of the drug or a control treatment (no dose). 4 out of 14 rodents in the control group developed tumors. We want to estimate θ, the probability that a rodent in the control group develops a tumor given no dose of the drug.

Motivating example: Incidence of tumors in rodents Data We also have the following data about the incidence of this kind of tumor in the control groups of other studies: Figure: Gelman et al. 2014 p.102

Motivating example: Incidence of tumors in rodents Bayesian analysis: setup
Including the current experimental results, we have information on 71 random variables θ_1, ..., θ_71. We can model the current and historical proportions as a random sample from some unknown population distribution: each y_j is independent binomial data, given the sample size n_j and the experiment-specific θ_j. Each θ_j is in turn generated by a random process governed by a population distribution that depends on the parameters α and β.

Motivating example: Incidence of tumors in rodents Bayesian analysis: model
This relationship can be depicted graphically as follows:
Figure: Hierarchical model (Gelman et al. 2014 p.103)

Motivating example: Incidence of tumors in rodents Bayesian analysis: probability model
Formally, the posterior distribution is now over the vector (θ, α, β). The joint prior distribution is

$p(\theta, \alpha, \beta) = p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)$   (7)

and the joint posterior distribution is

$p(\theta, \alpha, \beta \mid y) \propto p(\theta, \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) = p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) = p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta)$   (8)

Motivating example: Incidence of tumors in rodents Bayesian analysis: joint posterior density
Since the beta prior is conjugate, we can derive the joint posterior distribution analytically. Each y_j is conditionally independent of the hyperparameters α, β given θ_j. Hence, the likelihood function is still

$p(y \mid \theta, \alpha, \beta) = p(y \mid \theta) = p(y_1, \ldots, y_J \mid \theta_1, \ldots, \theta_J) = \prod_{j=1}^{J} p(y_j \mid \theta_j) = \prod_{j=1}^{J} \binom{n_j}{y_j}\, \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}$   (9)

Now we also have a population distribution p(θ | α, β):

$p(\theta \mid \alpha, \beta) = p(\theta_1, \ldots, \theta_J \mid \alpha, \beta) = \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1}$   (10)

Motivating example: Incidence of tumors in rodents Bayesian analysis: joint posterior density
Then, substituting equations (9) and (10) into (8), the unnormalized joint posterior distribution p(θ, α, β | y) is

$p(\alpha, \beta)\, \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}$   (11)

We can also determine analytically the conditional posterior density of θ = (θ_1, ..., θ_J):

$p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\, \theta_j^{\alpha + y_j - 1} (1 - \theta_j)^{\beta + n_j - y_j - 1}$   (12)

Note that equation (12), the conditional posterior, is a function of (α, β): each θ_j depends on the hyperparameters (α, β), which are themselves given the hyperprior p(α, β).

Motivating example: Incidence of tumors in rodents Bayesian analysis: marginal posterior distribution of (α, β)
To compute the marginal posterior density, observe that conditioning on y in equation (7) gives

$p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)}$   (13)

whose numerator and denominator are (up to a constant) equations (11) and (12) on the previous slide. Hence,

$p(\alpha, \beta \mid y) \propto \frac{p(\alpha, \beta)\, \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}}{\prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\, \theta_j^{\alpha + y_j - 1} (1 - \theta_j)^{\beta + n_j - y_j - 1}} = p(\alpha, \beta)\, \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)}$   (14)

which is computationally tractable, given a prior for (α, β).
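
A sketch (not from the slides) of how equation (14) can be evaluated on a grid of (α, β) values. The handful of (y_j, n_j) pairs below are placeholders rather than the 71 experiments from Gelman et al., and the flat hyperprior p(α, β) ∝ 1 is an illustrative assumption.

```python
import numpy as np
from scipy.special import gammaln

# Placeholder data: (tumors, group size) for a few control groups (not the real 71 experiments)
y = np.array([0, 2, 1, 4, 5])
n = np.array([20, 13, 19, 14, 22])

def log_marginal_posterior(alpha, beta_, y, n):
    """Log of equation (14) up to an additive constant, assuming a flat hyperprior p(alpha, beta)."""
    return np.sum(
        gammaln(alpha + beta_) - gammaln(alpha) - gammaln(beta_)
        + gammaln(alpha + y) + gammaln(beta_ + n - y)
        - gammaln(alpha + beta_ + n)
    )

# Evaluate on a coarse grid and normalize
alphas = np.linspace(0.5, 6.0, 60)
betas = np.linspace(2.0, 30.0, 60)
logp = np.array([[log_marginal_posterior(a, b, y, n) for b in betas] for a in alphas])
p = np.exp(logp - logp.max())
p /= p.sum()

ia, ib = np.unravel_index(p.argmax(), p.shape)
print("grid mode of (alpha, beta):", alphas[ia], betas[ib])
```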

Summary: beta-binomial hierarchical model
We started by wanting to understand the true proportion of rodents in the control group of a clinical study that developed a tumor. By modelling the relationship between the different trials hierarchically, we were able to bring our uncertainty about the hyperparameters (α, β) into the model. Using analytical methods, we arrived at a marginal posterior for (α, β) that is tractable given a suitable population prior, and from which draws can be simulated in order to estimate (α, β).

Bayesian Hierarchical Models Extension
In general, if θ is the population parameter for an observable x, and φ is a hyperparameter with hyperprior p(φ), then

$p(\theta, \phi \mid x) = \frac{p(x \mid \theta, \phi)\, p(\theta, \phi)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta \mid \phi)\, p(\phi)}{p(x)}$   (15)

The model can be extended with more levels by adding hyperpriors and hyperparameter vectors, leading to the factored form

$p(\theta, \phi, \psi \mid x) = \frac{p(x \mid \theta)\, p(\theta \mid \phi)\, p(\phi \mid \psi)\, p(\psi)}{p(x)}$   (16)
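
For intuition, ancestral sampling from such a factored model just follows the levels in order: draw the hyperparameters, then the parameters, then the data. The sketch below uses the beta-binomial hierarchy from this lecture; the exponential hyperprior on (α, β) is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hierarchy(J=5, n_per_group=20):
    """Ancestral sampling: (alpha, beta) -> theta_j -> y_j. The hyperprior here is illustrative."""
    alpha, beta_ = rng.exponential(scale=2.0, size=2) + 0.5   # hyperprior p(alpha, beta): assumption
    theta = rng.beta(alpha, beta_, size=J)                    # population distribution p(theta_j | alpha, beta)
    y = rng.binomial(n_per_group, theta)                      # likelihood p(y_j | theta_j)
    return (alpha, beta_), theta, y

(alpha, beta_), theta, y = sample_hierarchy()
print("hyperparameters (alpha, beta):", alpha, beta_)
print("group-level probabilities    :", theta)
print("observed counts              :", y)
```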

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Bayesian Model Selection Problem definition The model selection problem: Given a set of models (i.e., families of parametric distributions) of different complexity, how should we choose the best one?

Bayesian Model Selection Bayesian solution (Adapted from Murphy 2012)
Bayesian approach: compare the posterior over models H_k ∈ H,

$p(H_k \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid H_k)\, p(H_k)}{\sum_{H' \in \mathcal{H}} p(H', \mathcal{D})}$   (17)

then select the MAP model as best:

$\hat{H}_{MAP} = \operatorname*{argmax}_{H \in \mathcal{H}}\, p(H \mid \mathcal{D})$   (18)

If we adopt a uniform prior over models to represent our uncertainty about the choice of model, so that p(H_k) ∝ 1, then

$\hat{H}_{MAP} = \operatorname*{argmax}_{H \in \mathcal{H}}\, p(H \mid \mathcal{D}) = \operatorname*{argmax}_{H \in \mathcal{H}}\, p(\mathcal{D} \mid H)$   (19)

and so the problem reduces to choosing the model which maximizes the marginal likelihood (also called the evidence):

$p(\mathcal{D} \mid H_k) = \int p(\mathcal{D} \mid \theta_k, H_k)\, p(\theta_k \mid H_k)\, d\theta_k$   (20)
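
As a concrete illustration (not in the slides), the marginal likelihood (20) is available in closed form for the beta-binomial model, so comparing two candidate priors on θ amounts to comparing two evidence values. The data and the two priors below are made-up examples.

```python
from math import comb
import numpy as np
from scipy.special import betaln

# Hypothetical data: y heads in n tosses
n, y = 20, 15

def log_evidence_beta_binomial(y, n, a, b):
    """log p(D | H) = log C(n, y) + log B(a + y, b + n - y) - log B(a, b)."""
    return np.log(comb(n, y)) + betaln(a + y, b + n - y) - betaln(a, b)

# Two candidate models (illustrative): a flat prior vs a prior that favours heads
log_ev_H1 = log_evidence_beta_binomial(y, n, 1.0, 1.0)   # H1: theta ~ Beta(1, 1)
log_ev_H2 = log_evidence_beta_binomial(y, n, 8.0, 2.0)   # H2: theta ~ Beta(8, 2)

print("log p(D | H1) =", log_ev_H1)
print("log p(D | H2) =", log_ev_H2)
print("evidence ratio p(D|H2)/p(D|H1) =", np.exp(log_ev_H2 - log_ev_H1))
```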

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Bayes Factors Adapted from Kass and Raftery (1995)
Bayes factors are a natural way to compare models using marginal likelihoods. In the simplest case, we have two hypotheses H = {H_1, H_2} about the random process which generated D, according to distributions p(D | H_1) and p(D | H_2). Recall the odds representation of probability; it gives a structure we can use in model selection:

$\text{odds} = \frac{\text{proportion of successes}}{\text{proportion of failures}} = \frac{\text{probability}}{1 - \text{probability}}$   (21)

Bayes Factors Derivation
Bayes' theorem says

$p(H_k \mid D) = \frac{p(D \mid H_k)\, p(H_k)}{\sum_{h \in \mathcal{H}} p(D \mid H_h)\, p(H_h)}$   (22)

Since p(H_1 | D) = 1 − p(H_2 | D) in the two-hypothesis case,

$\text{odds}(H_1 \mid D) = \frac{p(H_1 \mid D)}{p(H_2 \mid D)} = \frac{p(D \mid H_1)}{p(D \mid H_2)} \cdot \frac{p(H_1)}{p(H_2)}$   (23)

that is, posterior odds = Bayes factor × prior odds.

Bayes Factors
The prior odds are transformed into the posterior odds by the ratio of marginal likelihoods. The Bayes factor for model H_1 against H_2 is

$B_{12} = \frac{p(D \mid H_1)}{p(D \mid H_2)}$   (24)

The Bayes factor is a summary of the evidence provided by the data in favour of one hypothesis over another. Bayes factors can be interpreted with Jeffreys' scale of evidence:

B_jk          Evidence against H_k
1 to 3.2      Not worth more than a bare mention
3.2 to 10     Substantial
10 to 100     Strong
100 or above  Decisive

Bayes Factors Coin toss example (Adapted from Arnaud)
Suppose you toss a coin 6 times and observe 6 heads. If θ is the probability of getting heads, we can test H_1 : θ = 1/2 against H_2 : θ ∼ Unif(1/2, 1]. Then the Bayes factor for fair against biased is

$B_{12} = \frac{p(D \mid H_1)}{p(D \mid H_2)} = \frac{\int p(D \mid \theta_1, H_1)\, p(\theta_1 \mid H_1)\, d\theta_1}{\int p(D \mid \theta_2, H_2)\, p(\theta_2 \mid H_2)\, d\theta_2} = \frac{(\tfrac{1}{2})^{6} (1 - \tfrac{1}{2})^{6-6}}{\int_{1/2}^{1} 2\, \theta^{6} (1 - \theta)^{6-6}\, d\theta} = \frac{(\tfrac{1}{2})^{6}}{2 \int_{1/2}^{1} \theta^{6}\, d\theta} = \frac{7}{127} \approx 0.055,$

so the data favour the biased hypothesis H_2 by a factor of about 18.
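
The integral in the denominator is one-dimensional, so the value above is easy to check numerically (a sketch, not from the slides):

```python
from scipy.integrate import quad

n, y = 6, 6                        # six heads in six tosses

p_D_given_H1 = 0.5 ** n            # point hypothesis: theta = 1/2

# H2: theta ~ Unif(1/2, 1], which has density 2 on that interval
integrand = lambda theta: 2.0 * theta ** y * (1 - theta) ** (n - y)
p_D_given_H2, _ = quad(integrand, 0.5, 1.0)

print("B_12 =", p_D_given_H1 / p_D_given_H2)   # ~0.055, about 18-to-1 in favour of H2
```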

Bayes Factors Gaussian mean example (Adapted from Arnaud)
Suppose we have a random variable X | µ, σ² ∼ N(µ, σ²), where σ² is known but µ is unknown. Our two hypotheses are H_1 : µ = 0 vs H_2 : µ ∼ N(ξ, τ²). Then the Bayes factor for H_1 against H_2 is

$B_{12} = \frac{p(D \mid H_1)}{p(D \mid H_2)} = \frac{\int \mathcal{N}(x \mid \mu, \sigma^2)\, \delta_0(\mu)\, d\mu}{\int \mathcal{N}(x \mid \mu, \sigma^2)\, \mathcal{N}(\mu \mid \xi, \tau^2)\, d\mu} = \frac{\mathcal{N}(x \mid 0, \sigma^2)}{\mathcal{N}(x \mid \xi, \sigma^2 + \tau^2)}$

which for ξ = 0 reduces to

$B_{12} = \sqrt{\frac{\sigma^2 + \tau^2}{\sigma^2}}\, \exp\!\left\{ -\frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)} \right\}$   (25)
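
A quick check (not from the slides) of equation (25) against direct numerical marginalization over µ, for arbitrary illustrative values of x, σ, and τ with ξ = 0:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x, sigma, tau, xi = 1.5, 1.0, 2.0, 0.0     # illustrative values

# Closed form: p(x | H2) = N(x | xi, sigma^2 + tau^2)
B12_closed = norm.pdf(x, 0.0, sigma) / norm.pdf(x, xi, np.sqrt(sigma**2 + tau**2))

# Numerical marginalization over mu under H2
integrand = lambda mu: norm.pdf(x, mu, sigma) * norm.pdf(mu, xi, tau)
p_x_H2, _ = quad(integrand, -50, 50)
B12_numeric = norm.pdf(x, 0.0, sigma) / p_x_H2

print("closed form:", B12_closed)
print("numeric    :", B12_numeric)
```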

Bayes Factors Key points
Bayes factors allow you to compare models with different parameter spaces: the parameters are marginalized out in the integral. Thus, unlike ML-based model comparison methods, Bayes factors do not automatically favour more complex models; there is built-in protection against overfitting.
Recall that AIC is −2 log(likelihood) + 2K, where K is the number of parameters in the model. Since it is based on the ML estimate of the parameters, which is prone to overfitting, the likelihood term is biased towards more complex models and must be penalized by the 2K term.
Bayes factors are sensitive to the prior. In the Gaussian example, as τ → ∞ the evidence for H_2 vanishes and B_12 → ∞ regardless of the data x: if the prior on a hypothesis is vague, Bayes factor selection will not favour that hypothesis.

Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case

Computing Marginal Likelihood (Adapted from Murphy 2013)
Suppose we write the prior as

$p(\theta) = \frac{q(\theta)}{Z_0} \qquad \left( = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \right),$   (26)

the likelihood as

$p(\mathcal{D} \mid \theta) = \frac{q(\mathcal{D} \mid \theta)}{Z_\ell} \qquad \left( = \frac{\theta^{y}(1 - \theta)^{n - y}}{\binom{n}{y}^{-1}} \right),$   (27)

and the posterior as

$p(\theta \mid \mathcal{D}) = \frac{q(\theta \mid \mathcal{D})}{Z_N} \qquad \left( = \frac{\theta^{\alpha + y - 1}(1 - \theta)^{\beta + n - y - 1}}{B(\alpha + y, \beta + n - y)} \right).$   (28)

Computing Marginal Likelihood (Adapted from Murphy 2013)
Then:

$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \;\;\Longrightarrow\;\; \frac{q(\theta \mid \mathcal{D})}{Z_N} = \frac{q(\mathcal{D} \mid \theta)\, q(\theta)}{Z_\ell\, Z_0\, p(\mathcal{D})} \;\;\Longrightarrow\;\; p(\mathcal{D}) = \frac{Z_N}{Z_0 Z_\ell} = \binom{n}{y} \frac{B(\alpha + y, \beta + n - y)}{B(\alpha, \beta)}$   (29)

The computation of the marginal likelihood reduces to a ratio of normalizing constants in this conjugate case.
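
A short check (not from the slides) that the ratio of normalizing constants in (29) agrees with direct numerical integration of the marginal likelihood, for arbitrary example values:

```python
from math import comb
from scipy.integrate import quad
from scipy.special import beta as beta_fn
from scipy.stats import beta, binom

n, y = 10, 7            # illustrative data
a, b = 2.0, 3.0         # illustrative Beta prior

# Shortcut from equation (29): p(D) = C(n, y) * B(a + y, b + n - y) / B(a, b)
evidence_shortcut = comb(n, y) * beta_fn(a + y, b + n - y) / beta_fn(a, b)

# Direct numerical integration of p(D) = integral of p(D | theta) p(theta) dtheta
integrand = lambda t: binom.pmf(y, n, t) * beta.pdf(t, a, b)
evidence_numeric, _ = quad(integrand, 0.0, 1.0)

print("shortcut:", evidence_shortcut)
print("numeric :", evidence_numeric)
```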

References I
Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2014.
Kevin Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2013.
Arnaud Doucet. STAT 535C: Statistical Computing, course lecture slides (2009). Accessed 14 January 2016 from http://www.cs.ubc.ca/~arnaud/stat535.html

References II
Robert E. Kass and Adrian E. Raftery. Bayes Factors. Journal of the American Statistical Association, Vol. 90, No. 430, pp. 773-795, 1995.