Lecture 2: Priors and Conjugacy


Lecture 2: Priors and Conjugacy
Melih Kandemir (melih.kandemir@iwr.uni-heidelberg.de)
May 6, 2014

Some nice courses:
- Fred A. Hamprecht (Heidelberg U.): https://www.youtube.com/watch?v=j66rrnzzkow
- Michael I. Jordan (U. Berkeley): http://www.cs.berkeley.edu/~jordan/courses/260-spring10/
- David Blei (U. Princeton): http://www.cs.princeton.edu/courses/archive/fall07/cos597c/syllabus.html
- Andrew Ng (Stanford U.): https://www.youtube.com/watch?v=uzxylbk2c7e
- Taylan Cemgil (Bogazici U.): http://dl.dropboxusercontent.com/u/9787379/cmpe58k/index.html

Maximum likelihood estimation
Given observed data X and the assumption that X ~ p(X | θ), the maximum likelihood estimate (MLE) is
θ̂ = argmax_θ p(X | θ)
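As a concrete illustration (not from the slides), a minimal Python sketch of the MLE for a Bernoulli parameter; the data values are made up:

    import numpy as np

    # Hypothetical coin-toss data: 1 = heads, 0 = tails.
    X = np.array([1, 0, 1, 1, 0, 1, 1, 1])

    # For Bernoulli(theta), the log-likelihood is
    #   log p(X | theta) = K*log(theta) + (N - K)*log(1 - theta),
    # which is maximized at theta_hat = K / N.
    theta_hat = X.mean()
    print("MLE of heads probability:", theta_hat)  # 0.75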

What are priors for?
- To encode prior beliefs.
- To avoid overfitting.
- To control model complexity: i) inducing sparsity (regularization), ii) via the marginal likelihood.
- To allow marginalization of model parameters (the parameter is represented as a distribution, not a point estimate).

Overfitting (figure; from Bishop, Pattern Recognition and Machine Learning)

Types of priors (from Z. Ghahramani's lecture):
- Objective priors: noninformative priors that attempt to capture ignorance and have good frequentist properties.
- Subjective priors: priors should capture our beliefs as well as possible. They are subjective but not arbitrary. The key ingredient of Bayesian methods is not the prior, it is the idea of averaging over different possibilities.
- Hierarchical priors: multiple levels of priors, p(θ) = ∫ p(θ | α) p(α) dα, where p(α) is called a hyperprior.
- Empirical priors: learn some of the parameters of the prior from the data (i.e. Empirical Bayes!)

Empirical priors (from Z. Ghahramani's lecture)
Given: p(D | α) = ∫ p(D | θ) p(θ | α) dθ, where α is the vector of hyperparameters.
Estimation: α̂ = argmax_α p(D | α). This method is called Type II Maximum Likelihood.
Prediction: p(x | D, α̂) = ∫ p(x | θ) p(θ | D, α̂) dθ
Plus: tuning the prior belief to the data.
Minus: double counting of the data, hence overfitting.
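A minimal sketch (not from the slides) of Type II Maximum Likelihood for a Beta prior shared across several hypothetical coins, using NumPy/SciPy; all data values and starting points are assumptions:

    import numpy as np
    from scipy.special import betaln, gammaln
    from scipy.optimize import minimize

    # Hypothetical data: several coins, each tossed N_i times with K_i heads.
    # Each coin has its own theta_i, assumed drawn from a shared Beta(alpha, beta) prior.
    K = np.array([3, 7, 5, 9, 2])
    N = np.array([10, 10, 10, 10, 10])

    def neg_log_marginal(log_ab):
        # -log p(D | alpha, beta) = -sum_i log BetaBinomial(K_i | N_i, alpha, beta),
        # parameterized by log(alpha), log(beta) to keep both positive.
        a, b = np.exp(log_ab)
        log_binom = gammaln(N + 1) - gammaln(K + 1) - gammaln(N - K + 1)
        return -np.sum(log_binom + betaln(a + K, b + N - K) - betaln(a, b))

    res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]))
    alpha_hat, beta_hat = np.exp(res.x)
    print("Type II ML hyperparameters:", alpha_hat, beta_hat)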

Bayesian Inference
Posterior: p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
Prediction: p(x | D) = ∫ p(x | θ) p(θ | D) dθ
How can we calculate the integrals above?
- Approximate: Variational Bayes, MCMC, Laplace.
- Closed-form solution: conjugate priors.
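To make the two integrals concrete, here is a minimal grid-approximation sketch in Python (not from the slides; the coin data and the uniform prior are assumptions for illustration):

    import numpy as np

    # Hypothetical coin data: K heads in N tosses; Bernoulli likelihood.
    N, K = 20, 14

    # Brute-force grid approximation of the integrals over theta.
    theta = np.linspace(1e-4, 1 - 1e-4, 2000)
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)                    # uniform prior on [0, 1]
    likelihood = theta**K * (1 - theta)**(N - K)   # p(D | theta)

    # Posterior: p(theta | D) = p(D | theta) p(theta) / integral of the numerator.
    unnorm = likelihood * prior
    posterior = unnorm / (unnorm.sum() * dtheta)

    # Predictive probability of heads on the next toss:
    # p(x = 1 | D) = integral of theta * p(theta | D) dtheta.
    p_next_heads = (theta * posterior).sum() * dtheta
    print("Predictive P(heads | D):", p_next_heads)  # ~ (K + 1)/(N + 2) = 0.682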

Conjugate prior
If p(θ | D) is in the same family as p(θ), then p(θ) is called a conjugate prior for the likelihood p(D | θ). (H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory, 1961.)

Beta distribution
PDF: Beta(x | α, β) = x^(α-1) (1 - x)^(β-1) / ∫_0^1 t^(α-1) (1 - t)^(β-1) dt
Mean: E[x] = α / (α + β)
E[ln x] = ψ(α) - ψ(α + β), where ψ(·) is the digamma function.
Variance: var[x] = αβ / ((α + β)^2 (α + β + 1))
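A quick numerical check of these moment formulas with SciPy (a sketch, not from the slides; the values of α and β are arbitrary):

    import numpy as np
    from scipy.stats import beta
    from scipy.special import digamma

    a, b = 2.0, 5.0
    print(beta.mean(a, b), a / (a + b))                         # both 0.2857...
    print(beta.var(a, b), a * b / ((a + b)**2 * (a + b + 1)))   # both 0.0255...

    # E[ln x] = psi(a) - psi(a + b), checked by Monte Carlo.
    samples = beta.rvs(a, b, size=200_000, random_state=0)
    print(np.log(samples).mean(), digamma(a) - digamma(a + b))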

PDF of the Beta distribution (figure; from http://www.ntrand.com/beta-distribution/)

Binomial distribution
For x ∈ {0, …, n},
Bin(x | n, p) = (n choose x) p^x (1 - p)^(n - x)
- n: number of repetitions of a binary experiment.
- p: success probability.
- x: number of successes in n trials.
Bernoulli distribution: Bin(x | 1, p).

Beta-Binomial conjugacy
Data: N binary observations X = [x_1, x_2, …, x_N] (e.g. N i.i.d. coin tosses with K heads).
Problem: analyze how fair the coin is (i.e. what is the true heads probability?).
Model:
- Likelihood: p(D | θ) = Bin(x = K | N, θ)
- Prior: p(θ) = Beta(θ | α, β)
- Posterior: p(θ | D) = Beta(θ | α + K, β + N - K)
Posterior mean: E[θ | D] = (α + K) / (α + β + N)
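A minimal sketch of this update in Python (the coin counts and the Beta(2, 2) prior are assumptions for illustration):

    from scipy.stats import beta

    # Hypothetical coin data and a Beta(2, 2) prior.
    N, K = 20, 14
    alpha0, beta0 = 2.0, 2.0

    # Conjugate update: Beta prior + Binomial likelihood -> Beta posterior.
    alpha_post, beta_post = alpha0 + K, beta0 + (N - K)
    posterior = beta(alpha_post, beta_post)

    print("Posterior mean:", posterior.mean())            # (alpha0 + K)/(alpha0 + beta0 + N) = 0.667
    print("95% credible interval:", posterior.interval(0.95))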

Dirichlet distribution
PDF: Dir(x_1, …, x_K | α) = (1 / B(α)) ∏_{i=1}^K x_i^(α_i - 1)
where α = [α_1, …, α_K] and B(α) = ∏_{i=1}^K Γ(α_i) / Γ(∑_{i=1}^K α_i).
Mean: E[x_i] = α_i / ∑_{k=1}^K α_k
E[ln x_i] = ψ(α_i) - ψ(∑_{i=1}^K α_i)

Dirichlet distribution (2) (figure; from http://projects.csail.mit.edu/church/wiki/models_with_Unbounded_Complexity)

Multinomial distribution
For x_i ∈ {0, …, N}, ∑_{i=1}^K x_i = N, and ∑_{i=1}^K p_i = 1,
Mult(x_1, …, x_K | N, p_1, …, p_K) = (N! / (x_1! ⋯ x_K!)) p_1^(x_1) p_2^(x_2) ⋯ p_K^(x_K)

Dirichlet-multinomial conjugacy
Let X = [x_1, …, x_N] be N i.i.d. draws from a multinomial distribution. The likelihood is
P(X | θ) = θ_1^(∑_{j=1}^N I(x_j = 1)) ⋯ θ_K^(∑_{j=1}^N I(x_j = K))
When the Dirichlet prior
P(θ | α) ∝ θ_1^(α_1 - 1) ⋯ θ_K^(α_K - 1)
is applied, the posterior becomes
P(θ | X, α) ∝ θ_1^(∑_{j=1}^N I(x_j = 1) + α_1 - 1) ⋯ θ_K^(∑_{j=1}^N I(x_j = K) + α_K - 1)

Dirichlet-multinomial conjugacy (2) (from M. I. Jordan's lecture 4)
The posterior mean is
E[θ_i | x, α] = (α_i + ∑_{j=1}^N I(x_j = i)) / (N + ∑_{l=1}^K α_l)
             = κ · α_i / ∑_{l=1}^K α_l + (1 - κ) · x̄_i,
where x̄_i = (1/N) ∑_{j=1}^N I(x_j = i) is the maximum likelihood estimator, α_i / ∑_l α_l is the prior mean, and κ = ∑_l α_l / (N + ∑_l α_l) ∈ (0, 1).
Implications:
- The posterior mean is a convex combination of the MLE and the prior mean.
- κ → 0 as N → ∞.
- E[θ_i | x, α] is a shrinkage estimator!
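A small numerical sketch of the shrinkage identity above (the counts and the Dirichlet prior are assumed for illustration):

    import numpy as np

    # Hypothetical categorical data (K = 3 categories) and a symmetric Dirichlet prior.
    counts = np.array([12, 5, 3])          # sum_j I(x_j = i) for each category i
    alpha = np.array([2.0, 2.0, 2.0])
    N = counts.sum()

    # Posterior is Dirichlet(alpha + counts); its mean shrinks the MLE toward the prior mean.
    posterior_mean = (alpha + counts) / (N + alpha.sum())

    mle = counts / N
    prior_mean = alpha / alpha.sum()
    kappa = alpha.sum() / (N + alpha.sum())
    print(posterior_mean)
    print(kappa * prior_mean + (1 - kappa) * mle)   # identical to posterior_mean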

Logarithm of the normal distribution
N(x | μ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))
log N(x | μ, Σ) = -(D/2) log 2π - (1/2) log |Σ| - (1/2) (x - μ)^T Σ^(-1) (x - μ)
= (1/2) log |Σ^(-1)| - (1/2) x^T Σ^(-1) x - (1/2) μ^T Σ^(-1) μ + x^T Σ^(-1) μ + const
= (1/2) log |Σ^(-1)| - (1/2) tr(Σ^(-1) x x^T) - (1/2) tr(Σ^(-1) μ μ^T) + x^T Σ^(-1) μ + const
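A quick NumPy check (a sketch, not part of the slides) of the trace identity and the expanded log-density, with randomly generated μ, Σ, and x:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Numerically verify the identity x^T A x = tr(A x x^T) used above.
    rng = np.random.default_rng(0)
    D = 3
    x = rng.normal(size=D)
    A = rng.normal(size=(D, D)); A = A @ A.T        # symmetric positive definite
    print(x @ A @ x, np.trace(A @ np.outer(x, x)))  # the two numbers agree

    # Check the expanded log-density against scipy's implementation.
    mu = rng.normal(size=D)
    Sigma = np.eye(D) + 0.1 * A
    Lam = np.linalg.inv(Sigma)
    logpdf = (-0.5 * D * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(Lam)[1]
              - 0.5 * np.trace(Lam @ np.outer(x, x)) - 0.5 * np.trace(Lam @ np.outer(mu, mu))
              + x @ Lam @ mu)
    print(logpdf, multivariate_normal(mu, Sigma).logpdf(x))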

Normal distribution with known variance (one sample)
The model:
x | μ, σ² ~ N(x | μ, σ²)
μ ~ N(μ | μ₀, σ₀²)
σ² known
The posterior is then
p(μ | x, σ²) = N(μ | (x/σ² + μ₀/σ₀²) / (1/σ² + 1/σ₀²), (1/σ² + 1/σ₀²)^(-1))
Derivation of this result is to be done on the whiteboard.

Normal distribution with known variance (multiple samples)
The model:
x_1, …, x_N | μ, σ² ~ N(x | μ, σ²)
μ ~ N(μ | μ₀, σ₀²)
σ² known
The posterior is then
p(μ | x, σ²) = N(μ | (∑_{i=1}^N x_i / σ² + μ₀/σ₀²) / (N/σ² + 1/σ₀²), (N/σ² + 1/σ₀²)^(-1))
Derivation of this result is to be done on the whiteboard.
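A minimal sketch of this update (the data are generated with an assumed true mean of 3.0; the prior values are also assumptions):

    import numpy as np

    # Hypothetical data from a normal with known variance sigma2.
    rng = np.random.default_rng(1)
    sigma2 = 4.0
    x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)
    N = len(x)

    # Prior N(mu | mu0, s0sq).
    mu0, s0sq = 0.0, 10.0

    # Conjugate update for the mean: precisions add, means are precision-weighted.
    post_var = 1.0 / (N / sigma2 + 1.0 / s0sq)
    post_mean = post_var * (x.sum() / sigma2 + mu0 / s0sq)
    print("Posterior mean and variance:", post_mean, post_var)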

Gamma distribution
PDF: G(x | a, b) = x^(a-1) e^(-bx) / Z(a, b)
Mean: E[x] = a / b

Gamma distribution (2)

Normal distribution with known mean, unknown variance
The model:
x_1, …, x_N | μ, σ² ~ N(x | μ, σ²) = N(x | μ, (τ²)^(-1))
μ known
τ² ~ G(τ² | a, b), where τ² = 1/σ² is the precision.
The posterior is then
p(τ² | x, μ) = G(τ² | a + N/2, b + (1/2) ∑_{i=1}^N (x_i - μ)²).
Derivation of this result is to be done on the whiteboard.
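A minimal sketch of this precision update (the data, prior values, and the "true" variance are assumptions; note that SciPy's gamma uses scale = 1/b):

    import numpy as np
    from scipy.stats import gamma

    # Hypothetical data with known mean mu and unknown variance.
    rng = np.random.default_rng(2)
    mu, true_sigma = 1.0, 2.0
    x = rng.normal(loc=mu, scale=true_sigma, size=100)
    N = len(x)

    # Gamma(a, b) prior on the precision tau2 = 1/sigma^2.
    a, b = 1.0, 1.0
    a_post = a + N / 2.0
    b_post = b + 0.5 * np.sum((x - mu) ** 2)

    post = gamma(a=a_post, scale=1.0 / b_post)
    print("Posterior mean of the precision:", post.mean())  # close to 1/true_sigma^2 = 0.25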

Wishart distribution
PDF: W(X | V, ν) = (1 / Z(V, ν)) |X|^((ν - D - 1)/2) e^(-(1/2) tr(V^(-1) X))
Mean: νV
Relation to the normal distribution:
s_1, …, s_ν ~ N(s | 0, V), S = [s_1; s_2; …; s_ν], X = S^T S

Multivariate normal distribution with unknown covariance
The model:
x_1, …, x_N | μ, Σ ~ N(x | μ, Σ) = N(x | μ, Λ^(-1))
μ known
Λ ~ W(Λ | W_0, ν)
The posterior is then
p(Λ | x_1, …, x_N, μ) = W(Λ | (W_0^(-1) + ∑_{i=1}^N (x_i - μ)(x_i - μ)^T)^(-1), ν + N).
Derivation of this result is to be done on the whiteboard.
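A sketch of this Wishart update with SciPy (the 2-D data, the true covariance, and the prior W_0, ν are assumptions for illustration):

    import numpy as np
    from scipy.stats import wishart

    # Hypothetical 2-D data with known mean and unknown precision matrix Lambda.
    rng = np.random.default_rng(3)
    mu = np.zeros(2)
    true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
    x = rng.multivariate_normal(mu, true_cov, size=200)

    # Wishart(W0, nu) prior on Lambda.
    D = 2
    W0, nu = np.eye(D), D + 1
    scatter = (x - mu).T @ (x - mu)                  # sum_i (x_i - mu)(x_i - mu)^T

    W_post = np.linalg.inv(np.linalg.inv(W0) + scatter)
    nu_post = nu + len(x)

    # Posterior mean of Lambda is nu_post * W_post; invert to compare with true_cov.
    post_mean_precision = nu_post * W_post
    print(np.linalg.inv(post_mean_precision))        # roughly recovers true_cov
    # Draw a posterior sample (df = nu_post, scale = W_post).
    print(wishart(df=nu_post, scale=W_post).rvs(random_state=0))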

Recap (prior × likelihood = posterior family):
- Beta × Binomial = Beta
- Dirichlet × Multinomial = Dirichlet
- Normal × Normal = Normal
- Gamma × Normal = Gamma
- Wishart × Normal = Wishart