an introduction to bayesian inference


with an application to network analysis
jake hofman · http://jakehofman.com
january 13, 2010

motivation: we would like models that provide predictive and explanatory power, are complex enough to describe observed phenomena, and are simple enough to generalize to future observations.

claim: bayesian inference provides a systematic framework to infer such models from observed data

motivation: the principles behind the bayesian interpretation of probability and bayesian inference are well established (bayes, laplace, etc., 18th century), and recent advances in mathematical techniques and computational resources have enabled successful applications of these principles to real-world problems

motivation: a bayesian approach to network modularity

outline

1. principles (what we'd like to do)
   - probability theory: joint, marginal, and conditional probabilities
   - bayes' rule: inverting conditional probabilities
   - bayesian probability: unknowns as random variables
   - bayes' rule + bayesian probability: bayesian inference
2. practice (what we're able to do)
   - monte carlo methods: representative samples
   - variational methods: bound optimization
3. references

joint, marginal, and conditional probabilities

joint distribution $p_{XY}(X = x, Y = y)$: probability that $X = x$ and $Y = y$
conditional distribution $p_{X|Y}(X = x \mid Y = y)$: probability that $X = x$ given $Y = y$
marginal distribution $p_X(X = x)$: probability that $X = x$ (regardless of $Y$)

sum and product rules

sum rule (sum out settings of irrelevant variables):
$$p(x) = \sum_{y \in \Omega_Y} p(x, y) \tag{1}$$

product rule (the joint as the product of the conditional and marginal):
$$p(x, y) = p(x \mid y)\, p(y) \tag{2}$$
$$\phantom{p(x, y)} = p(y \mid x)\, p(x) \tag{3}$$
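a minimal python sketch of the sum and product rules on a toy joint table (the table entries and variable names are invented for illustration, not part of the original slides):

```python
# toy joint distribution p(x, y); entries are invented for illustration
p_xy = {
    ("rain", "umbrella"): 0.3, ("rain", "no_umbrella"): 0.1,
    ("sun",  "umbrella"): 0.1, ("sun",  "no_umbrella"): 0.5,
}

# sum rule: marginalize y out of the joint to get p(x)
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# product rule rearranged: conditional = joint / marginal
p_y_given_x = {(y, x): p_xy[(x, y)] / p_x[x] for (x, y) in p_xy}

print({x: round(p, 3) for x, p in p_x.items()})     # {'rain': 0.4, 'sun': 0.6}
print(round(p_y_given_x[("umbrella", "rain")], 3))  # 0.75
```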


inverting conditional probabilities

equate the far right- and left-hand sides of the product rule,
$$p(y \mid x)\, p(x) = p(x, y) = p(x \mid y)\, p(y), \tag{4}$$
and divide to get bayes' rule (bayes and price 1763), the probability of $Y$ given $X$ from the probability of $X$ given $Y$:
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \tag{5}$$
where $p(x) = \sum_{y \in \Omega_Y} p(x \mid y)\, p(y)$ is the normalization constant

example: diagnoses a la bayes

population of 10,000; 1% has a (rare) disease; the test is 99% (relatively) effective, i.e. given a patient is sick, 99% test positive, and given a patient is healthy, 99% test negative.

given a positive test, what is the probability that the patient is sick? (subtlety: we assume the 1% fraction is known; example follows wiggins (2006))

example: diagnoses a la bayes

sick population (100 ppl): 99% test positive (99 ppl), 1% test negative (1 ppl)
healthy population (9,900 ppl): 1% test positive (99 ppl), 99% test negative (9,801 ppl)

so 99 sick patients test positive and 99 healthy patients test positive: given a positive test, there is a 50% probability that the patient is sick

example: diagnoses a la bayes

we know the probability of testing positive/negative given sick/healthy; use bayes' rule to invert to the probability of sick/healthy given a positive/negative test:
$$p(\text{sick} \mid \text{test+}) = \frac{\overbrace{p(\text{test+} \mid \text{sick})}^{99/100}\ \overbrace{p(\text{sick})}^{1/100}}{\underbrace{p(\text{test+})}_{99/100^2 + 99/100^2 = 198/100^2}} = \frac{99}{198} = \frac{1}{2} \tag{6}$$

most of the work is in calculating the denominator (the normalization)
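the same arithmetic as a short python sketch (the variable names are just for illustration):

```python
# bayes' rule for the diagnosis example: p(sick | test+)
p_sick = 0.01                # 1% of the population has the disease
p_pos_given_sick = 0.99      # sick patients test positive 99% of the time
p_pos_given_healthy = 0.01   # healthy patients test positive 1% of the time

# denominator p(test+): sum over sick and healthy (most of the work)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)      # 0.5
```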


interpretations of probabilities (just enough philosophy)

frequentists: probability is the limit of the relative frequency of events over a large number of trials
bayesians: probability is a measure of a state of knowledge, quantifying degrees of belief (jaynes 2003)

key difference: bayesians permit assignment of probabilities to unknown/unobservable hypotheses (frequentists do not)

interpretations of probabilities (just enough philosophy)

e.g., inferring model parameters $\Theta$ from observed data $D$:

frequentist approach: calculate the parameter setting that maximizes the likelihood of the observed data (a point estimate),
$$\hat{\Theta} = \underset{\Theta}{\operatorname{argmax}}\ p(D \mid \Theta) \tag{7}$$

bayesian approach: calculate a distribution over parameter settings given the data,
$$p(\Theta \mid D) = \ ? \tag{8}$$

interpretations of probabilities (just enough philosophy)

using bayes' rule ≠ being bayesian
(s/bayesian/subjective probabilist/g)


bayes' rule + bayesian probability: bayesian inference

treat unknown quantities as random variables and use bayes' rule to systematically update prior knowledge in the presence of observed data:
$$\underbrace{p(\Theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \Theta)}^{\text{likelihood}}\ \overbrace{p(\Theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}} \tag{9}$$

example: coin flipping observe independent coin flips (bernoulli trials) infer distribution over coin bias

example: coin flipping prior p(θ) over coin bias before observing flips

example: coin flipping observe flips: HTHHHTTHHHH

example: coin flipping update the posterior $p(\theta \mid D)$ using bayes' rule

example: coin flipping observe flips: HHHHHHHHHHHHTHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHH HHHHHHTHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHT

example: coin flipping update the posterior $p(\theta \mid D)$ using bayes' rule
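a minimal sketch of the conjugate beta-bernoulli update behind these posterior plots; the Beta(2, 2) prior is an arbitrary choice (the slides don't specify one), and the longer run of flips above can be substituted for the short one:

```python
# conjugate beta-bernoulli update: posterior over the coin bias theta
from scipy.stats import beta

a, b = 2.0, 2.0                   # prior pseudo-counts, Beta(2, 2): an assumption
flips = "HTHHHTTHHHH"             # the short sequence from the earlier slide
heads = flips.count("H")
tails = flips.count("T")

posterior = beta(a + heads, b + tails)   # p(theta | D) is again a beta distribution
print(posterior.mean())                  # posterior mean of the bias, ~0.667 here
print(posterior.interval(0.95))          # central 95% credible interval
```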

naive bayes for document classification

model the presence/absence of each word as an independent coin flip:
$$p(\text{word} \mid \text{class}) = \text{Bernoulli}(\theta_{wc}) \tag{10}$$
$$p(\text{words} \mid \text{class}) = p(\text{word}_1 \mid \text{class})\, p(\text{word}_2 \mid \text{class}) \cdots \tag{11}$$

maximum likelihood estimates of the probabilities from word and class counts:
$$\hat{\theta}_{wc} = \frac{N_{wc}}{N_c} \tag{12}$$

use bayes' rule to calculate a distribution over classes given words:
$$p(\text{class} \mid \text{words}, \Theta) = \frac{p(\text{words} \mid \text{class}, \Theta)\, p(\text{class} \mid \Theta)}{p(\text{words} \mid \Theta)} \tag{13}$$


naive bayes for document classification

example: spam filtering for enron email using one word (code: http://github.com/jhofman/ddm/blob/master/2009/lecture_03/enron_naive_bayes.sh)

guard against overfitting by smoothing the counts, which is equivalent to maximum a posteriori (map) inference:
$$\hat{\theta}_{wc} = \frac{N_{wc} + \alpha}{N_c + \alpha + \beta} \tag{14}$$
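a toy python sketch of the smoothed estimate in eq. (14) and the class posterior in eq. (13) for a single word; the counts and the word "meeting" are invented for illustration, and this is not the shell script linked above:

```python
# smoothed estimate of p(word present | class), eq. (14), then p(class | word)
alpha, beta_ = 1.0, 1.0              # smoothing pseudo-counts

def theta_hat(n_wc, n_c):
    """smoothed estimate of p(word present | class), eq. (14)"""
    return (n_wc + alpha) / (n_c + alpha + beta_)

# invented counts: (documents in class containing "meeting", documents in class)
counts = {"spam": (30, 1500), "ham": (900, 3000)}
n_total = sum(n_c for _, n_c in counts.values())
p_class = {c: n_c / n_total for c, (_, n_c) in counts.items()}  # class priors

# bayes' rule: p(class | "meeting" present)
evidence = sum(theta_hat(*counts[c]) * p_class[c] for c in counts)
for c in counts:
    print(c, round(theta_hat(*counts[c]) * p_class[c] / evidence, 3))
```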

naive bayes for document classification

naive bayes for document classification is neither naive nor bayesian! not so naive: it works well in practice, not in theory. not bayesian: it uses point estimates $\hat{\theta}_{wc}$ for the parameters rather than distributions over parameters

quantities of interest

bayesian inference maintains full posterior distributions over the unknowns; many quantities of interest require expectations under these posteriors, e.g. the posterior mean and the predictive distribution:
$$\bar{\Theta} = E_{p(\Theta \mid D)}[\Theta] = \int d\Theta\ \Theta\ p(\Theta \mid D) \tag{15}$$
$$p(x \mid D) = E_{p(\Theta \mid D)}[p(x \mid \Theta, D)] = \int d\Theta\ p(x \mid \Theta, D)\ p(\Theta \mid D) \tag{16}$$

often we can't compute the posterior (normalization), let alone expectations with respect to it, which motivates approximation methods


representative samples

general approach: approximate intractable expectations via a sum over representative samples (follows mackay (2003)):
$$\Phi = E_{p(x)}[\phi(x)] = \int dx\ \underbrace{\phi(x)}_{\text{arbitrary function}}\ \underbrace{p(x)}_{\text{target density}} \tag{17}$$
$$\hat{\Phi} = \frac{1}{R} \sum_{r=1}^{R} \phi(x^{(r)}) \tag{18}$$

this shifts the problem to finding good samples
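a minimal sketch of eqs. (17)-(18); the target density and test function below are arbitrary choices for illustration:

```python
# monte carlo estimate of Phi = E_p[phi(x)], eqs. (17)-(18)
import numpy as np

rng = np.random.default_rng(0)
R = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=R)   # x^(r) drawn from p(x) = N(0, 1)

phi_hat = np.mean(x ** 2)                    # (1/R) * sum_r phi(x^(r)), with phi(x) = x^2
print(phi_hat)                               # close to the exact value E[x^2] = 1
```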

representative samples

further complication: in general we can only evaluate the target density up to a multiplicative (normalization) constant, i.e.
$$p(x) = \frac{p^*(x)}{Z} \tag{19}$$
where $p^*(x^{(r)})$ can be evaluated but $Z$ is unknown

sampling methods

monte carlo methods: uniform sampling, importance sampling, rejection sampling, ...
markov chain monte carlo (mcmc) methods: metropolis-hastings, gibbs sampling, ...
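as a concrete instance of one of the mcmc methods above, a minimal random-walk metropolis-hastings sketch; the two-bump target and the step size are arbitrary choices, and note that only the unnormalized $p^*(x)$ is needed, since the unknown $Z$ cancels in the acceptance ratio:

```python
# minimal random-walk metropolis-hastings: sample from an unnormalized target p*(x)
import numpy as np

def p_star(x):
    """unnormalized target density: a mixture of two bumps (arbitrary choice)"""
    return np.exp(-0.5 * (x - 3.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 3.0) ** 2)

rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)              # symmetric random-walk proposal
    accept_prob = min(1.0, p_star(proposal) / p_star(x))
    if rng.random() < accept_prob:
        x = proposal                                  # accept, otherwise keep current x
    samples.append(x)

samples = np.array(samples[10_000:])                  # discard burn-in
print(samples.mean())                                 # estimate of E_p[x] from the chain
```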


bound optimization

general approach: replace integration with optimization; construct an auxiliary function upper-bounded by the log-evidence and maximize the auxiliary function

(figure: the EM algorithm alternately computes a lower bound $\mathcal{L}(q, \theta)$ on the log likelihood $\ln p(X \mid \theta)$ at the current parameter values $\theta^{\text{old}}$ and then maximizes this bound to obtain the new values $\theta^{\text{new}}$; image from bishop (2006))

variational bayes

bound the log of an expected value by the expected value of the log using jensen's inequality:
$$\ln p(D) = \ln \int d\Theta\ p(D \mid \Theta)\, p(\Theta) = \ln \int d\Theta\ \frac{p(D \mid \Theta)\, p(\Theta)}{q(\Theta)}\, q(\Theta) \ \ge\ \int d\Theta\ q(\Theta) \ln\!\left[\frac{p(D \mid \Theta)\, p(\Theta)}{q(\Theta)}\right]$$

for a sufficiently simple (i.e. factorized) approximating distribution $q(\Theta)$, the right-hand side can be easily evaluated and optimized
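one step left implicit here: the gap between the log-evidence and this bound is exactly the kullback-leibler divergence from $q$ to the true posterior, so maximizing the bound over $q$ is the same as minimizing that divergence (see the next slide):
$$\ln p(D) - \int d\Theta\ q(\Theta) \ln\!\left[\frac{p(D \mid \Theta)\, p(\Theta)}{q(\Theta)}\right] = \int d\Theta\ q(\Theta) \ln\frac{q(\Theta)}{p(\Theta \mid D)} = \mathrm{KL}\big(q(\Theta)\,\|\,p(\Theta \mid D)\big) \ \ge\ 0$$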

variational bayes

the iterative coordinate ascent algorithm provides controlled analytic approximations to the posterior and the evidence; the approximate posterior $q(\Theta)$ minimizes the kullback-leibler divergence to the true posterior; the resulting deterministic algorithm is often fast and scalable

but: the complexity of the approximation is often limited (to, e.g., mean-field theory, assuming weak interactions between unknowns), the iterative algorithm requires restarts, and there are no guarantees on the quality of the approximation


references

- information theory, inference, and learning algorithms, mackay (2003)
- pattern recognition and machine learning, bishop (2006)
- bayesian data analysis, gelman et al. (2003)
- probabilistic inference using markov chain monte carlo methods, neal (1993)
- graphical models, exponential families, and variational inference, wainwright & jordan (2006)
- probability theory: the logic of science, jaynes (2003)
- what is..., wiggins (2006)
- view on cran
- variational-bayes.org
- variational bayes for network modularity


example: a bayesian approach to network modularity inferred topological communities correspond to sub-disciplines

Thanks. Questions? (jake@jakehofman.com, @jakehofman)