Probabilistic and Bayesian Machine Learning
Lecture 1: Introduction to Probabilistic Modelling
Yee Whye Teh (ywteh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, University College London

Why a probabilistic approach?

Many machine learning problems can be expressed as latent variable problems: given some data, the solution can be obtained by inferring the values of unobserved, latent variables. There is much uncertainty in the world: noise in observations, intrinsic stochasticity, effects that are complex, unknown, and/or not understood, and our own state of belief being uncertain. Probability theory gives a coherent and simple way to reason about uncertainty. Probabilistic modelling gives a powerful language to express our knowledge about the world, and a powerful computational framework for inference and learning about the world.

Basic Rules of Probability

Probabilities are non-negative: P(x) ≥ 0 for all x.
Probabilities normalise: ∑_{x∈X} P(x) = 1 for distributions over a discrete variable x, and ∫ p(x) dx = 1 for probability densities over a continuous variable x.
The joint probability of x and y is P(x, y).
The marginal probability of x is P(x) = ∑_y P(x, y), assuming y is discrete.
The conditional probability of x given y is P(x|y) = P(x, y) / P(y).
Independent random variables: X ⊥ Y means P(x, y) = P(x) P(y).
Conditional independence: X ⊥ Y | Z (X conditionally independent of Y given Z) means P(x, y|z) = P(x|z) P(y|z) and P(x|y, z) = P(x|z).
Bayes Rule: P(x, y) = P(x) P(y|x) = P(y) P(x|y), hence P(y|x) = P(x|y) P(y) / P(x).
Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution. It should be obvious from context.

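As a quick illustration (my own sketch, not part of the lecture), these rules can be checked numerically on a small joint distribution; the numbers in P_xy below are arbitrary:

```python
import numpy as np

# Hypothetical joint over x in {0, 1} (rows) and y in {0, 1, 2} (columns).
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert (P_xy >= 0).all()                 # probabilities are non-negative
assert np.isclose(P_xy.sum(), 1.0)       # probabilities normalise

P_x = P_xy.sum(axis=1)                   # marginal: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)                   # marginal: P(y) = sum_x P(x, y)
P_x_given_y = P_xy / P_y                 # conditional: P(x|y) = P(x, y) / P(y)
P_y_given_x = P_xy / P_x[:, None]        # conditional: P(y|x) = P(x, y) / P(x)

# Bayes rule: P(y|x) = P(x|y) P(y) / P(x), checked entrywise.
assert np.allclose(P_x_given_y * P_y / P_x[:, None], P_y_given_x)
```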

Probability, Information and Entropy

Information is the reduction of uncertainty. How do we measure uncertainty? Some axioms (informal): if something is certain, its uncertainty is 0; uncertainty should be maximal if all choices are equally probable; and uncertainty (information) should add for independent sources. This leads to a discrete random variable X having uncertainty equal to the entropy function:
H(X) = − ∑_x P(X = x) log P(X = x)
measured in bits (binary digits) if the base-2 logarithm is used, or nats (natural digits) if the natural (base-e) logarithm is used.


Probability, Information and Entropy

Surprise (for event X = x): − log P(X = x).
Entropy = average surprise: H(X) = − ∑_x P(X = x) log P(X = x).
Conditional entropy: H(X|Y) = − ∑_x ∑_y P(x, y) log P(x|y).
Mutual information: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y).
Kullback-Leibler divergence (relative entropy): KL(P(X) || Q(X)) = ∑_x P(x) log [P(x) / Q(x)].
Relation between mutual information and KL: I(X; Y) = KL(P(X, Y) || P(X) P(Y)).

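These identities are easy to verify numerically; the sketch below (my own, reusing the arbitrary joint from the earlier example) checks the chain rule for entropy, the equivalent forms of the mutual information, and its expression as a KL divergence:

```python
import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

def H(p):
    """Entropy in bits: H = -sum_i p_i log2 p_i (0 log 0 taken as 0)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_X, H_Y, H_XY = H(P_x), H(P_y), H(P_xy.ravel())
H_X_given_Y = H_XY - H_Y                 # chain rule: H(X|Y) = H(X, Y) - H(Y)

I_XY = H_X - H_X_given_Y                 # mutual information, first form
assert np.isclose(I_XY, H_X + H_Y - H_XY)

# I(X; Y) = KL(P(X, Y) || P(X) P(Y))
P_indep = np.outer(P_x, P_y)
assert np.isclose(I_XY, (P_xy * np.log2(P_xy / P_indep)).sum())
```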

Probabilistic Models and Inference

Describe the world using probabilistic models P(X, Y | θ), where X are observations, measurements, sensory input about the world; Y are unobserved variables; and θ are the parameters of our model.
Inference: given X = x, apply Bayes Rule to compute the posterior distribution over unobserved variables of interest Y:
P(Y | x, θ) = P(x, Y | θ) / P(x | θ)
The posterior can be summarised in several ways:
Maximum a posteriori: y_MAP = argmax_y P(y | x, θ).
Mean: y_mean = E_{P(Y | x, θ)}[Y].
Minimise loss: y_Loss = argmin_y E_{P(Y | x, θ)}[Loss(Y, y)].
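A small sketch of these three readouts (my own illustration; the table of joint values and the loss function are arbitrary assumptions):

```python
import numpy as np

# Unnormalised joint P(x, Y = y | theta) for one observed x (made-up values).
P_xY = np.array([0.05, 0.15, 0.30, 0.10])
posterior = P_xY / P_xY.sum()            # P(Y|x, theta) = P(x, Y|theta) / P(x|theta)

y_values = np.array([0.0, 1.0, 2.0, 3.0])
y_map = y_values[np.argmax(posterior)]   # maximum a posteriori value of Y
y_mean = (posterior * y_values).sum()    # posterior mean E[Y | x, theta]

# Minimise expected loss, here an (assumed) absolute loss Loss(Y, y) = |Y - y|.
expected_loss = [(posterior * np.abs(y_values - y)).sum() for y in y_values]
y_loss = y_values[int(np.argmin(expected_loss))]
```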

Probabilistic Models and Learning

It is typically relatively easy to specify by hand high-level structural information about P(X, Y | θ) from prior knowledge. It is much harder to specify the exact parameters θ describing the joint distribution.
(Unsupervised) Learning: given examples x1, x2, ..., find the parameters θ that best explain the examples, typically by maximum likelihood:
θ_ML = argmax_θ P(x1, x2, ... | θ)
Alternatives: maximum a posteriori learning, Bayesian learning.
(Supervised) Learning: given y1, y2, ... as well.
Often models will come in a series M1, M2, ... of increasing complexity, each giving a joint distribution P(X, Y | θ_i, M_i). We would like to select one of the right size to avoid over-fitting or under-fitting.
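For concreteness, a minimal maximum-likelihood sketch (my own, assuming a Bernoulli coin model and made-up data; in this simple case the grid maximiser is just the sample mean):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])         # hypothetical binary observations
thetas = np.linspace(0.001, 0.999, 999)  # grid over the parameter

# Log likelihood of iid Bernoulli data: sum_i log P(x_i | theta).
log_lik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)
theta_ml = thetas[np.argmax(log_lik)]    # close to x.mean() = 2/3 here
```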

Speech Recognition: Y = word sequence, X = acoustic signal; P(Y|θ) = language model, P(X|Y, θ) = acoustic model.

Machine Translation: Y = sentence, X = foreign sentence; P(Y|θ) = language model, P(X|Y, θ) = translation model.

Image Denoising: Y = original image, X = noisy image; P(Y|θ) = image model, P(X|Y, θ) = noise model. Also: deblurring, inpainting, superresolution...

Simultaneous Localization and Mapping: Y = map & location, X = sensor readings; P(Y|θ) = map prior, P(X|Y, θ) = observation model.

Collaborative Filtering: Y = user preferences & item features, X = ratings & sales records.

Spam Detection: Y = spam or not?, X = text, From:, To:, etc.

Genome Wide Association Studies: Y = associations, X = DNA, phenotypes, diseases.

Bayesian theory for representing beliefs

Goal: to represent the beliefs of learning agents.
The Cox Axioms lead to the following: if plausibilities/beliefs are represented by real numbers, then the only reasonable and consistent way to manipulate them is Bayes rule.
The Dutch Book Theorem: if you are willing to bet on your beliefs, then unless they satisfy Bayes rule there will always be a set of bets (a "Dutch book") that you would accept, and which is guaranteed to lose you money, no matter what the outcome!
Frequency vs. belief interpretation of probabilities.

Cox Axioms

Consider a robot. In order to behave intelligently, the robot should be able to represent beliefs about propositions in the world: "my charging station is at location (x, y, z)", "my rangefinder is malfunctioning", "that stormtrooper is hostile". We want to represent the strength of these beliefs numerically in the brain of the robot, and we want to know what rules (calculus) we should use to manipulate those beliefs.

Cox Axioms

Let's use b(x) to represent the strength of belief in (plausibility of) proposition x:
0 ≤ b(x) ≤ 1
b(x) = 0: x is definitely not true
b(x) = 1: x is definitely true
b(x|y): strength of belief that x is true, given that we know y is true
Cox Axioms (desiderata):
Strengths of belief (degrees of plausibility) are represented by real numbers.
Qualitative correspondence with common sense.
Consistency: if a conclusion can be reasoned in more than one way, then every way should lead to the same answer; the robot always takes into account all relevant evidence; equivalent states of knowledge are represented by equivalent plausibility assignments.
Consequence: belief functions (e.g. b(x), b(x|y), b(x, y)) must satisfy the rules of probability theory, including Bayes rule. (See Jaynes, Probability Theory: The Logic of Science.)

Dutch Book Theorem

Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet on x at 1:9 (win 1 if x is true, lose 9 if x is false). Then, unless your beliefs satisfy the rules of probability theory, including Bayes rule, there exists a set of simultaneous bets (called a "Dutch book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.
E.g. suppose A and B are mutually exclusive, and your beliefs are b(A) = 0.3, b(B) = 0.2, b(A ∨ B) = 0.6 (incoherent, since coherence would require 0.5). You then accept the bets:
¬A at 3:7, ¬B at 2:8, A ∨ B at 4:6
But then:
A ∧ ¬B: win −7 + 2 + 4 = −1
¬A ∧ B: win +3 − 8 + 4 = −1
¬A ∧ ¬B: win +3 + 2 − 6 = −1
The only way to guard against Dutch books is to ensure that your beliefs are coherent, i.e. satisfy the rules of probability.

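The arithmetic of this Dutch book is easy to check mechanically; the sketch below (my own) enumerates the three possible worlds and confirms the guaranteed loss:

```python
def payoff(A, B):
    """Total winnings for the three accepted bets in a given world."""
    total = 3 if not A else -7           # bet on not-A at 3:7
    total += 2 if not B else -8          # bet on not-B at 2:8
    total += 4 if (A or B) else -6       # bet on A-or-B at 4:6
    return total

# A and B are mutually exclusive, so only three worlds are possible.
for A, B in [(True, False), (False, True), (False, False)]:
    assert payoff(A, B) == -1            # a loss of 1 in every case
```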

Bayesian Learning

Apply the basic rules of probability to learning from data.
Problem specification. Data: D = {x1, ..., xn}. Models: M1, M2, etc. Parameters: θ_i (per model). Prior probability of models: P(M_i). Prior probabilities of model parameters: P(θ_i | M_i). Model of data given parameters (likelihood model): P(X, Y | θ_i, M_i).
Data probability (likelihood):
P(D | θ_i, M_i) = ∏_{j=1}^n ∑_{y_j} P(x_j, y_j | θ_i, M_i) ≡ L_i(θ_i)
provided the data are independently and identically distributed (iid).
Parameter learning (posterior):
P(θ_i | D, M_i) = P(D | θ_i, M_i) P(θ_i | M_i) / P(D | M_i), where P(D | M_i) = ∫ dθ_i P(D | θ_i, M_i) P(θ_i | M_i)
P(D | M_i) is called the marginal likelihood or evidence for M_i. It is proportional to the posterior probability of model M_i being the one that generated the data.
Model selection:
P(M_i | D) = P(D | M_i) P(M_i) / P(D)

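A grid-based sketch of parameter learning and the evidence (my own illustration, assuming a coin-flip likelihood with a flat prior and made-up data):

```python
import numpy as np

D = np.array([1, 1, 0, 1, 0, 0])         # hypothetical coin tosses (1 = heads)
thetas = np.linspace(0, 1, 1001)
prior = np.ones_like(thetas)             # flat prior P(theta | M) on [0, 1]
likelihood = thetas ** D.sum() * (1 - thetas) ** (len(D) - D.sum())

# Marginal likelihood P(D | M) = integral of P(D|theta, M) P(theta|M) dtheta,
# approximated by the grid average since the interval has length 1.
evidence = (likelihood * prior).mean()
assert np.isclose(evidence, 1 / 140, atol=1e-4)   # analytic value: B(4, 4)

posterior = likelihood * prior / evidence         # P(theta | D, M)
```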

Bayesian Learning: A coin toss example

Coin toss: one parameter q, the probability of obtaining heads. So our space of models is the set of distributions over q ∈ [0, 1].
Learner A believes model M_A: all values of q are equally plausible.
Learner B believes model M_B: it is more plausible that the coin is fair (q ≈ 0.5) than biased.
A: α1 = α2 = 1.0. B: α1 = α2 = 4.0.
[Figure: the two prior densities p(q) for learners A and B.]
Both prior beliefs can be described by the Beta distribution:
p(q | α1, α2) = q^(α1−1) (1 − q)^(α2−1) / B(α1, α2) = Beta(q | α1, α2)
where B is the beta function, which normalises the distribution:
B(α1, α2) = ∫₀¹ t^(α1−1) (1 − t)^(α2−1) dt = Γ(α1) Γ(α2) / Γ(α1 + α2)

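These two priors are available directly in scipy (a small sketch; the densities should match the figure):

```python
from scipy.stats import beta

prior_A = beta(1.0, 1.0)                 # learner A: flat over q
prior_B = beta(4.0, 4.0)                 # learner B: peaked around q = 0.5

# B assigns more prior density near a fair coin than A does.
assert prior_B.pdf(0.5) > prior_A.pdf(0.5)
```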

Bayesian Learning: A coin toss example

Now we observe a toss. There are two possible outcomes:
p(H | q) = q, p(T | q) = 1 − q
Suppose our single coin toss comes out heads. The probability of the observed data (the likelihood) is p(H | q) = q. Using Bayes Rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:
p(q | H) = p(q) p(H | q) / p(H)
∝ q Beta(q | α1, α2)
∝ q · q^(α1−1) (1 − q)^(α2−1) = q^α1 (1 − q)^(α2−1) ∝ Beta(q | α1 + 1, α2)

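The update can be checked numerically (my own sketch): multiplying the Beta prior by the likelihood q and renormalising on a grid should reproduce the analytic Beta(q | α1 + 1, α2) density.

```python
import numpy as np
from scipy.stats import beta

a1, a2 = 4.0, 4.0                        # learner B's prior hyperparameters
q = np.linspace(0.001, 0.999, 999)       # grid with step 0.001

unnorm = beta(a1, a2).pdf(q) * q         # prior times likelihood p(H|q) = q
posterior = unnorm / np.sum(unnorm * 0.001)   # renormalise numerically

# Matches the analytic posterior Beta(q | a1 + 1, a2).
assert np.allclose(posterior, beta(a1 + 1, a2).pdf(q), atol=1e-2)
```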

Bayesian Learning: A coin toss example

[Figure: priors and posteriors for the two learners after observing a single head.]
Prior: A: Beta(q | 1, 1); B: Beta(q | 4, 4).
Posterior: A: Beta(q | 2, 1); B: Beta(q | 5, 4).

Bayesian Learning: A coin toss example

What about multiple tosses? Suppose we observe D = {H H T H T T}:
p({H H T H T T} | q) = q q (1 − q) q (1 − q)(1 − q) = q³ (1 − q)³
This is still straightforward:
p(q | D) = p(q) p(D | q) / p(D) ∝ q³ (1 − q)³ Beta(q | α1, α2) ∝ Beta(q | α1 + 3, α2 + 3)
[Figure: the resulting posteriors for learners A and B.]

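In other words, for iid tosses the whole computation reduces to counting heads and tails; a minimal sketch of this bookkeeping, using the data above:

```python
tosses = "HHTHTT"                        # the observed sequence D
n_heads, n_tails = tosses.count("H"), tosses.count("T")

a1, a2 = 1.0, 1.0                        # learner A's flat Beta prior
post_a1, post_a2 = a1 + n_heads, a2 + n_tails   # posterior: Beta(q | 4, 4)

# Posterior mean estimate of q: here (1 + 3) / (1 + 3 + 1 + 3) = 0.5.
q_mean = post_a1 / (post_a1 + post_a2)
```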

Conjugate Priors

Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential family likelihood. Exponential family distributions take the form:
P(x | θ) = g(θ) f(x) e^(φ(θ)ᵀ T(x))
with g(θ) the normalising constant. Given n iid observations,
P({x_i} | θ) = ∏_i P(x_i | θ) = g(θ)ⁿ e^(φ(θ)ᵀ ∑_i T(x_i)) ∏_i f(x_i)
Thus, if the prior takes the conjugate form
P(θ) = F(τ, ν) g(θ)^ν e^(φ(θ)ᵀ τ)
with F(τ, ν) the normaliser, then the posterior is
P(θ | {x_i}) ∝ P({x_i} | θ) P(θ) ∝ g(θ)^(ν+n) e^(φ(θ)ᵀ (τ + ∑_i T(x_i)))
with the normaliser given by F(τ + ∑_i T(x_i), ν + n).

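So a conjugate update never changes the functional form at all; it only shifts the hyperparameters. A generic sketch (my own; conjugate_update is a hypothetical helper, shown for a Bernoulli likelihood where T(x) = x):

```python
import numpy as np

def conjugate_update(tau, nu, sufficient_stats):
    """Posterior hyperparameters: (tau + sum_i T(x_i), nu + n)."""
    T = np.asarray(sufficient_stats, dtype=float)
    return tau + T.sum(axis=0), nu + len(T)

tau, nu = 0.0, 0.0                       # assumed prior pseudo-observations
x = [1, 0, 1, 1]                         # for a Bernoulli coin, T(x) = x
tau_post, nu_post = conjugate_update(tau, nu, x)   # (3.0, 4)
```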

Conjugate Priors

The posterior given an exponential family likelihood and conjugate prior is:
P(θ | {x_i}) = F(τ + ∑_i T(x_i), ν + n) g(θ)^(ν+n) exp[φ(θ)ᵀ (τ + ∑_i T(x_i))]
Here,
φ(θ) is the vector of natural parameters;
∑_i T(x_i) is the vector of sufficient statistics;
τ are pseudo-observations which define the prior;
ν is the scale of the prior (it need not be an integer).
As new data come in, each observation increments the sufficient statistics vector and the scale to define the posterior.

Conjugacy in the coin toss example

Distributions are not always written in their natural exponential form. The Bernoulli distribution (a single coin flip) with parameter q and observation x ∈ {0, 1} can be written:
P(x | q) = q^x (1 − q)^(1−x)
= e^(x log q + (1−x) log(1−q))
= e^(log(1−q) + x log(q/(1−q)))
= (1 − q) e^(log(q/(1−q)) x)
So the natural parameter is the log odds, log(q/(1−q)), and the sufficient statistic (for multiple tosses) is the number of heads. The conjugate prior is
P(q) = F(τ, ν) (1 − q)^ν e^(log(q/(1−q)) τ)
= F(τ, ν) (1 − q)^ν e^(τ log q − τ log(1−q))
= F(τ, ν) q^τ (1 − q)^(ν−τ)
which has the form of the Beta distribution, with F(τ, ν) = 1/B(τ + 1, ν − τ + 1).
In general, then, the posterior will be P(q | {x_i}) ∝ q^(α1−1) (1 − q)^(α2−1) = Beta(q | α1, α2), with
α1 = 1 + τ + ∑_i x_i, α2 = 1 + (ν + n) − (τ + ∑_i x_i)
If we observe a head, we add 1 to the sufficient statistic ∑_i x_i, and also 1 to the count n; this increments α1. If we observe a tail, we add 1 to n but not to ∑_i x_i, incrementing α2.

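A quick numerical check of this natural-parameter form (my own sketch): the rewritten density should agree with the standard Bernoulli density at both outcomes.

```python
import numpy as np

q = 0.3
phi = np.log(q / (1 - q))                # natural parameter: the log odds

for x in (0, 1):
    standard = q ** x * (1 - q) ** (1 - x)
    natural = (1 - q) * np.exp(phi * x)  # (1 - q) e^{x log(q/(1 - q))}
    assert np.isclose(standard, natural)
```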

Model selection in coin toss example

We have seen how to update posteriors within each model. To study the choice of model, consider two more extreme models: "fair" and "bent". A priori, we may think that fair is more probable, e.g.:
p(fair) = 0.8, p(bent) = 0.2
For the bent coin, we assume all parameter values are equally likely, whilst the fair coin has a fixed probability q = 0.5:
[Figure: p(q | bent) is flat over [0, 1]; p(q | fair) concentrates at q = 0.5.]
We make 10 tosses, and get D = (T H T H T T T T T T).


Model selection in coin toss example

Which model should we prefer a posteriori (i.e. after seeing the data)? The evidence for the fair model is:
P(D | fair) = (1/2)¹⁰ ≈ 0.001
and for the bent model:
P(D | bent) = ∫ dq P(D | q, bent) p(q | bent) = ∫ dq q² (1 − q)⁸ = B(3, 9) ≈ 0.002
Thus the posterior for the models, by Bayes rule, is
P(fair | D) ∝ 0.0008, P(bent | D) ∝ 0.0004
i.e. a two-thirds probability that the coin is fair.
How do we make predictions? We could choose the fair model (model selection), or we could weight the predictions from each model by their probability (model averaging). The probability of H at the next toss is then:
P(H | D) = P(H | fair) P(fair | D) + P(H | bent) P(bent | D) = (1/2)(2/3) + (3/12)(1/3) = 5/12

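These numbers are easy to reproduce (a sketch of the computation; D contains 2 heads and 8 tails):

```python
from scipy.special import beta as beta_fn

p_fair, p_bent = 0.8, 0.2                # prior over the two models
lik_fair = 0.5 ** 10                     # P(D | fair), about 0.001
lik_bent = beta_fn(3, 9)                 # P(D | bent) = B(3, 9), about 0.002

# Unnormalised posteriors: about 0.0008 and 0.0004.
post_fair, post_bent = lik_fair * p_fair, lik_bent * p_bent
Z = post_fair + post_bent
post_fair, post_bent = post_fair / Z, post_bent / Z   # roughly 2/3 and 1/3

# Model averaging: P(H|D) = (1/2) P(fair|D) + (3/12) P(bent|D), about 5/12.
p_heads = 0.5 * post_fair + (3 / 12) * post_bent
```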

Bayesian Learning

If an agent is learning parameters, it could report different aspects of the posterior (or likelihood).
Bayesian Learning: assumes a prior over the model parameters, and computes the posterior distribution of the parameters, P(θ | D).
Maximum a Posteriori (MAP) Learning: assumes a prior over the model parameters, P(θ), and finds a parameter setting that maximises the posterior: P(θ | D) ∝ P(θ) P(D | θ).
Maximum Likelihood (ML) Learning: does not assume a prior over the model parameters; finds a parameter setting that maximises the likelihood function, P(D | θ).
Choosing between these and other alternatives may be a matter of definition, of goals, or of practicality. In practice (outside the exponential family), the Bayesian ideal may be computationally challenging, and may at best need to be approximated.
We will return to the Bayesian formulation on Thursday and Friday. On Tuesday and Wednesday we will look at ML and MAP learning in more complex graphical models.
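To make the contrast concrete, here is a small sketch (my own, assuming the Beta-Bernoulli coin setting from earlier with made-up counts) of the three point summaries; the full Bayesian answer would keep the entire Beta posterior rather than any single number:

```python
n_heads, n_tails = 3, 3                  # assumed coin-toss counts
a1, a2 = 4.0, 4.0                        # assumed Beta prior hyperparameters

theta_ml = n_heads / (n_heads + n_tails)                            # maximises P(D | theta)
theta_map = (a1 + n_heads - 1) / (a1 + a2 + n_heads + n_tails - 2)  # posterior mode
theta_mean = (a1 + n_heads) / (a1 + a2 + n_heads + n_tails)         # posterior mean
# All three coincide at 0.5 here; they differ whenever the data are skewed.
```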