Lecture 1: Bayesian Framework Basics
Melih Kandemir
melih.kandemir@iwr.uni-heidelberg.de
April 21, 2014
What is this course about?
- Building Bayesian machine learning models
- Performing inference in these models
- Evaluating Bayesian solutions
- Directed graphical models
What is it NOT about?
- Basic machine learning: no SVMs, no neural networks.
- Basic probability and statistics in detail.
- Advanced probability and statistics.
- Advanced Bayesian theory.
- Undirected graphical models: no MRFs, no CRFs.
Useful text
The term project
- Choose a data set from the provided set (or offer your own).
- Devise your model (draw its plate diagram) that solves the related problem.
- Build the inference algorithm (i.e. choose the inference method and derive the necessary equations).
- Implement your model.
- Evaluate your model's success.
- Interpret your results in a report of approximately 4 pages.
Definitions
- Sample space (Ω): a collection of all possible outcomes of a random experiment.
- Event (E): a question about the experiment with a yes/no answer; a subset of the sample space.
- Probability measure: a function that assigns a number P(A) to each event A.
Axioms of probability
- Axiom 1: The probability of an event is a non-negative real number: P(E) ∈ ℝ, P(E) ≥ 0 for every E ⊆ Ω.
- Axiom 2: The probability of the entire sample space is 1: P(Ω) = 1.
- Axiom 3: P(E₁ ∪ E₂) = P(E₁) + P(E₂) whenever E₁ ∩ E₂ = ∅.
Consequences
- Sum rule: P(E₁ ∪ E₂) = P(E₁) + P(E₂) − P(E₁ ∩ E₂)
- P(∅) = 0
- All of set theory is applicable.
- Most of Boolean algebra is applicable.
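These consequences can be checked mechanically on a finite sample space. A minimal sketch, using a fair die as the experiment (the events `E1` and `E2` are illustrative choices, not from the slides):

```python
from fractions import Fraction

# Sample space: outcomes of one fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability measure: uniform over the sample space."""
    return Fraction(len(event & omega), len(omega))

E1 = {1, 2, 3}   # "roll at most 3"
E2 = {2, 4, 6}   # "roll an even number"

# Sum rule: P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
assert P(E1 | E2) == P(E1) + P(E2) - P(E1 & E2)

# Axiom 3 (additivity) for disjoint events, and P(∅) = 0
assert P({1} | {2}) == P({1}) + P({2})
assert P(set()) == 0
```

Using exact `Fraction` arithmetic avoids floating-point noise, so the identities hold exactly rather than approximately.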
Conditional probability
Kolmogorov's definition: P(A | B) = P(A ∩ B) / P(B), a.k.a. the product rule (rearranged: P(A ∩ B) = P(A | B) P(B)). De Finetti introduces this formulation as an axiom. Consider the following example¹:
¹ http://blacklen.wordpress.com/2011/02/02/introduction-to-bayesianconditional-probability/
Definitions (2)
- Probability density function (PDF): Pr[a ≤ x ≤ b] = ∫ₐᵇ p_x(x) dx
- Cumulative distribution function (CDF): F_x(x) = ∫₋∞ˣ p_x(t) dt
- PDF–CDF relationship: Pr[a ≤ x ≤ b] = F_x(b) − F_x(a)
Definitions (3)
- Expected value: E_p(x)[x] = ∫ x p_x(x) dx
- Variance: Var_p(x)[x] = E_p(x)[(x − E_p(x)[x])²] = E_p(x)[x²] − (E_p(x)[x])²
- Standard deviation: σ(x) = √(Var_p(x)[x])
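The two expressions for the variance can be verified on a small discrete distribution. A sketch using a fair die (an illustrative example, not from the slides):

```python
from fractions import Fraction

# A fair die: p(x) = 1/6 for x in 1..6.
support = range(1, 7)
p = {x: Fraction(1, 6) for x in support}

E_x  = sum(x * p[x] for x in support)        # E[x]
E_x2 = sum(x * x * p[x] for x in support)    # E[x^2]

# Variance two ways: E[(x − E[x])^2] and E[x^2] − (E[x])^2
var_def      = sum((x - E_x) ** 2 * p[x] for x in support)
var_shortcut = E_x2 - E_x ** 2

assert var_def == var_shortcut == Fraction(35, 12)
```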
Definitions (4)
- Joint PDF: Pr[a ≤ x ≤ b, c ≤ y ≤ d] = ∫ₐᵇ ∫꜀ᵈ p_xy(x, y) dy dx
- Covariance: cov(x, y) = E[(x − E[x])(y − E[y])]
- Marginal probability (sum rule): p(x) = ∫ p(x, y) dy
Normal distribution
- PDF: N(x | µ, σ²) = (1 / (σ√(2π))) exp(−(x − µ)² / (2σ²))
- CDF: (1/2) [1 + erf((x − µ) / √(2σ²))], where erf(x) = (1/√π) ∫₋ₓˣ e^(−t²) dt.
- Mean: µ
- Variance: σ²
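Both formulas are easy to implement and cross-check: the CDF written with `math.erf` should agree with a numerical integral of the PDF. A minimal sketch:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # N(x | mu, sigma^2) as on the slide.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF via the error function, as on the slide.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Sanity checks: the CDF at the mean is 1/2, and numerically
# integrating the PDF over [a, b] matches F(b) − F(a).
assert abs(normal_cdf(0.0) - 0.5) < 1e-12

a, b, n = -1.0, 1.0, 10_000
h = (b - a) / n
integral = sum(normal_pdf(a + (i + 0.5) * h) * h for i in range(n))
assert abs(integral - (normal_cdf(b) - normal_cdf(a))) < 1e-6
```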
Normal distribution (2)²
[Figure: PDF, CDF, and standard-deviation bands of the normal distribution.]
² http://en.wikipedia.org/wiki/normal_distribution
Multivariate normal distribution
- PDF: N(x | µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
- CDF: no closed form.
- Mean: µ
- Covariance: Σ
Multivariate normal distribution (2)³
[Figure: density of a bivariate normal distribution.]
³ http://en.wikipedia.org/wiki/multivariate_normal_distribution
Central limit theorem
Let x₁, x₂, …, x_N be N independent random variables with means µ₁, µ₂, …, µ_N and standard deviations σ₁, σ₂, …, σ_N. Then the variate
X_NORM = (∑ᵢ₌₁ᴺ xᵢ − ∑ᵢ₌₁ᴺ µᵢ) / √(∑ᵢ₌₁ᴺ σᵢ²)
has a limiting CDF which approaches a normal distribution.
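The theorem can be seen empirically: standardized sums of i.i.d. uniform draws behave like a standard normal. A simulation sketch (sample sizes and tolerances are illustrative choices):

```python
import random
import statistics

random.seed(0)

# Sum N i.i.d. Uniform(0, 1) draws: the sum has mean N/2 and variance N/12.
N, trials = 30, 20_000
mu_sum = N / 2
sigma_sum = (N / 12) ** 0.5

standardized = [
    (sum(random.random() for _ in range(N)) - mu_sum) / sigma_sum
    for _ in range(trials)
]

# The standardized sums should look approximately standard normal.
assert abs(statistics.mean(standardized)) < 0.02
assert abs(statistics.stdev(standardized) - 1.0) < 0.02

# About 68% of the mass falls within one standard deviation.
frac = sum(abs(z) < 1 for z in standardized) / trials
assert abs(frac - 0.6827) < 0.01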
Bayes' Theorem
p(θ | x) = p(x | θ) p(θ) / p(x)
- x ∈ X is an observable in the sample space X.
- θ is the vector of model parameters. It is an index to a frequentist, and a random variable for a Bayesian.
- p(x | θ): likelihood
- p(θ): prior
- p(θ | x): posterior
- p(x): evidence
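A concrete discrete instance makes the four ingredients tangible. A sketch with hypothetical diagnostic-test numbers (the prevalence and error rates below are invented for illustration):

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
prior = Fraction(1, 100)           # p(θ): disease prevalence
likelihood = Fraction(95, 100)     # p(x | θ): positive test given disease
false_pos = Fraction(5, 100)       # p(x | ¬θ): positive test given no disease

# Evidence p(x) by the sum rule over both hypotheses.
evidence = likelihood * prior + false_pos * (1 - prior)

# Bayes' theorem: p(θ | x) = p(x | θ) p(θ) / p(x)
posterior = likelihood * prior / evidence

assert posterior == Fraction(19, 118)  # ≈ 0.161
```

Despite the 95% sensitivity, the posterior is only about 16% because the prior is small; this is exactly the prior–likelihood trade-off Bayes' theorem encodes.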
Independence and Conditional Independence
- Independence: P(E₁ ∩ E₂) = P(E₁) P(E₂)
- Conditional independence: P(E₁ ∩ E₂ | E₃) = P(E₁ | E₃) P(E₂ | E₃)
Independent and identically distributedness (i.i.d.)
Let x₁, x₂, …, x_N be N random variables corresponding to N observations of an experiment. They are defined to be independent and identically distributed (i.i.d.) random variables if:
- All random variables xᵢ have the same probability distribution.
- The random variables are mutually independent.
Exchangeability
The random variables (x₁, x₂, …, x_N) are exchangeable if for any permutation π the following equality holds:
p(x₁, x₂, …, x_N) = p(x_π(1), x_π(2), …, x_π(N)).
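An i.i.d. sequence is a simple example of an exchangeable one: the joint probability depends only on the values observed, not their order. A sketch with biased coin flips (the bias 0.3 and the sequence are illustrative):

```python
from itertools import permutations

def joint(xs, theta=0.3):
    # Joint probability of i.i.d. coin flips with bias theta:
    # p(x_1, ..., x_N) = prod_i p(x_i).
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else (1 - theta)
    return p

xs = (1, 0, 0, 1)

# Every permutation of the sequence has the same joint probability,
# so this joint distribution is exchangeable.
probs = {round(joint(perm), 12) for perm in permutations(xs)}
assert len(probs) == 1
```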
Frequentist and Bayesian views
Is probability subjective or objective?
- For frequentists, it is an objective measure: p(E) = (number of times event E occurs) / (number of trials).
- For Bayesians, it is a measure of how likely it is that event E occurs.
"The classical view, based on physical considerations of symmetry, in which one should be obliged to give the same probability to such symmetric cases. But which symmetry? And, in any case, why? The original sentence becomes meaningful if reversed: the symmetry is probabilistically significant, in someone's opinion, if it leads him to assign the probabilities to such events." — de Finetti, 1970/74, Preface, xi–xii
Motivation 1: De Finetti's Theorem
A sequence of random variables (x₁, x₂, …) is infinitely exchangeable iff, for any N,
p(x₁, x₂, …, x_N) = ∫ ∏ᵢ₌₁ᴺ p(xᵢ | θ) P(dθ).
Here, P(dθ) = p(θ) dθ if θ has a density.
Implications:
- Exchangeability can be checked from the right-hand side.
- There must exist a parameter θ!
- There must exist a likelihood p(x | θ)!
- There must exist a distribution P on θ.
These three components are prerequisites for the data to be conditionally independent!
Motivation 2: Statistical Decision Theory
- Loss function: l(θ, δ(x)), where δ(x) is a decision based on data x. It determines the penalty for deciding δ(x) if θ is the true parameter.
- e.g. squared loss: l(θ, δ(x)) = (θ − δ(x))².
- However, δ(x) does not have to be an estimate of θ.
Frequentist Risk
R(θ, δ) = E[l(θ, δ(x))] for a fixed θ, averaging over different x ∈ X.
How to decide which decision procedure is better:
- Admissibility: never dominated everywhere by another procedure. Not practical: a procedure rarely dominates another in real cases.
- Restricted classes of procedures: for instance, we can restrict ourselves to unbiased procedures (i.e. E_θ[θ̂] = θ). But many good procedures are biased, and some unbiased procedures are inadmissible.
- Minimax: choose the procedure with the smaller worst-case risk.
Motivation 3: Birnbaum's Principles
- Conditionality principle: if an experiment concerning inference about θ is chosen from a collection of possible experiments independently of θ, then any experiment not chosen is irrelevant to the inference.
- Likelihood principle: the relevant information in any inference about θ after x is observed is contained entirely in the likelihood function.
- Sufficiency principle: if two different observations x, y are such that T(x) = T(y) for a sufficient statistic T, then inference based on x and on y should be the same.
Bayesian decision theory
Posterior risk: ρ(π, δ(x)) = ∫ l(θ, δ(x)) p(θ | x) dθ, where p(θ | x) ∝ p(x | θ) π(θ).
The Bayes action δ*(x) for any fixed x is the decision δ(x) that minimizes the posterior risk.
Bayesian decision theory (2)
For example, let us calculate the posterior risk for l(θ, δ(x)) = (θ − δ(x))²:
ρ = ∫ (θ − δ(x))² p(θ | x) dθ = δ(x)² − 2δ(x) ∫ θ p(θ | x) dθ + ∫ θ² p(θ | x) dθ,
and setting
∂ρ/∂δ(x) = 2δ(x) − 2 ∫ θ p(θ | x) dθ = 0
gives the Bayes action
δ*(x) = ∫ θ p(θ | x) dθ,
which turns out to be the posterior mean! For l(θ, δ(x)) = |θ − δ(x)|, the optimal decision is to choose the posterior median.
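Both results can be checked numerically: minimize the posterior risk over a grid of decisions and compare the minimizer with the posterior mean and median. A sketch with a hypothetical discrete posterior:

```python
# Discrete posterior over a grid of θ values (hypothetical numbers).
thetas = [0.0, 1.0, 2.0, 3.0]
post   = [0.1, 0.2, 0.4, 0.3]   # p(θ | x), sums to 1

post_mean = sum(t * p for t, p in zip(thetas, post))  # 1.9

def posterior_risk(d, loss):
    # ρ(π, d) = Σ_θ l(θ, d) p(θ | x) for a discrete posterior.
    return sum(p * loss(t, d) for t, p in zip(thetas, post))

grid = [i / 1000 for i in range(3001)]

# Squared loss: the Bayes action is the posterior mean.
best_sq = min(grid, key=lambda d: posterior_risk(d, lambda t, d_: (t - d_) ** 2))
assert abs(best_sq - post_mean) < 1e-3

# Absolute loss: the Bayes action is the posterior median (θ = 2 here,
# since the cumulative posterior first reaches 0.5 at that point).
best_abs = min(grid, key=lambda d: posterior_risk(d, lambda t, d_: abs(t - d_)))
assert abs(best_abs - 2.0) < 1e-3
```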
Comparison Both approaches use loss functions. Frequentists integrate out X. Bayesians integrate out θ.
Posterior predictive distribution
Given a posterior p(θ | x) and a new observation x*, the posterior predictive distribution is
p(x* | x) = ∫ p(x* | θ) p(θ | x) dθ = E_p(θ|x)[p(x* | θ)].
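The Beta-Bernoulli model is the standard closed-form example: with a Beta(a, b) prior and h heads in n flips, the posterior predictive probability of the next head is (a + h)/(a + b + n). A sketch comparing that closed form with a direct numerical integration of the defining formula (the prior parameters and data counts are hypothetical):

```python
import math

# Beta(a, b) prior on θ; n Bernoulli trials with h heads (hypothetical data).
a, b, n, h = 2, 2, 10, 7

# Posterior is Beta(a + h, b + n − h); the predictive probability of the
# next observation being 1 is the posterior mean of θ: (a + h)/(a + b + n).
closed_form = (a + h) / (a + b + n)

# The same value via midpoint-rule integration of ∫ θ p(θ | x) dθ,
# where p(θ | x) is the Beta(a_post, b_post) density.
a_post, b_post = a + h, b + n - h
norm = math.gamma(a_post + b_post) / (math.gamma(a_post) * math.gamma(b_post))

m = 100_000
width = 1.0 / m
integral = sum(
    ((i + 0.5) * width) ** a_post
    * (1 - (i + 0.5) * width) ** (b_post - 1)
    * norm * width
    for i in range(m)
)
assert abs(integral - closed_form) < 1e-6
```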
Supervised learning
Given a set of observations x₁, x₂, …, x_N and the corresponding outcomes (labels) y₁, y₂, …, y_N, learn a function y = f(x).
A naive solution is linear regression⁴: y = wᵀx.
⁴ http://commons.wikimedia.org/wiki/file:linear-regression.svg
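In one dimension, the least-squares weights have a simple closed form: slope = cov(x, y)/var(x). A sketch on toy data generated near y = 2x + 1 (the data points are invented for illustration):

```python
# Ordinary least squares for 1-D linear regression with a bias term.
# Toy data roughly following y = 2x + 1 (hypothetical numbers).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form solution: slope = cov(x, y) / var(x).
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

assert abs(slope - 2.0) < 0.1
assert abs(intercept - 1.0) < 0.2
```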
Types of supervised learning
- Classification: y ∈ {a, b, c, …, k}
- Regression: y ∈ ℝ
- Semi-supervised learning: a (large) subset of the training set does not have labels.
- Active learning: the model asks for labels of the most informative observations.
- Structured output learning: y is a structure (e.g. a graph).
Unsupervised learning
Given a set of observations x₁, x₂, …, x_N, learn a model that does X. A commonplace choice of X is to infer groups of similar observations, called clusters. This problem is called clustering⁵.
⁵ http://en.wikipedia.org/wiki/cluster_analysis
Discriminative versus Generative models
- Joint model: p(x, y)
- Generative model: models p(y) and p(x | y), and obtains p(y | x) = p(y) p(x | y) / p(x) via Bayes' theorem.
- Discriminative model: deals directly with p(y | x).
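A minimal generative classifier illustrates the distinction: model p(y) and p(x | y), then recover p(y | x) by Bayes' theorem. A sketch with two 1-D Gaussian class-conditionals (all the numbers are hypothetical):

```python
from statistics import NormalDist

# Generative model (hypothetical parameters):
prior = {"a": 0.5, "b": 0.5}                 # p(y)
cond  = {"a": NormalDist(0.0, 1.0),          # p(x | y = a)
         "b": NormalDist(3.0, 1.0)}          # p(x | y = b)

def posterior(x):
    # p(y | x) = p(y) p(x | y) / p(x), with p(x) from the sum rule.
    joint = {y: prior[y] * cond[y].pdf(x) for y in prior}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# Halfway between the class means the posterior is split evenly;
# near a class mean it strongly favours that class.
assert abs(posterior(1.5)["a"] - 0.5) < 1e-12
assert posterior(0.0)["a"] > 0.9
```

A discriminative model would instead parameterize p(y | x) directly (e.g. logistic regression) without ever modeling p(x | y).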
Parametric and nonparametric models
- Parametric model: the structure of the training data is stored in a predetermined set of parameters. These parameters are sufficient for prediction; there is no need to store the training data.
- Nonparametric model: the number of parameters in the model grows with the training data size. The training data also has to be stored for prediction.