BAYESIAN MACHINE LEARNING: THEORETICAL FOUNDATIONS. Frederic Pennerath


1 BAYESIAN MACHINE LEARNING: THEORETICAL FOUNDATIONS

2 Overview
1. Define a model family as a type of joint distribution P(X) = P(X_1, …, X_m). Digression on Bayesian Networks.
2. Predict / estimate outputs from a learnt model.
3. Learn / estimate a model distribution from data (x_1^(i), …, x_m^(i)), 1 ≤ i ≤ n: likelihood and Bayesian inference, conjugate priors, Bayes estimator, MAP, MLE, MLE of discrete distributions.

3 DEFINING A MODEL

4 Discriminative Models
Produce output samples from inputs: model parameters θ, input X, model P(Y = y | X = x, θ), output Y.
Example: Y = c_0 + c_1 X_1 + c_2 X_2 + ε with ε ~ N(0, σ_ε²), so that
P(Y | X_1 = x_1, X_2 = x_2)(y) = (1 / (σ_ε √(2π))) exp( −(y − c_0 − c_1 x_1 − c_2 x_2)² / (2 σ_ε²) ),
with X = (X_1, X_2)^T and θ = (c_0, c_1, c_2, σ_ε)^T.

5 Generative Models
Produce full samples: model parameters θ, model P(Z = z | θ), output Z.
Example: P(X_1, X_2, Y)(x_1, x_2, y) = (1 / ((2π)^(3/2) σ_1 σ_2 σ_ε)) exp(−(x_1 − m_1)²/(2σ_1²)) exp(−(x_2 − m_2)²/(2σ_2²)) exp(−(y − c_0 − c_1 x_1 − c_2 x_2)²/(2σ_ε²)),
with Z = (X_1, X_2, Y)^T and θ = (m_1, σ_1, m_2, σ_2, c_0, c_1, c_2, σ_ε)^T.
Generative models subsume discriminative models: P(Y | X, θ) = P(Y, X | θ) / P(X | θ), with P(X | θ) = ∫ P(X, Y = y | θ) dy.
Generative models are the most common in Bayesian Machine Learning and are described by Bayesian Networks.
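
To make the distinction concrete, here is a minimal Python sketch of the two model types; all parameter values (m1, s1, c0, ...) are made-up placeholders, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values (illustration only).
m1, s1 = 170.0, 10.0        # mean / std of X1
m2, s2 = 70.0, 15.0         # mean / std of X2
c0, c1, c2, s_eps = 1.0, 0.5, 0.2, 2.0

def sample_generative(n):
    """Generative model: sample the full vector Z = (X1, X2, Y)."""
    x1 = rng.normal(m1, s1, n)
    x2 = rng.normal(m2, s2, n)
    y = c0 + c1 * x1 + c2 * x2 + rng.normal(0.0, s_eps, n)
    return np.column_stack([x1, x2, y])

def sample_discriminative(x1, x2):
    """Discriminative model: only P(Y | X1, X2) is modelled; inputs are given."""
    return c0 + c1 * x1 + c2 * x2 + rng.normal(0.0, s_eps, size=np.shape(x1))

z = sample_generative(5)
y = sample_discriminative(z[:, 0], z[:, 1])
print(z, y)
```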

6 The curse of dimensionality: back to the university wrestling club example
Variable / Values / Meaning:
M: yes, no (member of the wrestling club)
H: 140, 145, …, 235, 240 (height, cm)
W: 40, 45, …, 135, 140 (weight, kg)
G: female, male (gender)
D: never, sometimes, often (sporty appearance: dress code, etc.)
Number of parameters? Number of students? Defining P(M | H, W, G, D) requires |H| × |W| × |G| × |D| parameters, i.e. several thousand, and at least an order of magnitude more students to estimate them.

7 Graphical Models / Bayesian Networks
A tool to specify a joint distribution P(X) = P(A, B, …, G). A two-level specification:
1. Directed acyclic graph (DAG) over vertices A, B, …, G: specifies P(X) as a product of factors and encodes independence relations.
2. DAG + Conditional Probability Tables (CPTs): specifies the numerical values of the factors and thus fully specifies P(X).

8 First level of a Bayesian Network: a factorization model of the joint distribution
P(A, …, G)(a, b, …, g) = P_A(a) · P_{B|A=a}(b) · P_C(c) · P_{D|B=b,C=c}(d) · P_{E|C=c}(e) · P_{F|B=b,D=d,E=e}(f) · P_{G|F=f}(g)
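
As a hedged illustration of how this first level is used computationally, the sketch below evaluates the joint probability of a full assignment as a product of CPT lookups; the CPT numbers are placeholders, not values from the course.

```python
# CPTs stored as dictionaries; all probability values are illustrative placeholders.
P_A = {0: 0.7, 1: 0.3}
P_C = {0: 0.5, 1: 0.5}
P_B = {(b, a): p for a in (0, 1) for b, p in zip((0, 1), (0.8, 0.2))}   # P(b | a)
P_E = {(e, c): p for c in (0, 1) for e, p in zip((0, 1), (0.6, 0.4))}   # P(e | c)
P_G = {(g, f): p for f in (0, 1) for g, p in zip((0, 1), (0.9, 0.1))}   # P(g | f)
P_D = {(d, b, c): 0.5 for b in (0, 1) for c in (0, 1) for d in (0, 1)}  # P(d | b, c)
P_F = {(f, b, d, e): 0.5 for b in (0, 1) for d in (0, 1)
       for e in (0, 1) for f in (0, 1)}                                 # P(f | b, d, e)

def joint(a, b, c, d, e, f, g):
    """P(a, ..., g) as the product of the seven factors of the DAG."""
    return (P_A[a] * P_B[(b, a)] * P_C[c] * P_D[(d, b, c)]
            * P_E[(e, c)] * P_F[(f, b, d, e)] * P_G[(g, f)])

print(joint(0, 1, 0, 1, 0, 1, 0))
```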

9 Second level of a Bayesian Network: conditional probability tables (CPTs)
(The slide attaches example CPTs such as P(A), P(B | A) and P(D | B, C) to the nodes of the DAG.)
P(A, …, G)(a, b, …, g) = P_A(a) · P_{B|A=a}(b) · P_C(c) · P_{D|B=b,C=c}(d) · P_{E|C=c}(e) · P_{F|B=b,D=d,E=e}(f) · P_{G|F=f}(g)

10 Bayesian Networks and Causality
Causality: event A is a causal variable for B if A and B are (strongly) dependent and A occurs before B. Real-world phenomena have causal models, and Bayesian Networks naturally formalize causal problems: the parents of a variable V are the immediate causes of V, and causality spreads from orphan (parentless) variables to childless ones. However, Bayesian Networks by themselves say nothing about causality, only about dependence.
(The slide illustrates this with a small DAG over Weather (clouds), Rain, Sun, Pest, Corn and Corn (last year).)

11 Rules for independence: an intuitive interpretation
(Illustrated on a DAG relating Weather (clouds), Rain, Sun, Corn (last year), Pest, Predators and Corn.)
Chain A → C → B. Rules: A and B are dependent, but independent given C.
Corn and Corn last year are dependent: P(corn | corn last year) < P(corn | no corn last year).
Corn and Corn last year are independent given Pest: P(corn | corn last year, no pest) = P(corn | no corn last year, no pest).
Common cause A ← C → B. Rules: A and B are dependent, independent given C, but still dependent given some other variable D off the path.
Corn and Predators are dependent: P(no corn | predators) > P(no corn | no predators).
Corn and Predators are independent given Pest: P(no corn | predators, pest) = P(no corn | pest).
Corn and Predators remain dependent given Corn last year: P(no corn | predators, corn last year) > P(no corn | no predators, corn last year).
Collider A → C ← B. Rules: A and B are independent, but dependent given C, and also dependent given a descendant D of C.
Sun and Pest are independent: P(dark | pest) = P(dark | no pest).
Sun and Pest are dependent given Corn: P(dark | no pest, no corn) > P(dark | pest, no corn).
Sun and Pest are dependent given Pest next year: P(dark | no pest, no pest next year) > P(dark | pest, no pest next year).

12 Rules for independence

13 Rules for independence

14 d-separation: a general theorem for independence
Definition of d-separation: X and Y are d-separated given Z if every path P from X to Y contains at least one of the blocking configurations:
a chain X … A → B → C … Y (or X … A ← B ← C … Y) such that B ∈ Z;
a fork X … A ← B → C … Y such that B ∈ Z;
a collider X … A → B ← C … Y such that neither B nor any descendant of B is in Z.
Theorem: for sets of variables 𝒳, 𝒴, 𝒵, the independence of 𝒳 and 𝒴 given 𝒵 holds in every distribution that factorizes over the DAG if and only if every X ∈ 𝒳 and every Y ∈ 𝒴 are d-separated given 𝒵.

15 d-separation : example

16 Example of modelling: back to the university wrestling club
Variable / Values / Meaning:
M: yes, no (member of the wrestling club)
H: 140, 145, …, 235, 240 (height, cm)
W: 40, 45, …, 135, 140 (weight, kg)
G: female, male (gender)
D: never, sometimes, often (sporty appearance: dress code, etc.)
S: yes, no (practices some sport), a hidden or latent variable

17 Example of modelling: Back to the university wrestling club

18 Bayesian Networks with continuous random variables
Given a continuous variable X whose parents Y_1, …, Y_k are all discrete: assume some parameterized distribution for X | Y_1, …, Y_k, e.g. H | G = g ~ N(μ_g, σ_g²), i.e. 4 parameters (μ_m, σ_m, μ_f, σ_f).
For continuous parents, introduce a parameterized dependency on them, e.g. W | G = g, H = h, M = m ~ N(μ0_{g,m} + μ1_{g,m}·h, σ0_{g,m} + σ1_{g,m}·h), i.e. 16 parameters μ0_{g,m}, μ1_{g,m}, σ0_{g,m}, σ1_{g,m} for g ∈ {m, f} and m ∈ {y, n}.
This reduces the number of parameters and the risk of overfitting: 4 + 16 parameters instead of thousands!
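
A minimal sketch of this conditional linear-Gaussian idea, keeping the same structure as the slide; every numeric parameter below is a made-up placeholder, not a value from the course.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical CPD parameters, one set per (gender, member) combination.
# H | G=g           ~ N(mu[g], sigma[g]^2)                      -> 4 parameters
# W | G=g, H=h, M=m ~ N(mu0[g,m] + mu1[g,m]*h,
#                        (s0[g,m] + s1[g,m]*h)^2)               -> 16 parameters
mu    = {"f": 165.0, "m": 178.0}
sigma = {"f": 7.0,   "m": 8.0}
mu0 = {("f", "y"): -40.0, ("f", "n"): -45.0, ("m", "y"): -35.0, ("m", "n"): -42.0}
mu1 = {("f", "y"): 0.60,  ("f", "n"): 0.55,  ("m", "y"): 0.68,  ("m", "n"): 0.62}
s0  = {k: 3.0 for k in mu0}
s1  = {k: 0.02 for k in mu0}

def sample_height_weight(g, m):
    """Sample (H, W) given gender g and membership m under the CLG model."""
    h = rng.normal(mu[g], sigma[g])
    w = rng.normal(mu0[g, m] + mu1[g, m] * h, s0[g, m] + s1[g, m] * h)
    return h, w

print(sample_height_weight("m", "y"))
```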

19 USING A MODEL or how to estimate the output of a model

20 Model output prediction
Inputs X → Model m (parameters θ) → Outputs Y.
Prediction: deduce the output distribution of Y from X and θ.
Weak Bayesian model: θ is known, e.g. the mean of Y is E[Y | X, θ] = ∫ y P(Y = y | X, θ) dy.
Strong Bayesian model: θ is uncertain and modelled by Θ ~ P_Θ(θ | κ), e.g. the mean of Y is E[Y | X, κ] = ∫∫ y P(Y = y | X, θ) P_Θ(θ | κ) dy dθ.
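
The strong Bayesian expectation integrates over θ; when that integral has no closed form it can be approximated by Monte Carlo sampling from P_Θ(θ | κ). A minimal sketch, with a toy prior chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def predictive_mean(sample_theta, mean_y_given_theta, n_samples=10_000):
    """E[Y | X, kappa] ~= (1/S) * sum_s E[Y | X, theta_s], theta_s ~ P(theta | kappa)."""
    thetas = (sample_theta() for _ in range(n_samples))
    return np.mean([mean_y_given_theta(t) for t in thetas])

# Toy instantiation: theta = mu with prior N(50, 5^2) and E[Y | theta] = theta.
est = predictive_mean(lambda: rng.normal(50.0, 5.0), lambda mu: mu)
print(est)   # close to 50
```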

21 A very simple example: defining the model
Requests → Server → Responses; processing time of the server T ~ N(μ, σ²).
Specs say μ ≈ μ0 ± σ0 and σ ≈ σT ± 0, with μ0 = 50 ms, σ0 = 5 ms, σT = 10 ms.
Generative model: no input; output: processing time T; parameters θ = (μ, σ). (A normally distributed processing time is not realistic. Why?)
Weak Bayesian model: θ = (μ0, σT).
Strong Bayesian model: P_Θ(κ) ∝ exp(−(μ − μ0)²/(2σ0²)) · δ(σ, σT), with hyperparameters κ = (μ0, σ0, σT).

22 A very simple example: applying the model
Weak Bayesian model: P(T | κ)(t) ∝ exp(−(t − μ0)²/(2σT²)).
Strong Bayesian model: P(T | κ)(t) ∝ exp(−(t − μ0)²/(2(σT² + σ0²))), i.e. the parameter uncertainty on μ widens the predictive distribution.
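
A quick numeric sketch of these two predictive densities using the hyperparameters of the previous slide (μ0 = 50 ms, σ0 = 5 ms, σT = 10 ms); the evaluation points are arbitrary.

```python
import numpy as np

mu0, sigma0, sigmaT = 50.0, 5.0, 10.0        # hyperparameters from the example (ms)

def normal_pdf(t, mu, sigma):
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

t = np.array([40.0, 50.0, 65.0])             # candidate processing times (ms)
weak   = normal_pdf(t, mu0, sigmaT)                           # theta fixed at (mu0, sigmaT)
strong = normal_pdf(t, mu0, np.sqrt(sigmaT**2 + sigma0**2))   # mu integrated out

print(weak)    # all uncertainty attributed to process noise
print(strong)  # parameter uncertainty on mu widens the predictive distribution
```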

23 LEARNING A MODEL or how to estimate the parameters of a model from the data

24 Model output prediction
Inputs X → Model m (parameters θ) → Outputs Y.
Bayesian prediction: deduce the output Y from X and P_Θ(θ | κ): P(Y | X, κ) = ∫ P(Y | X, θ) P_Θ(θ | κ) dθ.

25 Model estimation: the learning step
Inputs X → Model m (parameters θ) → Outputs Y.
Estimation: induce the parameters θ from data / observations (1) O = (o_i)_{1 ≤ i ≤ n}.
Replace P_Θ(θ | κ) by the distribution P_Θ(θ | κ, O), which updates the prediction: P(Y | X, κ, O) = ∫ P(Y | X, θ) P_Θ(θ | κ, O) dθ.
Bayesian inference: infer P(θ | κ, O) from P(θ | κ) and O. But how?
(1) o_i = (x_i, y_i) or o_i = y_i.

26 Bayesian estimation: the Bayes rule
Bayes rule (or theorem), after Thomas Bayes (1702-1761): given events A and B, P(A | B) P(B) = P(B | A) P(A), obvious by definition since both sides equal P(A ∩ B).
The heart of Bayesian inference: given a new observation O = o,
P(θ | o) = P(o | θ) P(θ) / P(O = o),
where P(o | θ) is the likelihood L_o(θ) of θ (not a distribution over θ; why?), P(θ) is the prior of θ (a distribution), P(O = o) is a normalization factor, and P(θ | o) is the posterior of θ (a distribution).

27 Bayesian estimation: fundamentally an online approach
Hypothesis of i.i.d. observations (valid if the studied system is stationary): P(O | θ)(o_1, …, o_k) = ∏_i P(O | θ)(o_i).
Given i.i.d. observations O = (o_1, o_2, …, o_k):
P(θ | O, κ) = P(O | θ) P(θ | κ) / P(O | κ) ∝ [∏_i P(O | θ)(o_i)] P(θ | κ) = [∏_i L_{o_i}(θ)] P(θ | κ).
The processing order of observations does not matter:
P(θ | O_1 ∪ O_2, κ) ∝ L_{O_1}(θ) L_{O_2}(θ) P(θ | κ) ∝ L_{O_2}(θ) P(θ | O_1, κ),
so observations can be processed in batch or online.

28 Example of Bayesian estimation: the processing-time server
Requests → Server → Responses. No input; output: processing time T; model θ = (μ, σ) with T | θ ~ N(μ, σ²).
1. Define a prior on Θ: P_Θ(θ | κ) = (1/(σ0 √(2π))) exp(−(μ − μ0)²/(2σ0²)) · δ(σ, σT).
2. Observe T = t and apply the Bayes rule:
P_Θ(θ | κ, T = t) = P(T = t | θ) P_Θ(θ | κ) / P(T = t | κ) = [1/(σT √(2π))] exp(−(t − μ)²/(2σT²)) · [1/(σ0 √(2π))] exp(−(μ − μ0)²/(2σ0²)) · δ(σ, σT) / P(T = t | κ).

29 Example of Bayesian estimation (continued)
3. Compute the posterior:
P(μ, σ | T = t) ∝ exp(−(t − μ)²/(2σT²)) exp(−(μ − μ0)²/(2σ0²)) δ(σ, σT) ∝ exp(−(μ − μ1)²/(2σ1²)) δ(σ, σT),
i.e. μ | T = t ~ N(μ1, σ1²) with
μ1 = (t/σT² + μ0/σ0²) / (1/σT² + 1/σ0²) and 1/σ1² = 1/σT² + 1/σ0².
E.g. if t = 56 ms then μ1 = 51.2 ms and σ1 = 4.47 ms. Why is μ1 so close to μ0?
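
A short sanity check of this update in Python, using the numbers from the slide:

```python
mu0, sigma0, sigmaT = 50.0, 5.0, 10.0
t = 56.0

prec = 1 / sigmaT**2 + 1 / sigma0**2              # posterior precision 1 / sigma1^2
mu1 = (t / sigmaT**2 + mu0 / sigma0**2) / prec    # posterior mean
sigma1 = prec ** -0.5                             # posterior standard deviation

print(mu1, sigma1)   # 51.2  4.47...
```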

30 Example of Bayesian estimation (continued)
4. Repeat the observation process (assuming the observations are i.i.d.):
μ | T = (t_1, …, t_n) ~ N( (Σ_i t_i/σT² + μ0/σ0²) / (n/σT² + 1/σ0²), 1/(n/σT² + 1/σ0²) ).
Comments: the posterior mean is a weighted mean of the observation average and the prior mean; the posterior standard deviation decreases slowly, as 1/√n; the initial prior matters: a σ0 that is too small slows down convergence, while a σ0 that is too high makes the initial guess μ0 useless.
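
A minimal sketch checking that the batch formula above and the one-observation-at-a-time update of the previous slides give the same posterior; the observation values are made up.

```python
import numpy as np

mu0, sigma0, sigmaT = 50.0, 5.0, 10.0
obs = np.array([56.0, 48.0, 53.0, 61.0])   # made-up processing times (ms)

# Batch formula from the slide.
prec_batch = len(obs) / sigmaT**2 + 1 / sigma0**2
mu_batch = (obs.sum() / sigmaT**2 + mu0 / sigma0**2) / prec_batch

# Online processing: update (mu, var) one observation at a time.
mu, var = mu0, sigma0**2
for t in obs:
    prec = 1 / sigmaT**2 + 1 / var
    mu = (t / sigmaT**2 + mu / var) / prec
    var = 1 / prec

print(mu_batch, mu)            # identical up to floating point
print(1 / prec_batch, var)     # posterior variances also match
```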

31 Choosing the prior P(θ)
Philosophically, the prior should reflect a priori knowledge: the less is known, the more spread out the distribution should be (uniform prior in the limit). In practice, choosing a good prior can speed up convergence. The prior on parameters introduces new parameters (like μ0, σ0) called hyperparameters. Choosing a prior is a compromise between two contradictory objectives: a representative but intractable prior, or a tractable but unrealistic prior (conjugate prior).

32 Tractability issues
Problem 1: enumerating all values of θ is impossible, so introduce a parameterized representation of P(θ | κ).
Problem 2: how to parameterize the posterior P(θ | κ, O) from P(θ | κ)? Either use approximate representations (sampling), or use closed forms for prior/posterior (conjugate priors).

33 Exact inference and the notion of conjugate prior
In general there is no simple expression for the posterior P(θ | O, κ) ∝ P(O | θ) P(θ | κ).
Special case of a conjugate prior: the prior and the posterior have analogous closed forms. Advantages: computation is fast, easy and exact. Limitation: it might not fit reality.
Example: likelihood Y ~ N(μ, σ²) with σ² known, parameter θ = μ; prior μ ~ N(μ0, σ0²) with hyperparameters κ = (μ0, σ0); posterior μ | O ~ N( (Σ_i y_i/σ² + μ0/σ0²) / (n/σ² + 1/σ0²), 1/(n/σ² + 1/σ0²) ).
(More likelihood/prior pairs exist; see standard tables of conjugate priors.)

34 Making Bayesian decisions
Prior P(θ | κ) + observations O → Bayesian inference → posterior P(θ | O, κ) → decision (using a loss L(θ̂ | θ)) → parameter values θ̂.
The output of Bayesian estimation is a posterior distribution P(θ | O, κ). Some problems require choosing a value θ̂ for real-time prediction, e.g. navigation systems. Choosing θ̂ instead of the real θ induces some cost or loss L(θ̂ | θ). NB: strictly speaking, this is no longer Bayesian.

35 Bayes Estimators
Prior P(θ) + observations O → Bayesian inference → posterior P(θ | O) → decision (using a loss L(θ̂, θ)) → parameter values θ̂.
Bayes estimator: select the θ̂ that minimizes the average posterior risk, i.e. θ̂ = argmin_{θ̂} E_{Θ | O, κ}[ L(θ̂, θ) ].
Maximum A Posteriori (MAP): the Bayes estimator under a uniform (0-1) loss; equivalently, select θ̂ as the most probable θ: θ̂ = argmax_θ P_{Θ | O, κ}(θ).
Maximum Likelihood (MLE): MAP with a uniform prior; equivalently, select θ̂ = argmax_θ P(O | θ, κ).

36 Maximum A Posteriori estimator (MAP)
Assume a uniform (0-1) loss L(θ̂, θ) = 1 − δ(θ̂ − θ), i.e. 0 if θ̂ = θ and a constant (1) otherwise. Then:
θ_MAP = argmin_{θ̂} E_{Θ|O,κ}[ L(θ̂, θ) ] = argmin_{θ̂} E_{Θ|O,κ}[ 1 − δ(θ̂ − θ) ] = argmax_{θ̂} E_{Θ|O,κ}[ δ(θ̂ − θ) ] = argmax_{θ̂} ∫ δ(θ̂ − θ) P_{Θ|O,κ}(θ) dθ = argmax_θ P_{Θ|O,κ}(θ) = argmax_θ P(O | θ) P(θ | κ) / P(O | κ),
hence θ_MAP = argmax_θ P(O | θ) P(θ | κ).

37 Maximum Likelihood Estimation (MLE)
Equivalent to MAP with a uniform prior: θ_MLE = argmax_θ P(O | θ) P(θ | κ) = argmax_θ P(O | θ).
It is often easier to work with the log-likelihood: θ_MLE = argmax_θ log P(O | θ). Advantages: products are transformed into sums, which also solves numerical precision problems (products of small probabilities).

38 i.i.d. observations
Common hypothesis: the observations Z = (z_1, …, z_n) are i.i.d.: identically distributed (stationary process/model) and independent (random sampling on the input space X, generally not true in real data).
Consequence for MAP/MLE: θ_MAP = argmax_θ P(θ) ∏_{i=1}^n L_{z_i}(θ) = argmax_θ [ log P(θ) + Σ_{i=1}^n log L_{z_i}(θ) ].

39 EXAMPLE OF WEAK AND STRONG BAYESIAN ESTIMATION FOR THE CATEGORICAL DISTRIBUTION MLE of categorical distribution Conjugate prior of categorical distribution

40 Computing the MLE of categorical CPTs
Problem: estimate the MLE of c = P(M = yes | G = g, H = h), i.e. solve θ_MLE = argmax_θ L_Z(θ) via ∂ log L_Z(θ) / ∂c = Σ_{i=1}^n ∂ log L_{z_i}(θ) / ∂c = 0.
Given data z_i = (g_i, h_i, m_i, w_i, d_i) and the factorization P(G, H, M, W, D) = P(G) P(H | G) P(M | G, H) P(W | G, H, M) P(D | G, M, W):
log L_{z_i}(θ) = log P(g_i) + log P(h_i | g_i) + log P(m_i | g_i, h_i) + log P(w_i | g_i, h_i, m_i) + log P(d_i | g_i, m_i, w_i),
so ∂ log L_{z_i}(θ) / ∂c = ∂ log P(m_i | g_i, h_i) / ∂c.

41 Computing the MLE of categorical CPTs (continued)
Three cases to distinguish:
(m_i, g_i, h_i) = (yes, g, h): ∂ log P(m_i | g_i, h_i)/∂c = ∂ log c/∂c = 1/c;
(m_i, g_i, h_i) = (no, g, h): ∂ log P(m_i | g_i, h_i)/∂c = ∂ log(1 − P(yes | g, h))/∂c = ∂ log(1 − c)/∂c = −1/(1 − c);
(g_i, h_i) ≠ (g, h): ∂ log P(m_i | g_i, h_i)/∂c = 0.
Hence ∂ log L_Z(θ)/∂c = (1/c) N(m_i = yes, g_i = g, h_i = h) − (1/(1 − c)) N(m_i = no, g_i = g, h_i = h) = 0.

42 Computing the MLE of categorical CPTs (continued)
Solving gives c = N(m_i = yes, g_i = g, h_i = h) / (N(m_i = yes, g_i = g, h_i = h) + N(m_i = no, g_i = g, h_i = h)), i.e.
c_MLE = N(m_i = yes, g_i = g, h_i = h) / N(g_i = g, h_i = h).
General result: the MLE of a categorical distribution amounts to computing frequencies of occurrences in the dataset. This generalizes to non-Boolean categorical variables (Lagrangian optimization).
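
A minimal sketch of this counting result; the toy dataset and the height binning below are invented for illustration.

```python
from collections import Counter

# Made-up observations (g_i, h_i, m_i); heights binned to the nearest 5 cm.
data = [("m", 180, "yes"), ("m", 180, "no"), ("m", 180, "yes"),
        ("f", 165, "no"),  ("m", 175, "no"), ("m", 180, "yes")]

joint = Counter((g, h, m) for g, h, m in data)   # N(m, g, h)
marg  = Counter((g, h) for g, h, _ in data)      # N(g, h)

def mle(g, h):
    """MLE of P(M = yes | G = g, H = h); division by zero if (g, h) was never observed."""
    return joint[(g, h, "yes")] / marg[(g, h)]

print(mle("m", 180))   # 3 / 4 = 0.75
```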

43 EXAMPLE OF WEAK AND STRONG BAYESIAN ESTIMATION FOR THE CATEGORICAL DISTRIBUTION MLE of categorical distribution Conjugate prior of categorical distribution

44 Dirichlet distribution: definition
Definition: the Dirichlet distribution Dir(α) of parameters α = (α_1, …, α_C), with α_i > 0 for all i, has density
P_α(p_1, …, p_C) = (1/B(α)) ∏_{i=1}^C p_i^{α_i − 1} if Σ_{i=1}^C p_i = 1, and 0 otherwise,
where B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i) and Γ(t) = ∫_0^{+∞} x^{t−1} e^{−x} dx.
Properties: samples of P_α are categorical distributions; the mode is (p_1, …, p_C) with p_i = (α_i − 1) / (Σ_{j=1}^C α_j − C); Dir(α) is the conjugate prior of the categorical / multinomial distribution.
(The slide shows density plots for several settings of (α_1, α_2, α_3).)

45 Dirichlet distribution: a conjugate prior of categorical distributions
Given: a variable Y with categorical distribution θ = (p_1, …, p_C), Σ_{i=1}^C p_i = 1; a Dirichlet prior P_Θ(θ | κ) = Dir(α) on θ; and n i.i.d. observations O = (y_1, …, y_n) of Y:
P_Θ(θ | O, κ) ∝ P(O | θ) P_Θ(θ | κ) ∝ [∏_j p_{y_j}] ∏_i p_i^{α_i − 1} = ∏_i p_i^{N_i} ∏_i p_i^{α_i − 1} = ∏_i p_i^{N_i + α_i − 1},
where N_i is the number of observations in category i. The posterior P_Θ(θ | O, κ) is therefore Dirichlet: Dir(N_1 + α_1, …, N_C + α_C).
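
A minimal sketch of this conjugate update, with made-up prior pseudo-counts and observations: the posterior parameters are simply the prior parameters plus the observed counts.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])        # Dirichlet prior Dir(alpha) over 3 categories
obs = np.array([0, 2, 2, 1, 2, 0, 2])    # observed category indices y_1, ..., y_n

counts = np.bincount(obs, minlength=len(alpha))   # N_1, ..., N_C
alpha_post = alpha + counts                       # posterior is Dir(alpha + counts)

post_mean = alpha_post / alpha_post.sum()                            # E[p_i | O]
post_mode = (alpha_post - 1) / (alpha_post.sum() - len(alpha_post))  # MAP (all alpha_i > 1)

print(alpha_post, post_mean, post_mode)
```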

46 Posterior of a categorical CPT with a Dirichlet prior
Theorem: if the priors of the CPTs P(p_{X_j} | α_{X_j}) are independent, then the posteriors of the CPTs P(p_{X_j} | O, α_{X_j}) are independent:
P_Θ(θ | O, κ) ∝ L_Z(θ) P_Θ(θ | κ) = [∏_i L_{z_i}(θ)] P_Θ(θ | κ) = ∏_i ∏_j P(x_j^i | par(X_j)^i) · ∏_j P(p_{X_j} | α_{X_j}) ∝ ∏_j P(p_{X_j} | O, α_{X_j}).
Consequence: if the CPT P(X | Y_1 = v_1, …, Y_k = v_k) has prior Dir(α_1, …, α_C), its posterior is Dir(N(x = 1, y_1 = v_1, …, y_k = v_k) + α_1, …, N(x = C, y_1 = v_1, …, y_k = v_k) + α_C).

47 Computing the MAP of categorical CPTs
E.g. compute the MAP of c = P(M = yes | G = g, H = h) for g = man and h = 1.80 m:
c_MAP = argmax_c P(P_M(· | g, h) | O, α_M), the mode of Dir( N(m_i = yes, g_i = g, h_i = h) + α_{yes,g,h}, N(m_i = no, g_i = g, h_i = h) + α_{no,g,h} ),
i.e. c_MAP = ( N(m_i = yes, g_i = g, h_i = h) + α_{yes,g,h} − 1 ) / ( N(g_i = g, h_i = h) + α_{yes,g,h} + α_{no,g,h} − 2 ).
MAP thus amounts to introducing fake (pseudo-)examples (Laplace smoothing), and it solves the problem of the MLE being undefined (0/0) when no matching example exists.
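
A minimal sketch of this MAP formula; with α_yes = α_no = 2 it reduces to classic add-one (Laplace) smoothing, and the estimate stays defined even when no matching observation exists.

```python
def map_estimate(n_yes, n_no, a_yes=2.0, a_no=2.0):
    """MAP of P(M = yes | g, h) under a Dir(a_yes, a_no) prior on the CPT entry."""
    return (n_yes + a_yes - 1) / (n_yes + n_no + a_yes + a_no - 2)

print(map_estimate(3, 1))    # (3 + 1) / (4 + 2) = 0.666...
print(map_estimate(0, 0))    # 0.5: well defined even with no matching observations
```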

48 Bayesian estimation: a summary
Bayesian inference: produce a posterior from a prior and observations using the Bayes rule. In strong Bayesian models, the model parameters are themselves described by distributions.
The fully Bayesian approach often has to be relaxed, because of tractability/scalability issues or because the application requires choosing parameter values. The weak Bayesian approach then uses Bayesian estimation:
general case: the Bayes estimator;
with a uniform (0-1) loss: the Maximum A Posteriori estimator (MAP);
with, in addition, a uniform prior: the Maximum Likelihood Estimator (MLE).
