Learning Bayesian Networks: Given Structure and Completely Observed Data
Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani
Learning problem
Target: the true distribution P* that may correspond to a model M* = ⟨K*, θ*⟩
Hypothesis space: a specified family of probabilistic graphical models
Data: a set of instances sampled from P*
Learning goal: selecting a model M̂ that constructs the best approximation to M* according to a performance metric
Learning tasks on graphical models
Parameter learning / structure learning
Completely observable / partially observable data
Directed models / undirected models
Parameter learning in directed models with complete data
We assume that the structure of the model is known: consider learning parameters for a BN with a given structure.
Goal: estimate the CPDs from a dataset D = {x^(1), ..., x^(N)} of independent, identically distributed (i.i.d.) training samples.
Each training sample x^(n) = (x_1^(n), ..., x_l^(n)) is a vector in which every element x_i^(n) is known (no missing values, no hidden variables).
Density estimation review
We use density estimation to solve this learning problem.
Density estimation: estimating the probability density function P(x), given a set of data points x^(i) drawn from it.
Parametric methods: assume that P(x) has a specific functional form with a number of adjustable parameters.
Two approaches: MLE and the Bayesian estimate.
MLE: need to determine θ̂ given {x^(1), ..., x^(N)}; MLE has an overfitting problem.
Bayesian estimate: maintains a probability distribution P(θ) over the spectrum of hypotheses; needs a prior distribution on the parameters.
Density estimation: graphical model of the i.i.d. assumption
(Figure: plate notation. The parameter θ is a shared parent of the observations X^(1), ..., X^(N), drawn compactly as a plate over X^(n), n = 1, ..., N; in the Bayesian view, hyperparameters α are in turn a parent of θ.)
Maximum Likelihood Estimation (MLE)
Likelihood is the conditional probability of the observations D = {x^(1), x^(2), ..., x^(N)} given the value of the parameters θ.
Assuming i.i.d. (independent, identically distributed) samples:
P(D|θ) = P(x^(1), ..., x^(N)|θ) = Π_{n=1}^N P(x^(n)|θ)   (likelihood of θ w.r.t. the samples)
Maximum likelihood estimation:
θ_ML = argmax_θ P(D|θ) = argmax_θ Σ_{n=1}^N ln p(x^(n)|θ)
MLE has a closed-form solution for many parametric distributions.
MLE: Bernoulli distribution
Given D = {x^(1), x^(2), ..., x^(N)} with m heads (1) and N − m tails (0):
p(x|θ) = θ^x (1 − θ)^(1−x),   p(x = 1|θ) = θ
p(D|θ) = Π_{n=1}^N p(x^(n)|θ) = Π_{n=1}^N θ^(x^(n)) (1 − θ)^(1−x^(n))
ln p(D|θ) = Σ_{n=1}^N ln p(x^(n)|θ) = Σ_{n=1}^N { x^(n) ln θ + (1 − x^(n)) ln(1 − θ) }
Setting ∂ ln p(D|θ)/∂θ = 0 gives θ_ML = (1/N) Σ_{n=1}^N x^(n) = m/N
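The closed-form estimate above is just the empirical frequency of heads; a minimal Python sketch (the function name is illustrative, not from the lecture):

```python
def bernoulli_mle(samples):
    """MLE for a Bernoulli parameter: theta_ML = m / N,
    where m is the number of heads (1s) among N binary samples."""
    return sum(samples) / len(samples)

# m = 5 heads out of N = 8 tosses -> theta_ML = 5/8
data = [1, 0, 1, 1, 0, 1, 1, 0]
print(bernoulli_mle(data))  # 0.625
```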
MLE: Multinomial distribution
Multinomial distribution (on a variable with K states):
Parameter space: θ = (θ_1, ..., θ_K), θ_k ∈ [0,1], Σ_{k=1}^K θ_k = 1
Variable: 1-of-K coding x = (x_1, ..., x_K), x_k ∈ {0,1}, Σ_{k=1}^K x_k = 1
P(x|θ) = Π_{k=1}^K θ_k^(x_k),   P(x_k = 1) = θ_k
For K = 3, the constraint θ_1 + θ_2 + θ_3 = 1 with θ_i ∈ [0,1] defines a simplex: the set of valid parameters.
MLE: Multinomial distribution
D = {x^(1), x^(2), ..., x^(N)}
P(D|θ) = Π_{n=1}^N P(x^(n)|θ) = Π_{n=1}^N Π_{k=1}^K θ_k^(x_k^(n)) = Π_{k=1}^K θ_k^(m_k),   where m_k = Σ_{n=1}^N x_k^(n)
Maximize the Lagrangian L(θ, λ) = ln p(D|θ) + λ(1 − Σ_{k=1}^K θ_k)
⇒ θ̂_k = (1/N) Σ_{n=1}^N x_k^(n) = m_k / N
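The multinomial MLE again reduces to normalized counts; a small sketch in Python, representing each sample by its state index rather than 1-of-K coding (names are illustrative):

```python
def multinomial_mle(samples, K):
    """MLE for a K-state multinomial: theta_k = m_k / N,
    where m_k counts how often state k occurs in the samples."""
    counts = [0] * K
    for x in samples:          # each sample is a state index in {0, ..., K-1}
        counts[x] += 1
    N = len(samples)
    return [m / N for m in counts]

print(multinomial_mle([0, 2, 1, 0, 0, 2], K=3))  # [0.5, 0.1666..., 0.3333...]
```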
MLE: Gaussian with unknown μ
ln P(x^(n)|μ) = −(1/2) ln(2πσ²) − (1/(2σ²)) (x^(n) − μ)²
∂ ln P(D|μ)/∂μ = 0 ⇒ Σ_{n=1}^N ∂ ln p(x^(n)|μ)/∂μ = 0 ⇒ (1/σ²) Σ_{n=1}^N (x^(n) − μ) = 0
⇒ μ_ML = (1/N) Σ_{n=1}^N x^(n)
Bayesian approach
Treats the parameters θ as random variables with an a priori distribution; this utilizes the available prior information about the unknown parameters.
As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector θ.
The samples D convert the prior density P(θ) into a posterior density P(θ|D).
We keep track of beliefs about θ's values and use these beliefs for reaching conclusions.
Maximum A Posteriori (MAP) estimation
θ_MAP = argmax_θ p(θ|D)
Since p(θ|D) ∝ p(D|θ) p(θ):
θ_MAP = argmax_θ p(D|θ) p(θ)
Example of a prior distribution: p(θ) = N(θ|0, σ²)
Bayesian approach: predictive distribution
Given a set of samples D = {x^(i)}_{i=1}^N, a prior distribution P(θ) on the parameters, and the form of the distribution P(x|θ):
We find P(θ|D) and use it to specify P(x) ≈ P(x|D) on new data as an estimate of P(x):
P(x|D) = ∫ P(x, θ|D) dθ = ∫ P(x|D, θ) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ   (predictive distribution)
Analytical solutions exist only for very special forms of the involved functions.
Conjugate priors
We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties:
choose a prior P(θ|α) such that the posterior, which is proportional to p(D|θ)p(θ|α), has the same functional form as the prior.
P(θ|α) and P(D|θ)P(θ|α) having the same functional form means the posterior is again P(θ|α′), with hyperparameters α′ updated from α and D.
Prior for the Bernoulli likelihood
Beta distribution over θ ∈ [0,1]:
Beta(θ|α_1, α_0) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^(α_1 − 1) (1 − θ)^(α_0 − 1) ∝ θ^(α_1 − 1) (1 − θ)^(α_0 − 1)
E[θ] = α_1 / (α_0 + α_1)
Most probable θ (mode): θ̂ = (α_1 − 1) / (α_0 + α_1 − 2)
The Beta distribution is the conjugate prior of the Bernoulli: P(x|θ) = θ^x (1 − θ)^(1−x)
Beta distribution
(Figure: Beta densities for various hyperparameter settings.)
Bernoulli likelihood: posterior
Given D = {x^(1), x^(2), ..., x^(N)} with m heads (1) and N − m tails (0), where m = Σ_{i=1}^N x^(i):
p(θ|D) ∝ p(D|θ) p(θ) = [Π_{i=1}^N θ^(x^(i)) (1 − θ)^(1−x^(i))] Beta(θ|α_1, α_0) ∝ θ^(m + α_1 − 1) (1 − θ)^(N − m + α_0 − 1)
p(θ|D) = Beta(θ|α_1′, α_0′) ∝ θ^(α_1′ − 1) (1 − θ)^(α_0′ − 1),   where α_1′ = α_1 + m and α_0′ = α_0 + (N − m)
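The conjugate update amounts to adding the observed counts to the hyperparameters; a minimal sketch (function names are illustrative) together with the Beta mode formula:

```python
def beta_posterior(alpha1, alpha0, samples):
    """Conjugate update for a Bernoulli likelihood with a Beta prior:
    alpha1' = alpha1 + m,  alpha0' = alpha0 + (N - m)."""
    m, N = sum(samples), len(samples)
    return alpha1 + m, alpha0 + (N - m)

def beta_mode(alpha1, alpha0):
    """Most probable theta under Beta(alpha1, alpha0), for alpha1, alpha0 > 1."""
    return (alpha1 - 1) / (alpha1 + alpha0 - 2)

a1, a0 = beta_posterior(2, 2, [1, 1, 1])   # prior Beta(2, 2), three heads
print(a1, a0)             # 5 2
print(beta_mode(a1, a0))  # 0.8
```

Note that even after observing only heads, the MAP estimate is 0.8, not 1.0: the prior pseudo-counts smooth the estimate.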
Example
Bernoulli likelihood: p(x|θ) = θ^x (1 − θ)^(1−x), p(x = 1|θ) = θ
Prior: Beta with α_0 = α_1 = 2
Given D = {x^(1), ..., x^(N)} with m heads (1) and N − m tails (0); here D = (1, 1, 1), so N = 3, m = 3
Posterior: Beta with α_1′ = 5, α_0′ = 2
θ_MAP = argmax_θ P(θ|D) = (α_1′ − 1) / (α_1′ + α_0′ − 2) = 4/5
Bernoulli: predictive distribution
Training samples: D = {x^(1), ..., x^(N)}
P(θ) = Beta(θ|α_1, α_0) ∝ θ^(α_1 − 1) (1 − θ)^(α_0 − 1)
P(θ|D) = Beta(θ|α_1 + m, α_0 + N − m) ∝ θ^(α_1 + m − 1) (1 − θ)^(α_0 + N − m − 1)
P(x|D) = ∫ P(x|θ) P(θ|D) dθ = E_{P(θ|D)}[P(x|θ)]
P(x = 1|D) = E_{P(θ|D)}[θ] = (α_1 + m) / (α_0 + α_1 + N)
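The predictive probability is just the posterior mean; a short sketch (illustrative names), which for the uniform prior Beta(1, 1) recovers Laplace's rule of succession (m + 1)/(N + 2):

```python
def bernoulli_predictive(alpha1, alpha0, samples):
    """Posterior predictive P(x = 1 | D) = (alpha1 + m) / (alpha0 + alpha1 + N)."""
    m, N = sum(samples), len(samples)
    return (alpha1 + m) / (alpha0 + alpha1 + N)

# Uniform prior Beta(1, 1): (m + 1) / (N + 2)
print(bernoulli_predictive(1, 1, [1, 1, 1, 0]))  # (3 + 1) / (4 + 2) = 0.666...
```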
Dirichlet distribution
Input space: θ = (θ_1, ..., θ_K)^T, θ_k ∈ [0,1], Σ_{k=1}^K θ_k = 1
P(θ|α) = [Γ(α̂) / (Γ(α_1) ⋯ Γ(α_K))] Π_{k=1}^K θ_k^(α_k − 1) ∝ Π_{k=1}^K θ_k^(α_k − 1),   where α̂ = Σ_{k=1}^K α_k
E[θ_k] = α_k / α̂
Mode: θ̂_k = (α_k − 1) / (α̂ − K)
Dirichlet distribution: examples
α = [0.1, 0.1, 0.1],   α = [1, 1, 1],   α = [10, 10, 10]
The Dirichlet parameters determine both the prior beliefs and their strength: larger values of α correspond to more confidence in the prior belief (i.e., more imaginary samples).
Dirichlet distribution: example
α = [2, 2, 2],   α = [20, 2, 2]
Multinomial distribution: prior
The Dirichlet distribution is the conjugate prior of the multinomial:
P(θ|D, α) ∝ P(D|θ) P(θ|α) ∝ Π_{k=1}^K θ_k^(m_k + α_k − 1)
where m = (m_1, ..., m_K)^T are the sufficient statistics of the data.
P(θ|D, α) = Dir(θ|α + m):
Prior P(θ|α): θ ~ Dir(α_1, ..., α_K)
Posterior P(θ|D, α): θ ~ Dir(α_1 + m_1, ..., α_K + m_K)
Multinomial: predictive distribution
Training samples: D = {x^(1), ..., x^(N)}
P(θ) = Dir(θ|α_1, ..., α_K),   P(θ|D) = Dir(θ|α_1 + m_1, ..., α_K + m_K)
P(x|D) = ∫ P(x|θ) P(θ|D) dθ = E_{P(θ|D)}[P(x|θ)]
P(x_k = 1|D) = E_{P(θ|D)}[θ_k] = (α_k + m_k) / (α̂ + N)
This is a convex combination of the prior mean and the MLE:
(α_k + m_k) / (α̂ + N) = [α̂ / (α̂ + N)] (α_k / α̂) + [N / (α̂ + N)] (m_k / N)
Larger α̂ ⇒ more confidence in our prior.
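The Dirichlet predictive adds pseudo-counts to the observed counts; a small sketch (illustrative names), with α = 1 per state giving add-one (Laplace) smoothing:

```python
def dirichlet_predictive(alpha, counts):
    """P(x_k = 1 | D) = (alpha_k + m_k) / (alpha_hat + N): a convex
    combination of the prior mean alpha_k/alpha_hat and the MLE m_k/N."""
    total = sum(alpha) + sum(counts)
    return [(a + m) / total for a, m in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]   # add-one (Laplace) smoothing
counts = [3, 0, 1]        # m_k from N = 4 samples
pred = dirichlet_predictive(alpha, counts)
print(pred)               # [4/7, 1/7, 2/7]; the unseen state keeps nonzero mass
```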
Multinomial: predictive distribution
P(x|D) = Mult((α_1 + m_1)/(α̂ + N), ..., (α_K + m_K)/(α̂ + N))
Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples.
Example: MLE vs. Bayesian
Effect of different priors on smoothing of the parameter estimates: θ ~ Beta(10,10) vs. θ ~ Beta(1,1)
[Koller & Friedman Book]
Bayesian estimation: Gaussian with unknown μ (known σ)
P(x|μ) ~ N(μ, σ²),   prior P(μ) ~ N(μ_0, σ_0²)
P(μ|D) ∝ P(D|μ) P(μ) = [Π_{n=1}^N P(x^(n)|μ)] P(μ)
P(μ|D) ∝ exp{ −(1/2) [ (N/σ² + 1/σ_0²) μ² − 2 (Σ_{n=1}^N x^(n)/σ² + μ_0/σ_0²) μ ] }
Bayesian estimation: Gaussian with unknown μ (known σ)
P(μ|D) ~ N(μ_N, σ_N²), where
μ_N = [σ_0² / (N σ_0² + σ²)] Σ_{n=1}^N x^(n) + [σ² / (N σ_0² + σ²)] μ_0
1/σ_N² = N/σ² + 1/σ_0²
P(x|D) = ∫ P(x|μ) P(μ|D) dμ ∝ exp{ −(1/2) (x − μ_N)² / (σ² + σ_N²) }
P(x|D) ~ N(μ_N, σ² + σ_N²)
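The posterior update above can be computed directly in precision form; a minimal sketch (names are illustrative) showing how the posterior mean shrinks the data toward the prior mean μ_0:

```python
def gaussian_posterior(mu0, var0, var, samples):
    """Posterior over the mean of a Gaussian with known variance `var`,
    given the prior N(mu0, var0):
      1/var_N = N/var + 1/var0
      mu_N    = var_N * (sum(x)/var + mu0/var0)"""
    N = len(samples)
    var_N = 1.0 / (N / var + 1.0 / var0)
    mu_N = var_N * (sum(samples) / var + mu0 / var0)
    return mu_N, var_N

mu_N, var_N = gaussian_posterior(mu0=0.0, var0=1.0, var=1.0,
                                 samples=[2.0, 2.0, 2.0, 2.0])
print(mu_N, var_N)   # 1.6 0.2: sample mean 2.0 shrunk toward mu0 = 0
# predictive variance for a new point: var + var_N
```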
Conjugate priors for the Gaussian distribution
Known μ, unknown σ²:
The conjugate prior for the precision λ = 1/σ² is a Gamma with shape a_0 and rate (inverse scale) b_0:
Gam(λ|a, b) = [b^a / Γ(a)] λ^(a−1) exp(−bλ)
Equivalently, the conjugate prior for σ² is the Inverse-Gamma:
InvGam(σ²|a, b) = [b^a / Γ(a)] (σ²)^(−a−1) exp(−b/σ²)
Unknown μ and unknown σ²:
The conjugate prior P(μ, λ) is the Normal-Gamma: P(μ, λ) = N(μ|μ_0, (βλ)^(−1)) Gam(λ|a_0, b_0)
Multivariate case: the conjugate prior P(μ, Λ) is the Normal-Wishart.
Bayesian estimation: summary
The Bayesian approach treats parameters as random variables; learning is then a special case of inference.
It is asymptotically equivalent to MLE.
For some parametric distributions, the posterior has a closed form (obtained from the prior parameters and the sufficient statistics of the data).
Learning in Bayesian networks
MLE: the likelihood decomposes over the nodes' conditional distributions.
Bayesian: we can make some assumptions (global and local parameter independence) to simplify the learning.
Likelihood decomposition: example
The directed factorization causes the likelihood to decompose locally:
P(X|θ) = P(X_1|θ_1) P(X_2|X_1, θ_2) P(X_3|X_1, θ_3) P(X_4|X_2, X_3, θ_4)
(Network: X_1 → X_2, X_1 → X_3, X_2 → X_4, X_3 → X_4)
Decomposition of the likelihood
If we assume the parameters for each CPD are disjoint (i.e., disjoint parameter sets θ_i ≡ θ_{X_i|Pa_{X_i}} for different variables), and the data is complete, then the likelihood function decomposes over the nodes:
L(θ; D) = P(D|θ) = Π_{n=1}^N Π_{i∈V} P(x_i^(n)|pa_{X_i}^(n), θ_i) = Π_{i∈V} [Π_{n=1}^N P(x_i^(n)|pa_{X_i}^(n), θ_i)] = Π_{i∈V} L_i(θ_i; D)
where L_i(θ_i; D) = Π_{n=1}^N P(x_i^(n)|pa_{X_i}^(n), θ_i) is the local likelihood of node i.
Hence, with θ = [θ_1, ..., θ_|V|]:
θ̂ = argmax_θ L(θ; D)   ⇔   θ̂_i = argmax_{θ_i} L_i(θ_i; D) for each i ∈ V
Local decomposition of the likelihood: table-CPDs
The choices of parameters given different value assignments to the parents are independent of each other.
Consider only the data points that agree with each parent assignment:
L_i(θ_{X_i|Pa_{X_i}}; D) = Π_{n=1}^N θ_{x_i^(n)|pa_{X_i}^(n)} = Π_{u∈Val(Pa_{X_i})} Π_{x∈Val(X_i)} θ_{x|u}^{Σ_n I[x_i^(n) = x, pa_{X_i}^(n) = u]}
Local decomposition of the likelihood: table-CPDs
Each row of the table θ_i ≡ θ_{X_i|Pa_{X_i}} is a multinomial distribution (columns X_i = 1, ..., K):
Pa_{X_i} = u_1:  θ_{1|u_1}^i, ..., θ_{K|u_1}^i
...
Pa_{X_i} = u_L:  θ_{1|u_L}^i, ..., θ_{K|u_L}^i
with each row summing to one: Σ_k θ_{k|u}^i = 1.
With the counts m_i(k, u) = Σ_{n=1}^N I[x_i^(n) = k, pa_{X_i}^(n) = u], the MLE is:
θ̂_{k|u}^i = m_i(k, u) / Σ_{k′} m_i(k′, u) = m_i(k, u) / Σ_{n=1}^N I[pa_{X_i}^(n) = u]
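The per-row estimate above is computable by bucketing samples on the parent assignment; a minimal sketch (the data representation and names are illustrative, not from the lecture):

```python
from collections import defaultdict

def table_cpd_mle(samples, child, parents):
    """MLE for a table-CPD: theta_{k|u} = m(k, u) / sum_k' m(k', u).
    `samples` is a list of dicts mapping variable names to values.
    Returns {parent_assignment_tuple: {child_value: probability}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in samples:
        u = tuple(s[p] for p in parents)   # parent assignment of this sample
        counts[u][s[child]] += 1
    cpd = {}
    for u, row in counts.items():
        total = sum(row.values())          # number of samples agreeing with u
        cpd[u] = {k: m / total for k, m in row.items()}
    return cpd

data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 1},
        {"X1": 0, "X2": 0}, {"X1": 1, "X2": 1}]
cpd = table_cpd_mle(data, child="X2", parents=["X1"])
print(cpd[(0,)][1])   # P(X2=1 | X1=0) = 2/3
```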
Overfitting
For large parent sets, |Val(Pa_{X_i})| = Π_{Z∈Pa_{X_i}} |Val(Z)| is large, so most buckets will have very few samples.
The number of parameters increases exponentially with |Pa_{X_i}| ⇒ poor estimation.
With limited data, overfitting occurs; we often get better generalization with simpler structures.
Zero-count problem (the sparse-data problem): e.g., consider a naïve Bayes classifier and an email d containing a word that does not occur in any training document for the class c; then P(d|c) = 0.
ML example: naïve Bayes classifier
(Figure: plate notation. Training phase: class prior π over Y^(n) ∈ {1, ..., C} and per-class feature parameters θ_i = θ_{X_i|Y} = (θ_{X_i|1}, ..., θ_{X_i|C}) over X_i^(n), i = 1, ..., D, n = 1, ..., N. Test phase: the learned π and θ_1, ..., θ_D generate Y and X_1, ..., X_D.)
ML example: naïve Bayes classifier
L(θ; D) = Π_{n=1}^N P(y^(n)|π) Π_{i=1}^D P(x_i^(n)|y^(n), θ_{X_i|Y})
π̂_c = (1/N) Σ_{n=1}^N I(y^(n) = c)
For discrete inputs or features (X_i ∈ {1, ..., K}):
θ̂_{k|c}^i = p(X_i = k|Y = c) = Σ_{n=1}^N I(x_i^(n) = k, y^(n) = c) / Σ_{n=1}^N I(y^(n) = c)
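Both formulas are count ratios, so naïve Bayes training is a single pass over the data; a minimal sketch (array layout and names are illustrative):

```python
def naive_bayes_mle(X, y, K, C):
    """MLE for a naive Bayes model with D discrete features in {0..K-1}
    and labels in {0..C-1}.  Returns (pi, theta) with
    pi[c] = P(Y=c) and theta[i][c][k] = P(X_i = k | Y = c)."""
    N, D = len(X), len(X[0])
    class_counts = [sum(1 for t in y if t == c) for c in range(C)]
    pi = [m / N for m in class_counts]
    theta = [[[0.0] * K for _ in range(C)] for _ in range(D)]
    for x, t in zip(X, y):                 # accumulate joint counts
        for i in range(D):
            theta[i][t][x[i]] += 1
    for i in range(D):                     # normalize per class
        for c in range(C):
            for k in range(K):
                theta[i][c][k] /= class_counts[c]
    return pi, theta

X = [[0, 1], [0, 0], [1, 1], [1, 1]]
y = [0, 0, 1, 1]
pi, theta = naive_bayes_mle(X, y, K=2, C=2)
print(pi)              # [0.5, 0.5]
print(theta[0][1][1])  # P(X_1 = 1 | Y = 1) = 1.0
```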
Recall: maximum conditional likelihood
Example: a discriminative classifier needs to learn the conditional distribution (not the joint distribution).
Given D = {(x^(n), y^(n))}_{n=1}^N and a parametric conditional distribution P(y|x; θ), we find:
θ̂ = argmax_θ Π_{n=1}^N P(y^(n)|x^(n); θ)
We will also see maximum conditional likelihood for more general CPDs than tabular ones.
Maximum conditional likelihood: linear regression
P(y|x, w) = N(y|w^T φ(x), σ²),   linear: φ(x) = [1, x_1, ..., x_D]
D = {(x^(n), y^(n))}_{n=1}^N   (model: x → y with parameter w)
L(Y|X, w; D) = Π_{n=1}^N P(y^(n)|x^(n), w) = Π_{n=1}^N N(y^(n)|w^T φ(x^(n)), σ²)
ŵ = argmax_w Π_{n=1}^N N(y^(n)|w^T φ(x^(n)), σ²) = (Φ^T Φ)^(−1) Φ^T Y
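For the one-dimensional case φ(x) = [1, x], the normal equations (Φ^T Φ) w = Φ^T Y are a 2×2 system that can be solved by hand; a minimal sketch (names are illustrative):

```python
def fit_linear(xs, ys):
    """Closed-form ML solution w = (Phi^T Phi)^{-1} Phi^T y for the
    1-D case phi(x) = [1, x], via the 2x2 normal equations."""
    N = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = N * sxx - sx * sx            # determinant of Phi^T Phi
    w0 = (sxx * sy - sx * sxy) / det   # intercept
    w1 = (N * sxy - sx * sy) / det     # slope
    return w0, w1

print(fit_linear([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # (1.0, 2.0): y = 1 + 2x
```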
Maximum conditional likelihood: logistic regression
P(y = 1|x) = σ(w^T φ(x)),   linear LR: φ(x) = [1, x_1, ..., x_D]
P(y|x, w) = Ber(y|σ(w^T φ(x))),   where σ(z) = 1/(1 + exp(−z))
D = {(x^(n), y^(n))}_{n=1}^N, y ∈ {0,1}   (model: x → y with parameter w)
L(Y|X, w; D) = Π_{n=1}^N P(y^(n)|x^(n), w) = Π_{n=1}^N σ(w^T φ(x^(n)))^(y^(n)) (1 − σ(w^T φ(x^(n))))^(1−y^(n))
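Unlike linear regression, this likelihood has no closed-form maximizer, but its gradient is simple: ∇_w = Σ_n (y^(n) − σ(w^T φ^(n))) φ^(n). A batch gradient-ascent sketch (hyperparameters lr and iters are illustrative choices, not from the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Gradient ascent on the conditional log-likelihood of logistic
    regression with phi(x) = [1, x_1, ..., x_D]."""
    D = len(X[0]) + 1
    w = [0.0] * D
    for _ in range(iters):
        grad = [0.0] * D
        for x, t in zip(X, y):
            phi = [1.0] + list(x)
            # gradient contribution: (y_n - sigma(w^T phi_n)) * phi_n
            err = t - sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
            for j in range(D):
                grad[j] += err * phi[j]
        w = [wj + lr * g for wj, g in zip(w, grad)]
    return w

w = fit_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(sigmoid(w[0]) < 0.5, sigmoid(w[0] + 3 * w[1]) > 0.5)  # True True
```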
MLE for Bayesian networks: summary
For a BN with disjoint sets of parameters in the CPDs, the likelihood decomposes as a product of local likelihood functions on the nodes; thus, we optimize the likelihoods of different nodes separately.
For table-CPDs, the local likelihood further decomposes as a product of multinomial likelihoods, one for each parent combination.
MLE suffers from the sparse-data problem.
Bayesian learning for BNs: global parameter independence
Global parameter independence assumption: P(θ) = Π_{i∈V} P(θ_i), where θ_i ≡ θ_{X_i|Pa_{X_i}}
P(θ|D) = (1/P(D)) P(D|θ) P(θ) = (1/P(D)) Π_{i∈V} [Π_{n=1}^N P(x_i^(n)|pa_{X_i}^(n), θ_i)] P(θ_i) = (1/P(D)) Π_{i∈V} L_i(θ_i; D) P(θ_i)
⇒ The posteriors of the θ_i are independent given complete data and independent priors.
(Figure: meta-network with parameter nodes θ_1, θ_2, θ_3 attached to X_1^(n), X_2^(n), X_3^(n).)
Bayesian learning for BNs: global parameter independence
How can we read this posterior independence directly from the structure of the (meta-)Bayesian network when P(θ) satisfies global parameter independence?
Bayesian learning for BNs: global parameter independence
Predictive distribution:
P(x|D) = ∫ P(x|θ, D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ   (instances are independent given the parameters, for complete data)
= ∫ Π_{i∈V} P(x_i|pa_{X_i}, θ_i) P(θ_i|D) dθ = Π_{i∈V} ∫ P(x_i|pa_{X_i}, θ_i) P(θ_i|D) dθ_i
(using ∬ f(x) g(y) dx dy = ∫ f(x) dx ∫ g(y) dy)
⇒ Solve the prediction problem for each CPD independently and then combine the results.
Local decomposition for table-CPDs
A prior P(θ_{X|Pa_X}) satisfies the local parameter independence assumption if:
P(θ_{X|Pa_X}) = Π_{u∈Val(Pa_X)} P(θ_{X|u})
The groups of parameters in different rows of a CPD are then independent.
The posteriors P(θ_{X|u}|D) and P(θ_{X|u′}|D) are independent despite the v-structure on X, because Pa_X acts like a multiplexer.
(Figure: Y^(n) ∈ {1, ..., C} selecting among θ_{X|1}, ..., θ_{X|C} for X^(n), n = 1, ..., N.)
Global and local parameter independence
Let the data be complete and the CPDs be tabular. If the prior P(θ) satisfies global and local parameter independence, then:
P(θ|D) = Π_{i∈V} Π_{u∈Val(Pa_{X_i})} P(θ_{X_i|u}|D)
With K = |Val(X_i)| and a Dirichlet prior per row, θ_{X_i|u} ~ Dir(α_{1|u}^i, ..., α_{K|u}^i):
θ_{X_i|u}|D ~ Dir(α_{1|u}^i + m_i(1, u), ..., α_{K|u}^i + m_i(K, u))
P(X_i = k|u, D) = (α_{k|u}^i + m_i(k, u)) / Σ_{k′=1}^K (α_{k′|u}^i + m_i(k′, u))
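Under these independence assumptions, Bayesian CPD learning is the table-CPD count estimate with Dirichlet pseudo-counts added per row; a minimal sketch with a symmetric prior (data representation and names are illustrative):

```python
from collections import defaultdict

def bayes_cpd_estimate(samples, child, parents, K, alpha=1.0):
    """Posterior-predictive estimate for a table-CPD under a symmetric
    Dirichlet(alpha, ..., alpha) prior on each parent row:
    P(X = k | u, D) = (alpha + m(k, u)) / (K * alpha + m(u))."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in samples:
        counts[tuple(s[p] for p in parents)][s[child]] += 1
    def prob(k, u):
        row = counts[tuple(u)]
        return (alpha + row[k]) / (K * alpha + sum(row.values()))
    return prob

data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 1}, {"X1": 0, "X2": 1}]
p = bayes_cpd_estimate(data, "X2", ["X1"], K=2)
print(p(1, (0,)))   # (1 + 3) / (2 + 3) = 0.8, not 1.0 as the MLE would give
print(p(0, (1,)))   # 0.5: an unseen parent row falls back to the prior mean
```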
Priors for Bayesian learning
The priors α_{X_i|Pa_{X_i}} must be defined: α_{x|u} for all x ∈ Val(X_i), u ∈ Val(Pa_{X_i}).
K2 prior: a fixed value for all hyperparameters, α_{x|u} = 1.
Alternatively, we can store an equivalent sample size α and a prior distribution P′ over the variables in the network: α_{x|u} = α P′(X_i = x, Pa_{X_i} = u).
BDe prior: define P′ as a set of independent marginals over the X_i's.
Naïve Bayes example: Bayesian learning
(Figure: plate notation for the training phase. Hyperparameters α over π and α_{X_i|c} over θ_{X_i|c}, c = 1, ..., C; Y^(n) ∈ {1, ..., C} and X_i^(n), i = 1, ..., D, n = 1, ..., N; with θ_i = θ_{X_i|Y} = (θ_{X_i|1}, ..., θ_{X_i|C}).)
Naïve Bayes example: Bayesian learning
Y ~ Mult(π) (i.e., P(Y = c|π) = π_c), with prior P(π|α_1, ..., α_C) = Dir(π|α_1, ..., α_C)
P(π|D, α_1, ..., α_C) = Dir(π|α_1 + m_1, ..., α_C + m_C),   where m_c = Σ_{n=1}^N I(y^(n) = c)
P(Y = c|D) = (α_c + m_c) / Σ_{c′=1}^C (α_{c′} + m_{c′})
For discrete inputs or features (X_i ∈ {1, ..., K}):
P(X_i = k|Y = c, D) = (α_{k|c}^i + m_i(k, c)) / Σ_{k′=1}^K (α_{k′|c}^i + m_i(k′, c)),   where m_i(k, c) = Σ_{n=1}^N I(x_i^(n) = k, y^(n) = c)
Global & local independence
For nodes with no parents, the parameters define a single distribution; Bayesian or ML learning can be used as in simple density estimation on one variable.
More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of the parent variables.
The learning objective decomposes into multiple terms, one for the subset of training data with each parent configuration; we apply independent Bayesian or ML learning to each.
Shared parameters
Sharing CPDs, or sharing parameters within a single CPD:
Shared CPDs: P(X_i|Pa_{X_i}, θ^s) = P(X_j|Pa_{X_j}, θ^s)
Shared rows within CPDs: P(X_i|u_1, θ_i^s) = P(X_j|u_2, θ_i^s)
MLE: for networks with shared CPDs, the sufficient statistics accumulate over all uses of the CPD.
Aggregate sufficient statistics: collect all the instances generated from the same conditional distribution and combine their sufficient statistics.
Markov chain
P(X_1, ..., X_T|θ) = P(X_1|π) Π_{t=2}^T P(X_t|X_{t−1}, A_{t−1,t})
(Chain: X_1 → X_2 → ... → X_{T−1} → X_T)
With shared parameters (a homogeneous chain): P(X_1, ..., X_T|θ) = P(X_1|π) Π_{t=2}^T P(X_t|X_{t−1}, A)
Markov chain
Parameters: θ = (π, A)
Initial state probability: π_i = P(X_1 = i), 1 ≤ i ≤ K
State transition probability: A_ji = θ_{i|j} = P(X_{t+1} = i|X_t = j), with Σ_{i=1}^K A_ji = 1, for 1 ≤ i, j ≤ K
Markov chain: parameter learning by MLE
P(D|θ) = Π_{n=1}^N P(x_1^(n)|π) Π_{t=2}^T P(x_t^(n)|x_{t−1}^(n), A)
Â_ji = Σ_{n=1}^N Σ_{t=2}^T I(x_{t−1}^(n) = j, x_t^(n) = i) / Σ_{n=1}^N Σ_{t=2}^T I(x_{t−1}^(n) = j)
π̂_i = (1/N) Σ_{n=1}^N I(x_1^(n) = i)
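The MLE is again a ratio of counts, now pooled over all time steps because the transition matrix A is shared; a minimal sketch over 0-indexed states (names are illustrative):

```python
def markov_mle(sequences, K):
    """MLE for a homogeneous Markov chain over states {0, ..., K-1}:
    pi from the first states, A[j][i] from pooled transition counts."""
    init = [0] * K
    trans = [[0] * K for _ in range(K)]
    for seq in sequences:
        init[seq[0]] += 1
        for j, i in zip(seq, seq[1:]):   # consecutive pairs (x_{t-1}, x_t)
            trans[j][i] += 1
    N = len(sequences)
    pi = [c / N for c in init]
    A = []
    for row in trans:
        tot = sum(row)
        # fall back to uniform for states never visited as a source
        A.append([c / tot if tot else 1.0 / K for c in row])
    return pi, A

pi, A = markov_mle([[0, 1, 1, 0], [0, 0, 1, 1]], K=2)
print(pi)       # [1.0, 0.0]: both sequences start in state 0
print(A[0][1])  # P(next = 1 | current = 0) = 2/3
```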
Markov chain: Bayesian learning
Assign a Dirichlet prior to each row of the transition matrix A: A_{j,·} ~ Dir(α_{j,1}, ..., α_{j,K}), where A_ji = P(X_{t+1} = i|X_t = j).
P(X_{t+1} = i|X_t = j, D, α_{j,·}) = [Σ_{n=1}^N Σ_{t=2}^T I(x_{t−1}^(n) = j, x_t^(n) = i) + α_{j,i}] / [Σ_{n=1}^N Σ_{t=2}^T I(x_{t−1}^(n) = j) + Σ_{i′=1}^K α_{j,i′}]
Hyper-parameters in the Bayesian approach
So far we have considered either parameter independencies or parameter sharing.
Hierarchical priors give us a flexible language to introduce dependencies in the priors over parameters.
This is useful when we have a small number of examples relevant to each parameter but believe that some parameters are reasonably similar; in such situations, hierarchical priors spread the effect of the observations among parameters with shared hyper-parameters.
However, when we use hyper-parameters we transform our learning problem into one that includes a hidden variable.
Hyper-parameters: example [Koller & Friedman Book]
The effect of the prior is to shift both θ_{Y|x⁰} and θ_{Y|x¹} to be closer to each other.
Hyper-parameters: example [Koller & Friedman Book]
Useful if we believe that two variables Y and Z depend on X in a similar (but not identical) way.
References
Koller, D. and Friedman, N., Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009. Chapter 17.