Learning Bayesian networks: Given structure and completely observed data


Learning Bayesian networks: Given structure and completely observed data
Probabilistic Graphical Models
Sharif University of Technology, Spring 2017
Soleymani

Learning problem
Target: the true distribution P*, which may correspond to a model M* = (K*, θ*)
Hypothesis space: a specified family of probabilistic graphical models
Data: a set of instances sampled from P*
Learning goal: select a model M that constructs the best approximation to M* according to a performance metric

Learning tasks on graphical models
Parameter learning / structure learning
Completely observed / partially observed data
Directed model / undirected model

Parameter learning in directed models (complete data)
We assume that the structure of the model is known: consider learning parameters for a BN with a given structure.
Goal: estimate CPDs from a dataset D = {x^(1), ..., x^(N)} of independent, identically distributed (i.i.d.) training samples.
Each training sample x^(n) = (x_1^(n), ..., x_L^(n)) is a vector in which every element x_i^(n) is known (no missing values, no hidden variables).

Density estimation review
We use density estimation to solve this learning problem.
Density estimation: estimating the probability density function P(x), given a set of data points {x^(i)} drawn from it.
Parametric methods: assume that P(x) has a specific functional form with a number of adjustable parameters; use MLE or a Bayesian estimate.
MLE: need to determine a single θ given {x^(1), ..., x^(N)}; MLE can suffer from overfitting.
Bayesian estimate: maintain a probability distribution P(θ) over the spectrum of hypotheses; needs a prior distribution on the parameters.

Density estimation: graphical model view (i.i.d. assumption). In plate notation, a shared parameter node θ is the parent of every observation X^(n), n = 1, ..., N; in the Bayesian treatment, a hyperparameter node α is the parent of θ, which in turn is the parent of each X^(n).

Maximum Likelihood Estimation (MLE)
The likelihood is the conditional probability of the observations D = {x^(1), x^(2), ..., x^(N)} given the value of the parameters θ.
Assuming i.i.d. (independent, identically distributed) samples:
P(D | θ) = P(x^(1), ..., x^(N) | θ) = ∏_{n=1}^{N} P(x^(n) | θ)   (the likelihood of θ w.r.t. the samples)
Maximum likelihood estimation:
θ_ML = argmax_θ P(D | θ) = argmax_θ ∑_{n=1}^{N} ln P(x^(n) | θ)
MLE has a closed-form solution for many parametric distributions.

MLE: Bernoulli distribution
Given D = {x^(1), x^(2), ..., x^(N)} with m heads (1) and N - m tails (0):
p(x | θ) = θ^x (1 - θ)^(1-x),   p(x = 1 | θ) = θ
p(D | θ) = ∏_{n=1}^{N} p(x^(n) | θ) = ∏_{n=1}^{N} θ^{x^(n)} (1 - θ)^{1 - x^(n)}
ln p(D | θ) = ∑_{n=1}^{N} ln p(x^(n) | θ) = ∑_{n=1}^{N} { x^(n) ln θ + (1 - x^(n)) ln(1 - θ) }
Setting ∂ ln p(D | θ) / ∂θ = 0 gives θ_ML = (1/N) ∑_{n=1}^{N} x^(n) = m / N
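A minimal NumPy sketch of this closed-form result (the coin data below are synthetic, invented for illustration), checking it against a numerical maximization of the log-likelihood:

```python
import numpy as np

# Synthetic coin flips: 1 = heads, 0 = tails (illustrative data only)
rng = np.random.default_rng(0)
D = rng.binomial(1, 0.7, size=100)

# Closed-form MLE: fraction of heads, m / N
theta_ml = D.mean()

# Numerical check: evaluate the log-likelihood on a grid of theta values
thetas = np.linspace(1e-3, 1 - 1e-3, 999)
log_lik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)

print("closed-form MLE:", theta_ml)
print("grid-search MLE:", thetas[np.argmax(log_lik)])
```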

MLE: Multinomial distribution
Multinomial (categorical) distribution on a variable with K states.
Variable (1-of-K coding): x = (x_1, ..., x_K), x_k ∈ {0, 1}, ∑_{k=1}^{K} x_k = 1
Parameter space: θ = (θ_1, ..., θ_K), θ_k ∈ [0, 1], ∑_{k=1}^{K} θ_k = 1
P(x | θ) = ∏_{k=1}^{K} θ_k^{x_k},   P(x_k = 1) = θ_k
For K = 3, the constraint θ_1 + θ_2 + θ_3 = 1 with θ_k ∈ [0, 1] defines a simplex: the set of valid parameters.

MLE: Multinomial distribution
D = {x^(1), x^(2), ..., x^(N)}
P(D | θ) = ∏_{n=1}^{N} P(x^(n) | θ) = ∏_{n=1}^{N} ∏_{k=1}^{K} θ_k^{x_k^(n)} = ∏_{k=1}^{K} θ_k^{m_k},   where m_k = ∑_{n=1}^{N} x_k^(n)
Maximize with a Lagrange multiplier for the constraint ∑_k θ_k = 1:
L(θ, λ) = ln p(D | θ) + λ (1 - ∑_{k=1}^{K} θ_k)
Setting the derivatives to zero gives θ_k,ML = m_k / N
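A small NumPy sketch of the multinomial MLE as normalized counts (the die rolls and their probabilities are made up, purely for illustration):

```python
import numpy as np

# Synthetic rolls of a K-sided die, encoded as integer outcomes 0..K-1
rng = np.random.default_rng(1)
K = 3
true_theta = np.array([0.5, 0.3, 0.2])
rolls = rng.choice(K, size=500, p=true_theta)

# Sufficient statistics m_k: count of each outcome
m = np.bincount(rolls, minlength=K)

# MLE: theta_k = m_k / N
theta_ml = m / m.sum()
print("counts:", m)
print("theta_ML:", theta_ml)
```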

MLE: Gaussian with unknown μ (known σ)
ln P(x^(n) | μ) = -(1/2) ln(2πσ²) - (1/(2σ²)) (x^(n) - μ)²
∂ ln P(D | μ) / ∂μ = ∑_{n=1}^{N} ∂ ln p(x^(n) | μ) / ∂μ = (1/σ²) ∑_{n=1}^{N} (x^(n) - μ) = 0
⟹ μ_ML = (1/N) ∑_{n=1}^{N} x^(n)

Bayesian approach
Treats the parameters θ as random variables with an a priori distribution, which utilizes the available prior information about the unknown parameters.
As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector θ.
The samples D convert the prior density P(θ) into a posterior density P(θ | D).
We keep track of beliefs about the values of θ and use these beliefs to reach conclusions.

Maximum A Posteriori (MAP) estimation
θ_MAP = argmax_θ p(θ | D)
Since p(θ | D) ∝ p(D | θ) p(θ):
θ_MAP = argmax_θ p(D | θ) p(θ)
Example of a prior distribution: p(θ) = N(θ | θ_0, σ²)
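As a concrete instance of the slide's example prior (a Gaussian on the parameter), here is a hedged NumPy sketch of MAP estimation for the mean of a Gaussian with known variance; all data and hyperparameter values are invented:

```python
import numpy as np

# Synthetic data: N samples from a Gaussian with unknown mean, known sigma
rng = np.random.default_rng(2)
sigma = 1.0
X = rng.normal(loc=2.5, scale=sigma, size=30)

# Gaussian prior on the mean: p(theta) = N(theta | theta0, sigma0^2)
theta0, sigma0 = 0.0, 0.5

# Closed-form MAP (for this conjugate pair it equals the posterior mean)
N = len(X)
theta_map = (sigma0**2 * X.sum() + sigma**2 * theta0) / (N * sigma0**2 + sigma**2)

# Numerical check: maximize log p(D|theta) + log p(theta) on a grid
thetas = np.linspace(-2, 5, 7001)
log_post = (-((X[:, None] - thetas) ** 2).sum(0) / (2 * sigma**2)
            - (thetas - theta0) ** 2 / (2 * sigma0**2))
print("closed-form MAP:", theta_map)
print("grid-search MAP:", thetas[np.argmax(log_post)])
print("MLE (sample mean):", X.mean())
```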

Bayesian approach: Predictive distribution
Given a set of samples D = {x^(i)}_{i=1}^{N}, a prior distribution on the parameters P(θ), and the form of the distribution P(x | θ):
We find P(θ | D) and use it to specify P(x | D) on new data as an estimate of P(x):
P(x | D) = ∫ P(x, θ | D) dθ = ∫ P(x | D, θ) P(θ | D) dθ = ∫ P(x | θ) P(θ | D) dθ   (predictive distribution)
Analytical solutions exist only for very special forms of the involved functions.

Conjugate priors
We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties.
A conjugate prior is chosen such that the posterior distribution, which is proportional to p(D | θ) p(θ), has the same functional form as the prior:
P(θ | α') ∝ P(D | θ) P(θ | α),   where the posterior hyperparameters α' are determined by α and D, and both sides have the same functional form.

Prior for the Bernoulli likelihood
Beta distribution over θ ∈ [0, 1]:
Beta(θ | α_1, α_0) ∝ θ^{α_1 - 1} (1 - θ)^{α_0 - 1}
Beta(θ | α_1, α_0) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^{α_1 - 1} (1 - θ)^{α_0 - 1}
E[θ] = α_1 / (α_0 + α_1)
Most probable θ (mode): θ̂ = (α_1 - 1) / (α_1 - 1 + α_0 - 1)
The Beta distribution is the conjugate prior of the Bernoulli: P(x | θ) = θ^x (1 - θ)^{1-x}

Beta distribution (density plots for several hyperparameter settings).

Bernoulli likelihood: posterior
Given D = {x^(1), x^(2), ..., x^(N)} with m heads (1) and N - m tails (0), where m = ∑_{i=1}^{N} x^(i):
p(θ | D) ∝ p(D | θ) p(θ) = [∏_{i=1}^{N} θ^{x^(i)} (1 - θ)^{1 - x^(i)}] Beta(θ | α_1, α_0) ∝ θ^{m + α_1 - 1} (1 - θ)^{N - m + α_0 - 1}
p(θ | D) = Beta(θ | α_1', α_0') ∝ θ^{α_1' - 1} (1 - θ)^{α_0' - 1},   with α_1' = α_1 + m and α_0' = α_0 + (N - m)
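A short NumPy sketch of this conjugate update (synthetic coin data; the prior hyperparameters are made up):

```python
import numpy as np

# Synthetic coin data and an illustrative Beta(alpha1, alpha0) prior
rng = np.random.default_rng(3)
D = rng.binomial(1, 0.7, size=20)
alpha1, alpha0 = 2.0, 2.0          # prior pseudo-counts for heads / tails

m = D.sum()                        # number of heads
N = len(D)

# Conjugate update: posterior is Beta(alpha1 + m, alpha0 + N - m)
post_alpha1 = alpha1 + m
post_alpha0 = alpha0 + (N - m)

posterior_mean = post_alpha1 / (post_alpha1 + post_alpha0)
posterior_mode = (post_alpha1 - 1) / (post_alpha1 + post_alpha0 - 2)  # MAP
print("posterior:", (post_alpha1, post_alpha0))
print("mean:", posterior_mean, " MAP:", posterior_mode, " MLE:", m / N)
```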

Example
Prior: Beta with α_0 = α_1 = 2
Likelihood: Bernoulli, p(x | θ) = θ^x (1 - θ)^{1-x}, p(x = 1 | θ) = θ
Data: D = {1, 1, 1}, so N = 3 and m = 3
Posterior: Beta with α_1' = 5, α_0' = 2
θ_MAP = argmax_θ P(θ | D) = (α_1' - 1) / (α_1' - 1 + α_0' - 1) = 4/5

Bernoulli: predictive distribution
Training samples: D = {x^(1), ..., x^(N)}
P(θ) = Beta(θ | α_1, α_0) ∝ θ^{α_1 - 1} (1 - θ)^{α_0 - 1}
P(θ | D) = Beta(θ | α_1 + m, α_0 + N - m) ∝ θ^{α_1 + m - 1} (1 - θ)^{α_0 + N - m - 1}
P(x | D) = ∫ P(x | θ) P(θ | D) dθ = E_{P(θ|D)}[P(x | θ)]
P(x = 1 | D) = E_{P(θ|D)}[θ] = (α_1 + m) / (α_0 + α_1 + N)
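A one-line consequence worth sketching in NumPy: the Bayesian predictive probability of heads is a "smoothed" frequency, with the prior acting as hypothetical extra counts (the data reuse the three-heads example; the prior is illustrative):

```python
import numpy as np

D = np.array([1, 1, 1])            # the three heads from the example slide
alpha1, alpha0 = 2.0, 2.0          # illustrative Beta prior

m, N = D.sum(), len(D)
p_heads_bayes = (alpha1 + m) / (alpha0 + alpha1 + N)   # predictive P(x=1 | D)
p_heads_mle = m / N                                    # compare: MLE gives 1.0
print(p_heads_bayes, p_heads_mle)                      # 0.714... vs 1.0
```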

Dirichlet distribution
Input space: θ = (θ_1, ..., θ_K)^T, θ_k ∈ [0, 1], ∑_{k=1}^{K} θ_k = 1
P(θ | α) = [Γ(ᾱ) / (Γ(α_1) ⋯ Γ(α_K))] ∏_{k=1}^{K} θ_k^{α_k - 1} ∝ ∏_{k=1}^{K} θ_k^{α_k - 1},   where ᾱ = ∑_{k=1}^{K} α_k
E[θ_k] = α_k / ᾱ
Mode: θ̂_k = (α_k - 1) / (ᾱ - K)

Dirichlet distribution: examples
(Density plots for α = [0.1, 0.1, 0.1], α = [1, 1, 1], and α = [10, 10, 10].)
The Dirichlet parameters determine both the prior beliefs and their strength: larger values of α correspond to more confidence in the prior belief (i.e., more imaginary samples).

Dirichlet distribution: example
(Density plots for α = [2, 2, 2] and α = [20, 2, 2].)

Multinomial distribution: prior
The Dirichlet distribution is the conjugate prior of the multinomial:
P(θ | D, α) ∝ P(D | θ) P(θ | α) ∝ ∏_{k=1}^{K} θ_k^{m_k + α_k - 1}
where m = (m_1, ..., m_K)^T are the sufficient statistics of the data.
P(θ | D, α) = Dir(θ | α + m):
Prior: θ ~ Dir(α_1, ..., α_K)
Posterior: θ | D, α ~ Dir(α_1 + m_1, ..., α_K + m_K)

Multinomial: predictive distribution
Training samples: D = {x^(1), ..., x^(N)}
P(θ) = Dir(θ | α_1, ..., α_K)
P(θ | D) = Dir(θ | α_1 + m_1, ..., α_K + m_K)
P(x | D) = ∫ P(x | θ) P(θ | D) dθ = E_{P(θ|D)}[P(x | θ)]
P(x_k = 1 | D) = E_{P(θ|D)}[θ_k] = (α_k + m_k) / (ᾱ + N)
The predictive probability is a convex combination of the prior mean and the MLE:
(α_k + m_k) / (ᾱ + N) = [ᾱ / (ᾱ + N)] (α_k / ᾱ) + [N / (ᾱ + N)] (m_k / N)
Larger ᾱ means more confidence in our prior.
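A NumPy sketch of the Dirichlet-multinomial predictive (the counts and prior are invented for illustration), including the convex-combination view:

```python
import numpy as np

m = np.array([12, 5, 3])           # observed counts for K = 3 outcomes (made up)
alpha = np.array([2.0, 2.0, 2.0])  # illustrative Dirichlet pseudo-counts

N = m.sum()
alpha_bar = alpha.sum()

p_pred = (alpha + m) / (alpha_bar + N)          # P(x_k = 1 | D)
prior_mean = alpha / alpha_bar
mle = m / N
mix = (alpha_bar / (alpha_bar + N)) * prior_mean + (N / (alpha_bar + N)) * mle

print(p_pred)
print(mix)      # identical: predictive = weighted mix of prior mean and MLE
```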

Multinomial: predictive distribution
P(x | D) = Mult( (α_1 + m_1)/(ᾱ + N), ..., (α_K + m_K)/(ᾱ + N) )
Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples.

Example: MLE vs. Bayesian
Effect of different priors on smoothing the parameter estimates, e.g., θ ~ Beta(10, 10) vs. θ ~ Beta(1, 1). [Koller & Friedman book]

Bayesian estimation: Gaussian distribution with unknown μ (known σ)
P(x | μ) = N(x | μ, σ²),   P(μ) = N(μ | μ_0, σ_0²)
P(μ | D) ∝ P(D | μ) P(μ) = [∏_{n=1}^{N} P(x^(n) | μ)] P(μ)
P(μ | D) ∝ exp{ -(1/2) [ (N/σ² + 1/σ_0²) μ² - 2 ( ∑_{n} x^(n)/σ² + μ_0/σ_0² ) μ ] }

Bayesian estimation: Gaussian distribution with unknown μ (known σ)
P(μ | D) = N(μ | μ_N, σ_N²), with
μ_N = [σ_0² / (N σ_0² + σ²)] ∑_{n=1}^{N} x^(n) + [σ² / (N σ_0² + σ²)] μ_0
1/σ_N² = 1/σ_0² + N/σ²
Predictive distribution: P(x | D) = ∫ P(x | μ) P(μ | D) dμ ∝ exp{ -(1/2) (x - μ_N)² / (σ² + σ_N²) }
P(x | D) = N(x | μ_N, σ² + σ_N²)
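A NumPy sketch of these closed-form updates on synthetic data (all numbers invented), comparing the posterior mean with the MLE:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 2.0                         # known observation std
mu0, sigma0 = 0.0, 1.0              # illustrative Gaussian prior on mu
X = rng.normal(loc=3.0, scale=sigma, size=25)
N = len(X)

# Posterior N(mu_N, sigma_N^2)
sigma_N2 = 1.0 / (N / sigma**2 + 1.0 / sigma0**2)
mu_N = sigma_N2 * (X.sum() / sigma**2 + mu0 / sigma0**2)

print("MLE (sample mean):", X.mean())
print("posterior mean:", mu_N, " posterior var:", sigma_N2)
print("predictive: N(mean=%.3f, var=%.3f)" % (mu_N, sigma**2 + sigma_N2))
```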

Conjugate priors for the Gaussian distribution
Known μ, unknown σ²:
The conjugate prior for the precision λ = 1/σ² is a Gamma with shape a_0 and rate (inverse scale) b_0:
Gam(λ | a, b) = [1/Γ(a)] b^a λ^{a-1} exp(-bλ)
Equivalently, the conjugate prior for σ² is the Inverse-Gamma:
InvGam(σ² | a, b) = [1/Γ(a)] b^a (σ²)^{-(a+1)} exp(-b/σ²)
Unknown μ and unknown σ²:
The conjugate prior P(μ, λ) is the Normal-Gamma: P(μ, λ) = N(μ | μ_0, (βλ)^{-1}) Gam(λ | a_0, b_0)
Multivariate case: the conjugate prior P(μ, Λ) is the Normal-Wishart.

Bayesian estimation: summary
The Bayesian approach treats parameters as random variables; learning is then a special case of inference.
It is asymptotically equivalent to MLE.
For some parametric distributions, the posterior has a closed form (obtained from the prior parameters and the sufficient statistics of the data).

Learning in Bayesian networks
MLE: the likelihood decomposes over the nodes' conditional distributions.
Bayesian: we can make some assumptions (global and local parameter independencies) to simplify the learning.

Likelihood decomposition: example
The directed factorization causes the likelihood to decompose locally:
P(X | θ) = P(X_1 | θ_1) P(X_2 | X_1, θ_2) P(X_3 | X_1, θ_3) P(X_4 | X_2, X_3, θ_4)
(Network: X_1 is the parent of X_2 and X_3; X_2 and X_3 are the parents of X_4.)

Decomposition of the likelihood
If we assume the parameters of each CPD are disjoint (i.e., disjoint parameter sets θ_{X_i | Pa_{X_i}} for different variables), and the data is complete, then the likelihood function decomposes over the nodes:
L(θ; D) = P(D | θ) = ∏_{n=1}^{N} ∏_{i ∈ V} P(x_i^(n) | pa_{X_i}^(n), θ_i) = ∏_{i ∈ V} [ ∏_{n=1}^{N} P(x_i^(n) | pa_{X_i}^(n), θ_i) ] = ∏_{i ∈ V} L_i(θ_i; D)
where θ_i ≡ θ_{X_i | Pa_{X_i}} and θ = [θ_1, ..., θ_{|V|}].
Hence θ̂ = argmax_θ L(θ; D) is obtained by solving θ̂_i = argmax_{θ_i} L_i(θ_i; D) separately for each node.

Local decomposition of the likelihood: table CPDs
The choices of parameters given different value assignments to the parents are independent of each other.
Consider only the data points that agree with each parent assignment:
L_i(θ_{X_i | Pa_{X_i}}; D) = ∏_{n=1}^{N} θ_{x_i^(n) | pa_{X_i}^(n)} = ∏_{u ∈ Val(Pa_{X_i})} ∏_{x ∈ Val(X_i)} θ_{x|u}^{ ∑_n I[x_i^(n) = x, pa_{X_i}^(n) = u] }

Local decomposition of the likelihood: table CPDs
Each row of the table CPD θ_i ≡ θ_{X_i | Pa_{X_i}} is a multinomial distribution θ_{X_i | u}, one row per parent assignment u ∈ {u_1, ..., u_L}:
Pa_{X_i} = u_1:  θ^i_{1|u_1}, ..., θ^i_{K|u_1}
...
Pa_{X_i} = u_L:  θ^i_{1|u_L}, ..., θ^i_{K|u_L}
(columns X_i = 1, ..., X_i = K, with ∑_k θ^i_{k|u} = 1 for each row)
Sufficient statistics: m_i(k, u) = ∑_{n=1}^{N} I[x_i^(n) = k, pa_{X_i}^(n) = u]
MLE: θ̂^i_{k|u} = m_i(k, u) / ∑_{k'} m_i(k', u) = m_i(k, u) / ∑_{n=1}^{N} I[pa_{X_i}^(n) = u]
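A compact NumPy sketch of table-CPD MLE by counting, written as a generic helper (the function name and the toy data are my own; the data are uniform random bits, purely to exercise the counting, using the variable layout of the earlier 4-node example):

```python
import numpy as np

def mle_table_cpd(data, child, parents, card):
    """MLE of a table CPD P(X_child | X_parents) from complete discrete data.
    data: (N, num_vars) int array; card[v]: number of states of variable v."""
    shape = [card[p] for p in parents] + [card[child]]
    counts = np.zeros(shape)
    for row in data:
        counts[tuple(row[p] for p in parents) + (row[child],)] += 1.0
    with np.errstate(divide="ignore", invalid="ignore"):
        cpd = counts / counts.sum(axis=-1, keepdims=True)
    return cpd   # parent rows never seen in the data come out as NaN

rng = np.random.default_rng(5)
data = rng.integers(0, 2, size=(200, 4))     # columns: X1, X2, X3, X4
card = [2, 2, 2, 2]
print(mle_table_cpd(data, child=3, parents=[1, 2], card=card))  # P(X4 | X2, X3)
print(mle_table_cpd(data, child=0, parents=[], card=card))      # P(X1)
```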

Overfitting
For a large number of parent configurations ∏_{Z ∈ Pa_{X_i}} |Val(Z)|, most buckets will contain very few instances.
The number of parameters increases exponentially with |Pa_{X_i}|, leading to poor estimation.
With limited data, overfitting occurs; we often get better generalization with simpler structures.
Zero-count problem (the sparse-data problem): e.g., consider a naïve Bayes classifier and an e-mail d containing a word that does not occur in any training document of class c; then P(d | c) = 0.

ML example: naïve Bayes classifier
Y ∈ {1, ..., C}, with class prior π and per-feature parameters θ_i = θ_{X_i | Y} = (θ_{X_i | 1}, ..., θ_{X_i | C}), i = 1, ..., D.
Training phase (plate notation): π is the parent of each Y^(n), and θ_i together with Y^(n) are the parents of each X_i^(n), for n = 1, ..., N and i = 1, ..., D.
Test phase: Y with parent π, and X_1, ..., X_D each with parents Y and θ_1, ..., θ_D.

ML example: naïve Bayes classifier
L(θ; D) = ∏_{n=1}^{N} P(Y^(n) | π) ∏_{i=1}^{D} P(X_i^(n) | Y^(n), θ_{X_i | Y})
MLE for the class prior: π̂_c = (1/N) ∑_{n=1}^{N} I(Y^(n) = c)
Discrete inputs or features (X_i ∈ {1, ..., K}):
θ̂^i_{k|c} = p(X_i = k | Y = c) = ∑_{n} I(X_i^(n) = k, Y^(n) = c) / ∑_{n} I(Y^(n) = c)
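A short NumPy sketch of naïve Bayes ML training by counting (the discrete feature matrix, labels, and array names are synthetic/illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K, C = 300, 4, 3, 2
X = rng.integers(0, K, size=(N, D))     # discrete features in {0, ..., K-1}
Y = rng.integers(0, C, size=N)          # class labels in {0, ..., C-1}

# MLE of the class prior: pi_c = count(Y = c) / N
pi = np.bincount(Y, minlength=C) / N

# MLE of each feature CPD: theta[i, k, c] = P(X_i = k | Y = c)
theta = np.zeros((D, K, C))
for c in range(C):
    Xc = X[Y == c]
    for i in range(D):
        theta[i, :, c] = np.bincount(Xc[:, i], minlength=K) / len(Xc)

print(pi)
print(theta[0])      # CPD of the first feature given the class
```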

Recall: maximum conditional likelihood
Example: a discriminative classifier needs to learn the conditional distribution (not the joint distribution).
Given D = {(x^(n), y^(n))}_{n=1}^{N} and a parametric conditional distribution P(y | x; θ), we find:
θ̂ = argmax_θ ∏_{n=1}^{N} P(y^(n) | x^(n); θ)
We will also see maximum conditional likelihood for more general CPDs than tabular ones.

Maximum conditional likelihood: linear regression
P(y | x, w) = N(y | wᵀφ(x), σ²),   linear case: φ(x) = [1, x_1, ..., x_D]
D = {(x^(n), y^(n))}_{n=1}^{N}
L(Y | X, w; D) = ∏_{n=1}^{N} P(Y^(n) | X^(n), w) = ∏_{n=1}^{N} N(Y^(n) | wᵀφ(X^(n)), σ²)
ŵ = argmax_w ∏_{n=1}^{N} N(Y^(n) | wᵀφ(X^(n)), σ²) = (ΦᵀΦ)^{-1} Φᵀ Y
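A NumPy sketch of the closed-form solution (normal equations) on synthetic one-dimensional data; the feature map, weights, and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=50)
y = 2.0 - 3.0 * x + rng.normal(0, 0.1, size=50)    # synthetic linear data

Phi = np.column_stack([np.ones_like(x), x])        # phi(x) = [1, x]
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # (Phi^T Phi)^{-1} Phi^T Y
print(w_hat)                                       # approximately [2, -3]
```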

Maximum conditional likelihood: logistic regression
P(y = 1 | x) = σ(wᵀφ(x)),   linear LR: φ(x) = [1, x_1, ..., x_D],   σ(z) = 1 / (1 + exp(-z))
P(y | x, w) = Ber(y | σ(wᵀφ(x))),   D = {(x^(n), y^(n))}_{n=1}^{N},  y ∈ {0, 1}
L(Y | X, w; D) = ∏_{n=1}^{N} P(y^(n) | x^(n), w) = ∏_{n=1}^{N} Ber(y^(n) | σ(wᵀφ(x^(n))))
              = ∏_{n=1}^{N} σ(wᵀφ(x^(n)))^{y^(n)} (1 - σ(wᵀφ(x^(n))))^{1 - y^(n)}
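Unlike linear regression, this has no closed form. A hedged NumPy sketch of maximizing the conditional log-likelihood with plain gradient ascent (the data, step size, and iteration count are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-3, 3, size=(200, 1))
w_true = np.array([0.5, 2.0])                      # made-up "true" weights
Phi = np.column_stack([np.ones(len(x)), x])        # phi(x) = [1, x]
p = 1.0 / (1.0 + np.exp(-Phi @ w_true))
y = rng.binomial(1, p)                             # synthetic binary labels

w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    mu = 1.0 / (1.0 + np.exp(-Phi @ w))            # sigma(w^T phi(x^n))
    grad = Phi.T @ (y - mu)                        # gradient of the log-likelihood
    w += lr * grad / len(y)
print(w)                                           # approaches the conditional-ML estimate
```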

MLE for Bayesian networks: summary
For a BN with disjoint sets of parameters in the CPDs, the likelihood decomposes into a product of local likelihood functions, one per node; thus we optimize the likelihoods of different nodes separately.
For table CPDs, the local likelihood further decomposes into a product of multinomial likelihoods, one for each parent combination.
MLE suffers from the sparse-data (zero-count) problem.

Bayesian learning for BNs: global parameter independence
Global parameter independence assumption: P(θ) = ∏_{i ∈ V} P(θ_i),   where θ_i ≡ θ_{X_i | Pa_{X_i}}
P(θ | D) = (1/P(D)) P(D | θ) P(θ) = (1/P(D)) [∏_{i ∈ V} ∏_{n=1}^{N} P(x_i^(n) | pa_{X_i}^(n), θ_i)] ∏_{i ∈ V} P(θ_i)
         = (1/P(D)) ∏_{i ∈ V} L_i(θ_i; D) P(θ_i) = ∏_{i ∈ V} P(θ_i | D)
The posteriors of the θ_i are independent given complete data and independent priors.
(Plate view: parameter nodes θ_1, θ_2, θ_3 are parents of X_1^(n), X_2^(n), X_3^(n) respectively, inside the plate over n.)

Bayesian learning for BNs: global parameter independence
How can we determine these independencies directly from the structure of the Bayesian network when P(θ) satisfies global parameter independence?

Bayesian learning for BNs: global parameter independence
Predictive distribution:
P(x | D) = ∫ P(x | θ, D) P(θ | D) dθ = ∫ P(x | θ) P(θ | D) dθ   (instances are independent given the parameters, for complete data)
         = ∫ ∏_{i ∈ V} P(x_i | pa_{X_i}, θ_i) P(θ_i | D) dθ
         = ∏_{i ∈ V} ∫ P(x_i | pa_{X_i}, θ_i) P(θ_i | D) dθ_i
(using ∫∫ f(x) g(y) dx dy = ∫ f(x) dx ∫ g(y) dy, with θ_i ≡ θ_{X_i | Pa_{X_i}})
Solve the prediction problem for each CPD independently and then combine the results.

Local decomposition for table CPDs
The prior P(θ_{X | Pa_X}) satisfies the local parameter independence assumption if:
P(θ_{X | Pa_X}) = ∏_{u ∈ Val(Pa_X)} P(θ_{X | u})
i.e., the groups of parameters in different rows of a CPD are independent.
The posteriors P(θ_{X | u} | D) and P(θ_{X | u'} | D) are also independent, despite the v-structure at X, because Pa_X acts like a multiplexer.
(Plate view: Y^(n) ∈ {1, ..., C} and the parameters θ_{X|1}, ..., θ_{X|C} are all parents of X^(n), n = 1, ..., N.)

Global and local parameter independence
Let the data be complete and the CPDs tabular. If the prior P(θ) satisfies global and local parameter independence, then (with K = |Val(X_i)| and u ∈ Val(Pa_{X_i})):
P(θ | D) = ∏_{i ∈ V} ∏_{u} P(θ_{X_i | u} | D)
Prior: θ_{X_i | u} ~ Dir(α^i_{1|u}, ..., α^i_{K|u})
Posterior: θ_{X_i | u} | D ~ Dir(α^i_{1|u} + m_i(1, u), ..., α^i_{K|u} + m_i(K, u))
Predictive: P(X_i = k | u, D) = (α^i_{k|u} + m_i(k, u)) / ∑_{k'=1}^{K} (α^i_{k'|u} + m_i(k', u))

Priors for Bayesian learning
The priors α_{X_i | Pa_{X_i}} must be defined: α_{x|u} for all x ∈ Val(X_i), u ∈ Val(Pa_{X_i}).
K2 prior: a fixed value for all hyperparameters, α_{x|u} = 1.
We can store an equivalent sample size α and a prior distribution P' over the variables in the network: α_{x|u} = α P'(X_i = x, Pa_{X_i} = u).
BDe prior: define P' as a set of independent marginals over the X_i's.

Naïve Bayes example: Bayesian learning
Y ∈ {1, ..., C}, θ_i = θ_{X_i | Y} = (θ_{X_i | 1}, ..., θ_{X_i | C}).
Training phase (plate notation): hyperparameters α are the parent of the class prior π, and hyperparameters α_{X_i | c} are the parents of θ_{X_i | c}, c = 1, ..., C; π is the parent of Y^(n), and Y^(n) together with the θ_{X_i | c} are the parents of X_i^(n), for i = 1, ..., D and n = 1, ..., N.

Naïve Bayes example: Bayesian learning
Y ~ Mult(π)  (i.e., P(Y = c | π) = π_c)
P(π | α_1, ..., α_C) = Dir(π | α_1, ..., α_C)
P(π | D, α_1, ..., α_C) = Dir(π | α_1 + m_1, ..., α_C + m_C)
P(Y = c | D) = (α_c + m_c) / ∑_{c'=1}^{C} (α_{c'} + m_{c'})
Discrete inputs or features (X_i ∈ {1, ..., K}):
P(X_i = k | Y = c, D) = (α^i_{k|c} + m_i(k, c)) / ∑_{k'=1}^{K} (α^i_{k'|c} + m_i(k', c))
where m_c = ∑_{n=1}^{N} I(Y^(n) = c) and m_i(k, c) = ∑_{n=1}^{N} I(X_i^(n) = k, Y^(n) = c)
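The same counting sketch as in the ML example, but with Dirichlet pseudo-counts added (here a uniform α = 1, i.e., Laplace smoothing; the data are again synthetic):

```python
import numpy as np

rng = np.random.default_rng(9)
N, D, K, C = 300, 4, 3, 2
X = rng.integers(0, K, size=(N, D))
Y = rng.integers(0, C, size=N)
alpha = 1.0                                   # uniform Dirichlet pseudo-count

# Bayesian (predictive) class prior: (alpha + m_c) / (C*alpha + N)
m_c = np.bincount(Y, minlength=C)
p_class = (alpha + m_c) / (C * alpha + N)

# Bayesian feature CPDs: (alpha + m_i(k, c)) / (K*alpha + m_c)
theta = np.zeros((D, K, C))
for c in range(C):
    Xc = X[Y == c]
    for i in range(D):
        counts = np.bincount(Xc[:, i], minlength=K)
        theta[i, :, c] = (alpha + counts) / (K * alpha + len(Xc))

print(p_class)
print(theta[0])     # no zero probabilities, even for unseen (feature, class) pairs
```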

Global & local independence
For nodes with no parents, the parameters define a single distribution; Bayesian or ML learning can be used exactly as in simple density estimation on one variable.
More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of the parent variables.
The learning objective decomposes into multiple terms, one for the subset of training data matching each parent configuration; we apply independent Bayesian or ML learning to each.

Shared parameters
Sharing CPDs across nodes: P(X_i | Pa_{X_i}, θ̄) = P(X_j | Pa_{X_j}, θ̄)
Sharing parameters within a single CPD: P(X_i | u_1, θ_i^s) = P(X_i | u_2, θ_i^s)
MLE: for networks with shared CPDs, the sufficient statistics accumulate over all uses of the CPD.
Aggregate sufficient statistics: collect all the instances generated from the same conditional distribution and combine their sufficient statistics.

Markov chain
Without sharing: P(X_1, ..., X_T | θ) = P(X_1 | π) ∏_{t=2}^{T} P(X_t | X_{t-1}, A^{t-1,t})
(Chain: X_1 → X_2 → ... → X_{T-1} → X_T)
Shared parameters: P(X_1, ..., X_T | θ) = P(X_1 | π) ∏_{t=2}^{T} P(X_t | X_{t-1}, A)

Markov chain
Parameters: the initial distribution π and the K×K transition matrix A.
Initial state probability: π_i = P(X_1 = i), 1 ≤ i ≤ K
State transition probability: A_{ji} = θ_{i|j} = P(X_{t+1} = i | X_t = j), with ∑_{i=1}^{K} A_{ji} = 1, 1 ≤ i, j ≤ K

Markov chain: parameter learning by MLE
P(D | θ) = ∏_{n=1}^{N} P(X_1^(n) | π) ∏_{t=2}^{T} P(X_t^(n) | X_{t-1}^(n), A)
Â_{ji} = ∑_{n=1}^{N} ∑_{t=2}^{T} I(X_{t-1}^(n) = j, X_t^(n) = i) / ∑_{n=1}^{N} ∑_{t=2}^{T} I(X_{t-1}^(n) = j)
π̂_i = (1/N) ∑_{n=1}^{N} I(X_1^(n) = i)

Markov chain: Bayesian learning
Place a Dirichlet prior on each row of the transition matrix A (where A_{ji} = P(X_{t+1} = i | X_t = j)):
A_{j,·} ~ Dir(α_{j,1}, ..., α_{j,K})
P(X_{t+1} = i | X_t = j, D, α_{j,·}) = [∑_{n=1}^{N} ∑_{t=2}^{T} I(X_{t-1}^(n) = j, X_t^(n) = i) + α_{j,i}] / [∑_{n=1}^{N} ∑_{t=2}^{T} I(X_{t-1}^(n) = j) + ∑_{i'=1}^{K} α_{j,i'}]
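A NumPy sketch of both estimators: count transitions over synthetic sequences (generated from a made-up chain), then normalize with and without Dirichlet pseudo-counts:

```python
import numpy as np

rng = np.random.default_rng(10)
K, T, Nseq = 3, 20, 50
pi_true = np.array([0.6, 0.3, 0.1])
A_true = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.3, 0.4]])

# Sample synthetic sequences from the made-up chain
seqs = np.empty((Nseq, T), dtype=int)
for n in range(Nseq):
    seqs[n, 0] = rng.choice(K, p=pi_true)
    for t in range(1, T):
        seqs[n, t] = rng.choice(K, p=A_true[seqs[n, t - 1]])

# Aggregate sufficient statistics: transition counts over all sequences and steps
counts = np.zeros((K, K))
for seq in seqs:
    for j, i in zip(seq[:-1], seq[1:]):
        counts[j, i] += 1

A_mle = counts / counts.sum(axis=1, keepdims=True)           # MLE
alpha = np.ones((K, K))                                       # illustrative Dirichlet prior
A_bayes = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
pi_mle = np.bincount(seqs[:, 0], minlength=K) / Nseq
print(A_mle.round(2))
print(A_bayes.round(2))
print(pi_mle)
```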

Hyperparameters in the Bayesian approach
So far we have considered either parameter independencies or parameter sharing.
Hierarchical priors give us a flexible language for introducing dependencies among the priors over parameters.
They are useful when we have a small number of examples relevant to each parameter but believe that some parameters are reasonably similar; in such situations, hierarchical priors spread the effect of the observations between parameters with shared hyperparameters.
However, when we use hyperparameters we transform our learning problem into one that includes a hidden variable.

Hyperparameters: example [Koller & Friedman book]
The effect of the prior is to shift both θ_{Y | x^0} and θ_{Y | x^1} to be closer to each other.

Hyperparameters: example [Koller & Friedman book]
This is useful if we believe that two variables Y and Z depend on X in a similar (but not identical) way.

References
Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, Chapter 17.