Graphical Models. Unit 11. Machine Learning University of Vienna


Graphical Models
- Bayesian networks (directed graphs)
- The variable elimination algorithm
- Approximate inference (the Gibbs sampler)
- Markov networks (undirected graphs)
- Markov random fields (MRFs)
- Hidden Markov models (HMMs)
- The Viterbi algorithm
- The Kalman filter

Simple graphical model 1
A graph is a set of nodes (vertices) together with the links (edges or arcs) between them, which can be either directed or undirected. If two nodes are not linked, then those two variables are independent. The arrows denote causal relationships between the nodes, which represent features. The probability of A and B is the same as the probability of A times the probability of B conditioned on A:
P(a, b) = P(b|a)P(a)

Simple graphical model 2
The nodes are separated into:
- observed nodes, whose values we can see directly
- hidden or latent nodes, whose values we hope to infer and which may not have a clear meaning in all cases
C is conditionally independent of B, given A.

Example: Exam Panic
Directed acyclic graphs (DAGs) paired with conditional probability tables are called Bayesian networks.
B - denotes a node stating whether the exam was boring
R - whether or not you revised
A - whether or not you attended lectures
P - whether or not you will panic before the exam

Example: Exam Panic

P(b) = 0.5, P(¬b) = 0.5

B   P(r|B)   P(¬r|B)
T   0.3      0.7
F   0.8      0.2

B   P(a|B)   P(¬a|B)
T   0.1      0.9
F   0.5      0.5

R   A   P(p|R,A)   P(¬p|R,A)
T   T   0          1
T   F   0.8        0.2
F   T   0.6        0.4
F   F   1          0

The probability of panicking:
P(p) = Σ_{b,r,a} P(b, r, a, p) = Σ_{b,r,a} P(b) P(r|b) P(a|b) P(p|r, a)

Example: Exam Panic
Suppose you know that the course was boring, and want to work out how likely it is that you will panic before the exam:
P(p|b) = 0.3·0.1·0 + 0.7·0.1·0.6 + 0.3·0.9·0.8 + 0.7·0.9·1 = 0.888
Suppose you know that the course was not boring, and want to work out how likely it is that you will panic before the exam:
P(p|¬b) = 0.8·0.5·0 + 0.8·0.5·0.8 + 0.2·0.5·0.6 + 0.2·0.5·1 = 0.48
P(p) = P(p|b)P(b) + P(p|¬b)P(¬b) = 0.5·0.888 + 0.5·0.48 = 0.684

Backward inference or diagnosis
Suppose you panic outside the exam. Why are you panicking - is it because you didn't come to the lectures, or because you didn't revise?
Bayes' rule:
P(r|p) = P(p|r)P(r) / P(p) = Σ_{b,a} P(b, a, r, p) / P(p)
= [0.5·0.3·(0.1·0 + 0.9·0.8) + 0.5·0.8·(0.5·0 + 0.5·0.8)] / P(p)
= 0.268 / 0.684 = 0.3918
Bayes' rule is the reason why this type of graphical model is known as a Bayesian network.
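To make the arithmetic above easy to check, here is a minimal Python sketch (an illustration, not part of the lecture) that enumerates the joint distribution of the exam network; the CPT dictionaries and helper names are my own, and the printed values should reproduce P(p|b) = 0.888, P(p|¬b) = 0.48, P(p) = 0.684 and P(r|p) ≈ 0.392.

```python
from itertools import product

P_B = 0.5
P_R = {True: 0.3, False: 0.8}                      # P(r | b)
P_A = {True: 0.1, False: 0.5}                      # P(a | b)
P_P = {(True, True): 0.0, (True, False): 0.8,      # P(p | r, a)
       (False, True): 0.6, (False, False): 1.0}

def joint(b, r, a, p):
    """P(b, r, a, p) = P(b) P(r|b) P(a|b) P(p|r, a)."""
    prob = P_B if b else 1 - P_B
    prob *= P_R[b] if r else 1 - P_R[b]
    prob *= P_A[b] if a else 1 - P_A[b]
    prob *= P_P[(r, a)] if p else 1 - P_P[(r, a)]
    return prob

def marginal(**fixed):
    """Sum the joint over all variables not fixed by the keyword arguments."""
    total = 0.0
    for b, r, a, p in product([True, False], repeat=4):
        values = {"b": b, "r": r, "a": a, "p": p}
        if all(values[k] == v for k, v in fixed.items()):
            total += joint(b, r, a, p)
    return total

print(marginal(p=True, b=True) / marginal(b=True))    # P(p | b)   = 0.888
print(marginal(p=True, b=False) / marginal(b=False))  # P(p | ~b)  = 0.48
print(marginal(p=True))                               # P(p)       = 0.684
print(marginal(r=True, p=True) / marginal(p=True))    # P(r | p)   = 0.3918...
```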

Computational costs
For a graph with N nodes, where each node can be either true or false, the computational cost is O(2^N). The problem of exact inference on Bayesian networks is NP-hard. For polytrees, where there is at most one path between any two nodes, the computational cost is linear in the size of the network. Unfortunately, it is rare to find such polytrees in real examples, so we will consider approximate inference.

Variable Elimination Algorithm
The variable elimination algorithm speeds things up a little by minimising the amount of repeated computation (program loops). The conditional probability tables are converted into λ tables, which simply list all of the possible values for all variables and which initially contain the conditional probabilities:

R   A   P   λ
T   T   T   0
T   T   F   1
T   F   T   0.8
T   F   F   0.2
F   T   T   0.6
F   T   F   0.4
F   F   T   1
F   F   F   0

Variable Elimination Algorithm
To eliminate R from the graph we do the following calculation:

B   R   λ
T   T   0.3
T   F   0.7
F   T   0.8
F   F   0.2

R   A   λ
T   T   0
T   F   0.8
F   T   0.6
F   F   1

B   A   λ
T   T   0.3·0 + 0.7·0.6 = 0.42
T   F   0.3·0.8 + 0.7·1 = 0.94
F   T   0.8·0 + 0.2·0.6 = 0.12
F   F   0.8·0.8 + 0.2·1 = 0.84

Variable Elimination Algorithm I
create the λ tables:
- for each variable v:
  * make a new table
  * for all possible truth assignments x of the parent variables:
    - add rows for P(v|x) and 1 - P(v|x) to the table
  * add this table to the set of tables
eliminate the known variable v:
- for each table:
  * remove the rows where v is incorrect
  * remove the column for v from the table

Variable Elimination Algorithm II
eliminate the other variables (where x is the variable to keep):
- for each variable v to be eliminated:
  * create a new table t'
  * for each table t containing v:
    v_true,t = v_true,t · P(v|x)
    v_false,t = v_false,t · P(¬v|x)
  * v_true,t' = Σ_t v_true,t
  * v_false,t' = Σ_t v_false,t
  * replace the tables t with the new table t'
calculate the conditional probability:
- for each table:
  * x_true = x_true · P(x)
  * x_false = x_false · P(¬x)
  * the probability is x_true / (x_true + x_false)
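As a concrete illustration of one elimination step (a sketch under the assumptions above, not the lecture's code), the following Python fragment combines the two λ tables from the worked example to eliminate R; it reproduces the B, A table with entries 0.42, 0.94, 0.12 and 0.84.

```python
from itertools import product

# lambda table for R given B: lam_r[(b, r)] = P(R=r | B=b)
lam_r = {(True, True): 0.3, (True, False): 0.7,
         (False, True): 0.8, (False, False): 0.2}
# lambda table for P=true given R and A: lam_p[(r, a)] = P(p | R=r, A=a)
lam_p = {(True, True): 0.0, (True, False): 0.8,
         (False, True): 0.6, (False, False): 1.0}

# eliminate R: for every (B, A) combination, sum over the values of R,
# multiplying the entries of the tables that contain R
lam_ba = {}
for b, a in product([True, False], repeat=2):
    lam_ba[(b, a)] = sum(lam_r[(b, r)] * lam_p[(r, a)] for r in [True, False])

for (b, a), value in lam_ba.items():
    print(b, a, round(value, 2))   # 0.42, 0.94, 0.12, 0.84
```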

The Markov Chain Monte Carlo (MCMC) methods
sample from the hidden variables:
- start at the top of the graph
- sample from each of the known probability distributions
weight the samples by their likelihoods
In our example:
- generate a sample from P(b)
- use that value in the conditional probability tables for R and A to compute P(r | b = sample value) and P(a | b = sample value)
- use these three values to sample from P(p | b, a, r); take as many samples as you like in this way
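A minimal Python sketch of this sampling procedure for the exam network (illustrative only; the CPT dictionaries mirror the tables given earlier), whose estimate of P(p) should land close to the exact value 0.684:

```python
import random

# CPTs of the exam network
P_B = 0.5
P_R = {True: 0.3, False: 0.8}                      # P(r | b)
P_A = {True: 0.1, False: 0.5}                      # P(a | b)
P_P = {(True, True): 0.0, (True, False): 0.8,      # P(p | r, a)
       (False, True): 0.6, (False, False): 1.0}

def sample_once():
    # walk the graph from top to bottom, sampling each node given its parents
    b = random.random() < P_B
    r = random.random() < P_R[b]
    a = random.random() < P_A[b]
    p = random.random() < P_P[(r, a)]
    return p

samples = [sample_once() for _ in range(100_000)]
print(sum(samples) / len(samples))   # roughly 0.684
```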

Gibbs sampling
In MCMC we have to work through the graph from top to bottom and select rows from the conditional probability tables that match the previous case. It is better to sample from the unconditional distribution and reject any samples that don't have the correct prior probability (rejection sampling). We can work out what evidence we already have and use it to assign likelihoods to the other variables that are sampled.
- set values for all of the possible probabilities, based on either evidence or random choices
- find the probability distribution with Gibbs sampling

Gibbs sampling
The probabilities in the network are:
p(x) = Π_j p(x_j | x_{α_j}),
where x_{α_j} are the parent nodes of x_j. In a Bayesian network, any given variable is independent of any node that is not its child, given its parents:
p(x_j | x_{-j}) ∝ p(x_j | x_{α_j}) Π_{k ∈ β(j)} p(x_k | x_{α(k)}),
where β(j) is the set of children of node x_j and x_{-j} signifies all values of x_i except x_j. For any node we only need to consider its parents, its children, and the other parents of its children. This is known as the Markov blanket of the node.

The Gibbs Sampler
for each variable x_j:
- initialise x_j^(0)
repeat:
- sample x_1^(i+1) from p(x_1 | x_2^(i), ..., x_n^(i))
- sample x_2^(i+1) from p(x_2 | x_1^(i+1), x_3^(i), ..., x_n^(i))
- ...
- sample x_n^(i+1) from p(x_n | x_1^(i+1), ..., x_{n-1}^(i+1))
until you have enough samples
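To connect this to the running example, here is a small Python sketch (my own illustration) of a Gibbs sampler for the exam network with the evidence P = true clamped; each unobserved node is resampled from its distribution given its Markov blanket, and the fraction of samples with R = true should approach the exact P(r|p) ≈ 0.39 computed earlier. Burn-in length and sample count are arbitrary choices.

```python
import random

# CPTs of the exam network
P_B = 0.5
P_R = {True: 0.3, False: 0.8}                      # P(r | b)
P_A = {True: 0.1, False: 0.5}                      # P(a | b)
P_P = {(True, True): 0.0, (True, False): 0.8,      # P(p | r, a)
       (False, True): 0.6, (False, False): 1.0}

def bernoulli(w_true, w_false):
    """Sample True with probability w_true / (w_true + w_false)."""
    return random.random() < w_true / (w_true + w_false)

def sample_b(r, a):
    # p(b | r, a, p) is proportional to P(b) P(r|b) P(a|b)
    w_t = P_B * (P_R[True] if r else 1 - P_R[True]) * (P_A[True] if a else 1 - P_A[True])
    w_f = (1 - P_B) * (P_R[False] if r else 1 - P_R[False]) * (P_A[False] if a else 1 - P_A[False])
    return bernoulli(w_t, w_f)

def sample_r(b, a, p):
    # p(r | b, a, p) is proportional to P(r|b) P(p|r, a)
    w_t = P_R[b] * (P_P[(True, a)] if p else 1 - P_P[(True, a)])
    w_f = (1 - P_R[b]) * (P_P[(False, a)] if p else 1 - P_P[(False, a)])
    return bernoulli(w_t, w_f)

def sample_a(b, r, p):
    # p(a | b, r, p) is proportional to P(a|b) P(p|r, a)
    w_t = P_A[b] * (P_P[(r, True)] if p else 1 - P_P[(r, True)])
    w_f = (1 - P_A[b]) * (P_P[(r, False)] if p else 1 - P_P[(r, False)])
    return bernoulli(w_t, w_f)

b, r, a = True, True, False            # arbitrary initial state; p = True is the clamped evidence
burn_in, n_samples, r_count = 1000, 50_000, 0
for i in range(burn_in + n_samples):
    b = sample_b(r, a)
    r = sample_r(b, a, True)
    a = sample_a(b, r, True)
    if i >= burn_in:
        r_count += r
print(r_count / n_samples)             # roughly 0.39
```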

Markov Random Fields (MRF): image denoising
Markov property: the state of a particular node is a function only of the states of its immediate neighbours.
A binary image I with pixel values I_{x_i,x_j} ∈ {-1, 1} has noise in it. We want to recover an ideal image I'_{x_i,x_j} that has no noise. If the noise is small, then there should be a good correlation between I_{x_i,x_j} and I'_{x_i,x_j}. Assume also that within a small patch or region of an image there is a good correlation between pixels: I_{x_i,x_j} should correlate well with I_{x_i+1,x_j}, I_{x_i,x_j-1}, etc.

Ising model
The original theory of MRFs was worked out by physicists in the form of the Ising model: a statistical description of a set of atoms connected in a chain, where each can spin up (+1) or down (-1) and whose spin affects those connected to it in the chain. Physicists tend to think of the energy of such systems. Stable states are those with the lowest energy, since the system needs to get extra energy if it wants to move out of such a state.

Markov Random Fields (MRF): image denoising
The energy of our pair of images must be low when the pixels match. The energy of the same pixel in the two images is -η I_{x_i,x_j} I'_{x_i,x_j}, where η is a positive constant. The energy of two neighbouring pixels is -ζ I_{x_i,x_j} I_{x_i+1,x_j}. The total energy is
E(I, I') = -η Σ_{i,j=1}^{N} I_{x_i,x_j} I'_{x_i,x_j} - ζ Σ_{i,j=1}^{N} I_{x_i,x_j} I_{x_i±1,x_j±1},
where the index of the pixels is assumed to run from 1 to N in both the x and y directions.

The Markov Random Field Image Denoising Algorithm
given a noisy image I and an initial estimate I' of the clean image (e.g. a copy of I), together with parameters η, ζ:
loop over the pixels of image I':
- compute the energies with the current pixel being -1 and 1
- pick the one with the lower energy and set the pixel's value in I' accordingly
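A minimal numpy sketch of this greedy, pixel-by-pixel energy minimisation (an illustration; the pairing of η with the data term and ζ with the neighbour term follows the energy defined above, and the number of sweeps is an arbitrary choice):

```python
import numpy as np

def mrf_denoise(I, eta=2.1, zeta=1.5, sweeps=5):
    """Greedy MRF denoising of a binary image I with pixel values in {-1, +1}."""
    J = I.copy()                        # reconstruction, initialised to the noisy image
    rows, cols = I.shape
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                # sum of the 4-neighbourhood in the current reconstruction
                neighbours = 0.0
                if r > 0:        neighbours += J[r - 1, c]
                if r < rows - 1: neighbours += J[r + 1, c]
                if c > 0:        neighbours += J[r, c - 1]
                if c < cols - 1: neighbours += J[r, c + 1]
                # local energy for the two candidate values of this pixel
                e_plus  = -eta * I[r, c] - zeta * neighbours    # pixel set to +1
                e_minus = +eta * I[r, c] + zeta * neighbours    # pixel set to -1
                J[r, c] = 1 if e_plus < e_minus else -1
    return J
```

With η = 2.1 and ζ = 1.5 this removes most isolated noise pixels, at the cost of smoothing fine structure, as in the world-map example on the next slide.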

MRF example: a world map
Using the MRF image denoising algorithm with η = 2.1 and ζ = 1.5 on a map of the world corrupted by 10% uniformly distributed random noise (left) gives the image on the right, which has about 4% error, although it has smoothed out the edges of all the continents.

Hidden Markov Models (HMMs)
The Hidden Markov Model is one of the most popular graphical models. It is used in speech processing and in a lot of statistical work. The HMM generally works on a set of temporal data. At each clock tick the system moves into a new state, which can be the same as the previous one. You see observations that do not uniquely identify the state; this is where the "hidden" in the title comes from. The HMM is the simplest dynamic Bayesian network. It is generally assumed that the Markov chain is ergodic: there is a non-zero probability of reaching every state eventually, no matter what the starting state.

Hidden Markov Models (HMMs)
There are four things that you can do in the evening: go to the pub, watch TV, go to a party, or study (the hidden states). What I can observe is whether you look tired, hungover, scared or fine (the observations). I don't know why you look the way you do, but I can guess by assigning probabilities to those things.

Hidden Markov Models (HMMs)
The HMM itself is made up of the transition probabilities a_ij and the observation probabilities b_jk:
Σ_j a_ij = 1,  Σ_k b_jk = 1

Transition probabilities (rows: previous night)
         TV     Pub    Party   Study
TV       0.4    0.3    0.1     0.2
Pub      0.6    0.05   0.1     0.25
Party    0.7    0.05   0.05    0.2
Study    0.3    0.4    0.25    0.05

Observation probabilities
         Tired   Hungover   Scared   Fine
TV       0.2     0.1        0.2      0.5
Pub      0.4     0.2        0.1      0.3
Party    0.3     0.4        0.2      0.1
Study    0.3     0.05       0.3      0.35

Hidden Markov Models (HMMs)
After a couple of weeks of observations there are three things that I want to do with the data:
- see how well the sequence of observations that I've made matches my current HMM
- work out the most probable sequence of states that you've been in, based on my observations
- given several sets of observations (for example, by watching several students), generate a good HMM for the data

The Forward Algorithm
Suppose I see the following observations:
O = (tired, tired, fine, hungover, hungover, scared, hungover, fine)
The probability that my observations O = {o(1), ..., o(T)} come from the model can be computed using simple conditional probability:
P(O) = Σ_{r=1}^{R} P(O | Ω_r) P(Ω_r)
The index r describes a possible sequence of states, so Ω_1 is one sequence, Ω_2 another, and so on.

The Forward Algorithm
We use the Markov property:
P(Ω_r) = Π_{t=1}^{T} P(ω_j(t) | ω_i(t-1)) = Π_{t=1}^{T} a_ij
P(O | Ω_r) = Π_{t=1}^{T} P(o_k(t) | ω_j(t)) = Π_{t=1}^{T} b_jk
P(O) = Σ_{r=1}^{R} Π_{t=1}^{T} P(o_k(t) | ω_j(t)) P(ω_j(t) | ω_i(t-1)) = Σ_{r=1}^{R} Π_{t=1}^{T} b_jk a_ij

The Forward Trellis
A new variable α_j(t) describes the probability that at time t the state is ω_j and the first (t-1) steps all matched the observations o(t):
α_j(t) = 0                              if t = 0 and j ≠ initial state
       = 1                              if t = 0 and j = initial state
       = Σ_i α_i(t-1) a_ij b_j(o_t)     otherwise

The Forward Trellis
α_TV(0) = 0.25, α_Pub(0) = 0.25, α_Party(0) = 0.25, α_Study(0) = 0.25
α_TV(1) = (α_TV(0) a_TV,TV + α_Pub(0) a_Pub,TV + α_Party(0) a_Party,TV + α_Study(0) a_Study,TV) b_TV,Tired
        = (0.25·0.4 + 0.25·0.3 + 0.25·0.1 + 0.25·0.2)·0.2 = 0.05
α_Pub(1) = (α_TV(0) a_TV,Pub + α_Pub(0) a_Pub,Pub + α_Party(0) a_Party,Pub + α_Study(0) a_Study,Pub) b_Pub,Tired
         = (0.25·0.6 + 0.25·0.05 + 0.25·0.1 + 0.25·0.25)·0.4 = 0.1

The Forward Trellis
α_Party(1) = (α_TV(0) a_TV,Party + α_Pub(0) a_Pub,Party + α_Party(0) a_Party,Party + α_Study(0) a_Study,Party) b_Party,Tired
           = (0.25·0.7 + 0.25·0.05 + 0.25·0.05 + 0.25·0.2)·0.3 = 0.075
α_Study(1) = (α_TV(0) a_TV,Study + α_Pub(0) a_Pub,Study + α_Party(0) a_Party,Study + α_Study(0) a_Study,Study) b_Study,Tired
           = (0.25·0.3 + 0.25·0.4 + 0.25·0.25 + 0.25·0.05)·0.3 = 0.075

The HMM Forward Algorithm
for each observation in order o_t, t = 1, ..., T:
- for each possible state s:
  α_s(t) = b_s(o_t) Σ_x (α_x(t-1) a_x,s)
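An illustrative numpy version of this recursion, using the transition and observation tables from the earlier slide (not the lecture's code). The convention assumed here is a[i, j] = P(state j tonight | state i the previous night), matching the normalisation Σ_j a_ij = 1, and α(0) is taken to be uniform as in the trellis example:

```python
import numpy as np

states = ["TV", "Pub", "Party", "Study"]
looks = ["Tired", "Hungover", "Scared", "Fine"]

A = np.array([[0.4, 0.3, 0.1, 0.2],      # transition probabilities a[previous, next]
              [0.6, 0.05, 0.1, 0.25],
              [0.7, 0.05, 0.05, 0.2],
              [0.3, 0.4, 0.25, 0.05]])
B = np.array([[0.2, 0.1, 0.2, 0.5],      # observation probabilities b[state, observation]
              [0.4, 0.2, 0.1, 0.3],
              [0.3, 0.4, 0.2, 0.1],
              [0.3, 0.05, 0.3, 0.35]])

def forward(obs, A, B, alpha0):
    """Forward algorithm: alpha[t, s] for t = 0..T and the sequence likelihood P(O)."""
    T, S = len(obs), A.shape[0]
    alpha = np.zeros((T + 1, S))
    alpha[0] = alpha0
    for t in range(1, T + 1):
        # alpha_s(t) = b_s(o_t) * sum_x alpha_x(t-1) * a_{x,s}
        alpha[t] = B[:, obs[t - 1]] * (alpha[t - 1] @ A)
    return alpha, alpha[-1].sum()

# observation sequence from the slides
O = [looks.index(o) for o in
     ["Tired", "Tired", "Fine", "Hungover", "Hungover", "Scared", "Hungover", "Fine"]]
alpha, likelihood = forward(O, A, B, np.full(4, 0.25))
print(likelihood)
```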

The Viterbi Algorithm
At each timestep we pick the state that is most likely as the next step in the path, rather than maintaining probabilities of all possible paths.
for each observation in order o_t, t = 1, ..., T:
- for each possible state s:
  v_s(t) = max_x (v_x(t-1) a_x,s b_s(o_t))
- path(t) = arg max_x (v_x(t))
So path(1) = Pub.
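A matching numpy sketch of the Viterbi recursion with back-pointers (illustrative; it assumes the same A, B, O and indexing convention as the forward sketch above):

```python
import numpy as np

def viterbi(obs, A, B, alpha0):
    """Most probable state sequence (as state indices) for a sequence of observation indices."""
    T, S = len(obs), A.shape[0]
    v = np.zeros((T, S))                 # v[t, s]: probability of the best path ending in s at time t
    back = np.zeros((T, S), dtype=int)   # back[t, s]: best predecessor of s at time t
    v[0] = alpha0 * B[:, obs[0]]
    for t in range(1, T):
        # scores[x, s] = v_x(t-1) * a_{x,s} * b_s(o_t)
        scores = v[t - 1][:, None] * A * B[None, :, obs[t]]
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0)
    # follow the back-pointers from the most likely final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# e.g. viterbi(O, A, B, np.full(4, 0.25)) with A, B, O from the forward sketch
```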

The Baum-Welch or Forward-Backward Algorithm
The unsupervised learning problem is to generate the HMM from sets of observations. We complement the forward algorithm with a variable β that takes us backwards through the HMM, i.e. β_i(t) tells us the probability that at time t we are in state ω_i and the rest of the target sequence (times t+1 to T) will be generated correctly:
β_i(t) = 0                                 if t = T and i ≠ final state
       = 1                                 if t = T and i = final state
       = Σ_j β_j(t+1) a_ij b_j(o_{t+1})    otherwise
We can run backwards through the HMM from the known end point.

The Backward Trellis
β_TV(8) = 0.25, β_Pub(8) = 0.25, β_Party(8) = 0.25, β_Study(8) = 0.25
β_TV(7) = β_TV(8) a_TV,TV b_TV,Fine + β_Pub(8) a_TV,Pub b_Pub,Fine + β_Party(8) a_TV,Party b_Party,Fine + β_Study(8) a_TV,Study b_Study,Fine
        = 0.25·0.4·0.5 + 0.25·0.3·0.3 + 0.25·0.1·0.1 + 0.25·0.2·0.35 = 0.0925
β_Pub(7) = β_TV(8) a_Pub,TV b_TV,Fine + β_Pub(8) a_Pub,Pub b_Pub,Fine + β_Party(8) a_Pub,Party b_Party,Fine + β_Study(8) a_Pub,Study b_Study,Fine
         = 0.25·0.6·0.5 + 0.25·0.05·0.3 + 0.25·0.1·0.1 + 0.25·0.25·0.35 = 0.103125

The Backward Trellis
β_Party(7) = β_TV(8) a_Party,TV b_TV,Fine + β_Pub(8) a_Party,Pub b_Pub,Fine + β_Party(8) a_Party,Party b_Party,Fine + β_Study(8) a_Party,Study b_Study,Fine
           = 0.25·0.7·0.5 + 0.25·0.05·0.3 + 0.25·0.05·0.1 + 0.25·0.2·0.35 = 0.11
β_Study(7) = β_TV(8) a_Study,TV b_TV,Fine + β_Pub(8) a_Study,Pub b_Pub,Fine + β_Party(8) a_Study,Party b_Party,Fine + β_Study(8) a_Study,Study b_Study,Fine
           = 0.25·0.3·0.5 + 0.25·0.4·0.3 + 0.25·0.25·0.1 + 0.25·0.05·0.35 = 0.078125
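An illustrative numpy version of the backward recursion; called with the A, B and O of the forward sketch and β(8) = 0.25 for every state, the second-to-last row reproduces the values 0.0925, 0.103125, 0.11 and 0.078125 above:

```python
import numpy as np

def backward(obs, A, B, beta_T):
    """Backward algorithm: beta[t, i] for t = 0..T, given the final values beta[T]."""
    T, S = len(obs), A.shape[0]
    beta = np.zeros((T + 1, S))
    beta[T] = beta_T
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j beta_j(t+1) * a_{ij} * b_j(o_{t+1})
        beta[t] = A @ (beta[t + 1] * B[:, obs[t]])
    return beta

# e.g. backward(O, A, B, np.full(4, 0.25))[7]  ->  [0.0925, 0.103125, 0.11, 0.078125]
```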

The Baum-Welch or Forward-Backward Algorithm
We can use these forward and backward estimates to compute transition probabilities. Suppose we want to compute the probability of a transition between state ω_i at time t and ω_j at time t+1. We run our current model forwards via α to get to state ω_i at time t, and run backwards via β to get to state ω_j at time t+1. Then we use the current estimates of a_ij and b_jk. We normalise this calculation by how likely this particular training sequence is according to the current model, which is P(O | a_ij, b_jk). This value is usually called γ_ij:
γ_ij(t) = α_i(t-1) a_ij b_jk β_j(t) / P(O | a_ij, b_jk)

The update rule for transition probabilities
Σ_{t=1}^{T} γ_ij(t) tells us how many times we can expect to transition from state ω_i to state ω_j at any time in the sequence. We need to divide this number by the number of times we expect to transition out of state ω_i, regardless of where we end up, which is Σ_{t=1}^{T} Σ_m γ_im(t).
The update rule for a_ij:
a_ij = Σ_{t=1}^{T} γ_ij(t) / Σ_{t=1}^{T} Σ_m γ_im(t)

The update rule for observation probabilities
We need to think about the frequency with which an observation o_k is made in state j, compared to any other symbol:
b_jk = Σ_{t=1, o(t)=o_k}^{T} Σ_m γ_jm(t) / Σ_{t=1}^{T} Σ_m γ_jm(t)

The HMM Baum-Welch Algorithm
while the updates have not converged:
- E-step:
  - compute the forward and backward steps (α and β)
  - for each observation in order o_t, t = 1, ..., T:
    * for each possible pair of states s and σ:
      γ_σ,s,t = α_σ,t a_σ,s β_s,t+1 b_s,o(t+1) / max_x (α_x,T-1)

The HMM Baum-Welch Algorithm
- M-step:
  - for each possible pair of states s and σ:
    * a_s,σ = Σ_t γ_s,σ,t / Σ_y Σ_t γ_s,y,t
  - for each observation o:
    * for each state s:
      tally_t = Σ_σ γ_s,σ,t
      b_s,o = sum(tally_t where observation o was seen) / total tally
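For completeness, here is a compact numpy sketch of one standard EM formulation of Baum-Welch (my own illustration; it normalises γ by the sequence likelihood P(O) rather than the max-α shortcut in the pseudocode above, and the random initialisation and iteration count are arbitrary choices):

```python
import numpy as np

def baum_welch(obs, S, K, n_iter=50, seed=0):
    """Estimate HMM parameters (pi, A, B) from one sequence of observation symbols 0..K-1."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    T = len(obs)
    pi = np.full(S, 1.0 / S)
    A = rng.random((S, S)); A /= A.sum(axis=1, keepdims=True)   # random row-stochastic guesses
    B = rng.random((S, K)); B /= B.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: forward and backward passes
        alpha = np.zeros((T, S)); beta = np.zeros((T, S))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()

        # expected transitions xi[t, i, j] and state occupancies gamma[t, i]
        xi = np.zeros((T - 1, S, S))
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood
        gamma = alpha * beta / likelihood

        # M-step: re-estimate the parameters from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(K):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# e.g. pi, A, B = baum_welch(O, S=4, K=4) with the observation indices O from the forward sketch
```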

Tracking Methods: The Kalman Filter
The state, which is hidden, consists of the variables that we want to know; we see them through noisy observations over time. The filter
- makes an estimate of the next step,
- computes an error term based on the value that was actually produced in the next step and tries to correct it,
- then uses both of those to make the next prediction, and iterates this procedure.

The Kalman Filter
The process is linear and all of the distributions are Gaussian with constant covariances Q and R: x ~ N(0, 1). Then the following holds:
The transition model (A): P(x_{t+1} | x_t) = N(x_{t+1} | A x_t, Q)
The observation model (H): P(z_{t+1} | x_{t+1}) = N(z_{t+1} | H x_{t+1}, R)
Predicted observation: ẑ_{t+1} = H A x_t
The error: z_{t+1} - H A x_t
Σ_t is the covariance matrix of x_t; Σ̂_{t+1} = A Σ_t A^T + Q is the predicted covariance matrix of x_{t+1}

The Kalman gain
The Kalman filter weights these error computations by how much trust the filter currently has in its predictions:
K_{t+1} = Σ̂_{t+1} H^T (H Σ̂_{t+1} H^T + R)^{-1}
The update for the estimate is
x_{t+1} = x̂_{t+1} + K_{t+1} (z_{t+1} - H x̂_{t+1})
The update of the covariance matrix:
Σ_{t+1} = (I - K_{t+1} H) Σ̂_{t+1}

The Kalman Filter Algorithm
given an initial estimate x(0):
for each timestep:
- predict the next step
  * predict the state: x̂_{t+1} = A x_t
  * predict the covariance: Σ̂_{t+1} = A Σ_t A^T + Q
- update the estimate
  * compute the error in the estimate: ε = z_{t+1} - H x̂_{t+1}
  * compute the Kalman gain: K_{t+1} = Σ̂_{t+1} H^T (H Σ̂_{t+1} H^T + R)^{-1}
  * update the state: x_{t+1} = x̂_{t+1} + K_{t+1} (z_{t+1} - H x̂_{t+1})
  * update the covariance: Σ_{t+1} = (I - K_{t+1} H) Σ̂_{t+1}

Tracking problem
y is the position and ẏ the velocity of the object; the state is x_t = (y, ẏ)^T.
The update equation:
x_{t+1} = A x_t + B a_{t+1},
where the acceleration a_t is N(0, σ) and
A = [1 Δt; 0 1],  B = [Δt²/2; Δt]
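A small numpy sketch of this constant-velocity tracking setup (an illustration; the noise levels, initial conditions and trajectory length are made-up demo values), using the predict/update equations above with H = (1 0), i.e. only the position is observed:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, sigma_a, sigma_z = 1.0, 0.2, 1.0             # time step, acceleration noise, observation noise (assumed)

A = np.array([[1.0, dt], [0.0, 1.0]])            # state transition for x = (position, velocity)
B_a = np.array([[0.5 * dt ** 2], [dt]])          # how a random acceleration enters the state
H = np.array([[1.0, 0.0]])                       # we observe the position only
Q = B_a @ B_a.T * sigma_a ** 2                   # process noise covariance
R = np.array([[sigma_z ** 2]])                   # observation noise covariance

# simulate a true trajectory and noisy position measurements
x_true = np.array([0.0, 1.0])
zs, truths = [], []
for _ in range(50):
    x_true = A @ x_true + (B_a * rng.normal(0, sigma_a)).ravel()
    truths.append(x_true.copy())
    zs.append(H @ x_true + rng.normal(0, sigma_z))

# Kalman filter: predict, then correct with the measurement
x_est = np.array([0.0, 0.0])
P = np.eye(2)
for z in zs:
    x_pred = A @ x_est                                        # predict the state
    P_pred = A @ P @ A.T + Q                                  # predict the covariance
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)    # Kalman gain
    x_est = x_pred + (K @ (z - H @ x_pred)).ravel()           # update the state
    P = (np.eye(2) - K @ H) @ P_pred                          # update the covariance
print(x_est, truths[-1])                                      # filtered estimate vs. true final state
```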

Example: Tracking problem