Graphical Models. Andrea Passerini, Statistical relational learning.


1 Graphical Models. Andrea Passerini, Statistical relational learning.

2 Probability distributions. Bernoulli distribution. Two possible values (outcomes): 1 (success), 0 (failure). Parameter: p, the probability of success. Probability mass function:
P(x; p) = p if x = 1, 1 − p if x = 0
Example: tossing a coin, with head (success) and tail (failure) as possible outcomes; p is the probability of head.

3 Probability distributions. Multinomial distribution (one sample). Models the probability of a certain outcome for an event with m possible outcomes {v_1, ..., v_m}. Parameters: p_1, ..., p_m, the probability of each outcome. Probability mass function: P(v_i; p_1, ..., p_m) = p_i. Example: tossing a die, where m is the number of faces and p_i is the probability of obtaining face i.

4 Probability distributions. Gaussian (or normal) distribution. Bell-shaped curve. Parameters: µ mean, σ² variance. Probability density function:
p(x; µ, σ) = 1/(√(2π) σ) exp(−(x − µ)² / (2σ²))
Standard normal distribution: N(0, 1). Standardization of a normal distribution N(µ, σ²): z = (x − µ)/σ
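As a quick illustration (not from the slides; a minimal numpy sketch), evaluating this density and standardizing samples:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Standardization: z-scores of samples from N(mu, sigma^2) follow N(0, 1)
mu, sigma = 5.0, 2.0
x = np.random.normal(mu, sigma, size=1000)
z = (x - mu) / sigma
print(gaussian_pdf(mu, mu, sigma))   # peak value 1/(sqrt(2*pi)*sigma)
print(z.mean(), z.std())             # approximately 0 and 1
```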

5 Conditional probabilities. Conditional probability: the probability of x once y is observed,
P(x|y) = P(x, y) / P(y)
Statistical independence: variables X and Y are statistically independent iff
P(x, y) = P(x)P(y)
implying:
P(x|y) = P(x)
P(y|x) = P(y)

6 Basic rules. Law of total probability: the marginal distribution of a variable is obtained from a joint distribution summing over all possible values of the other variable (sum rule):
P(x) = Σ_{y∈Y} P(x, y)    P(y) = Σ_{x∈X} P(x, y)
Product rule: the conditional probability definition implies that
P(x, y) = P(x|y)P(y) = P(y|x)P(x)
Bayes rule:
P(y|x) = P(x|y)P(y) / P(x)
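A small sketch (not from the slides) that checks the three rules numerically on an arbitrary discrete joint distribution:

```python
import numpy as np

# Arbitrary joint distribution P(x, y) over two binary variables (rows: x, cols: y)
P_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule: marginals are obtained by summing out the other variable
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

# Product rule: P(x, y) = P(x|y) P(y)
P_x_given_y = P_xy / P_y                      # column j holds P(x | y = j)
assert np.allclose(P_x_given_y * P_y, P_xy)

# Bayes rule: P(y|x) = P(x|y) P(y) / P(x)
P_y_given_x_direct = P_xy / P_x[:, None]      # row i holds P(y | x = i)
P_y_given_x_bayes = (P_x_given_y * P_y) / P_x[:, None]
assert np.allclose(P_y_given_x_direct, P_y_given_x_bayes)
print(P_x, P_y)
```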

7 Playing with probabilities. Use the rules! The basic rules allow us to derive a certain probability from knowledge of related ones. All our manipulations will be applications of the three basic rules. The basic rules apply to any number of variables, e.g.:
P(y) = Σ_x Σ_z P(x, y, z)    (sum rule)
     = Σ_x Σ_z P(y|x, z)P(x, z)    (product rule)
     = Σ_x Σ_z [P(x|y, z)P(y|z) / P(x|z)] P(x, z)    (Bayes rule)

8 Playing with probabilities. Example:
P(y|x, z) = P(x, z|y)P(y) / P(x, z)    (Bayes rule)
          = P(x, z|y)P(y) / (P(x|z)P(z))    (product rule)
          = P(x|z, y)P(z|y)P(y) / (P(x|z)P(z))    (product rule)
          = P(x|z, y)P(z, y) / (P(x|z)P(z))    (product rule)
          = P(x|z, y)P(y|z)P(z) / (P(x|z)P(z))    (product rule)
          = P(x|z, y)P(y|z) / P(x|z)

9 Graphical models. Why? All probabilistic inference and learning amount to repeated applications of the sum and product rules. Probabilistic graphical models are graphical representations of the qualitative aspects of probability distributions, allowing us to: visualize the structure of a probabilistic model in a simple and intuitive way; discover properties of the model, such as conditional independencies, by inspecting the graph; express complex computations for inference and learning in terms of graphical manipulations; represent multiple probability distributions with the same graph, abstracting from their quantitative aspects (e.g. discrete vs continuous distributions).

10 Bayesian Networks (BN). Description: directed graphical models representing joint distributions, e.g.
p(a, b, c) = p(c|a, b)p(a, b) = p(c|a, b)p(b|a)p(a)
Fully connected graphs contain no independency assumptions. Joint probabilities can be decomposed in many different ways by the product rule. Assuming an ordering of the variables, a probability distribution over K variables can be decomposed into:
p(x_1, ..., x_K) = p(x_K|x_1, ..., x_{K−1}) ... p(x_2|x_1)p(x_1)

11 Bayesian Networks. Modeling independencies: a graph that is not fully connected contains independency assumptions. The joint probability of a Bayesian network with K nodes is computed as the product of the probabilities of the nodes given their parents:
p(x) = Π_{i=1}^{K} p(x_i|pa_i)
The joint probability for the BN in the figure (nodes x_1, ..., x_7) is written as
p(x_1)p(x_2)p(x_3)p(x_4|x_1, x_2, x_3)p(x_5|x_1, x_3)p(x_6|x_4)p(x_7|x_4, x_5)
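A minimal sketch of this factorization for the 7-node network above; the binary CPTs below are made up purely for illustration:

```python
import itertools

# Hypothetical CPTs for binary variables x1..x7; parent sets as in the factorization above.
# Each CPT maps a tuple of parent values to P(x_i = 1 | parents).
p1 = {(): 0.6}
p2 = {(): 0.3}
p3 = {(): 0.5}
p4 = {pa: 0.2 + 0.2 * sum(pa) for pa in itertools.product([0, 1], repeat=3)}   # pa = (x1, x2, x3)
p5 = {pa: 0.1 + 0.4 * max(pa) for pa in itertools.product([0, 1], repeat=2)}   # pa = (x1, x3)
p6 = {pa: 0.7 if pa[0] else 0.2 for pa in itertools.product([0, 1], repeat=1)} # pa = (x4,)
p7 = {pa: 0.25 * (pa[0] + pa[1]) + 0.1 for pa in itertools.product([0, 1], repeat=2)}  # pa = (x4, x5)

def bernoulli(p, value):
    return p if value == 1 else 1.0 - p

def joint(x):
    """p(x) = prod_i p(x_i | pa_i) for the 7-node network in the figure."""
    x1, x2, x3, x4, x5, x6, x7 = x
    return (bernoulli(p1[()], x1) * bernoulli(p2[()], x2) * bernoulli(p3[()], x3)
            * bernoulli(p4[(x1, x2, x3)], x4) * bernoulli(p5[(x1, x3)], x5)
            * bernoulli(p6[(x4,)], x6) * bernoulli(p7[(x4, x5)], x7))

# Sanity check: the joint sums to one over all 2^7 configurations
total = sum(joint(x) for x in itertools.product([0, 1], repeat=7))
print(total)  # 1.0 up to floating point error
```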

12 Conditional independence. Introduction: two variables a, b are independent (written a ⊥ b) if:
p(a, b) = p(a)p(b)
Two variables a, b are conditionally independent given c (written a ⊥ b | c) if:
p(a, b|c) = p(a|c)p(b|c)
Independency assumptions can be verified by repeated applications of the sum and product rules. Graphical models allow us to verify them directly through the d-separation criterion.

13 d-separation. Tail-to-tail. Joint distribution:
p(a, b, c) = p(a|c)p(b|c)p(c)
a and b are not independent (written a ⊥̸ b):
p(a, b) = Σ_c p(a|c)p(b|c)p(c) ≠ p(a)p(b)
a and b are conditionally independent given c:
p(a, b|c) = p(a, b, c) / p(c) = p(a|c)p(b|c)
c is tail-to-tail wrt the path between a and b, as it is connected to the tails of the two arrows (a ← c → b).

14 d-separation. Head-to-tail. Joint distribution:
p(a, b, c) = p(b|c)p(c|a)p(a) = p(b|c)p(a|c)p(c)
a and b are not independent:
p(a, b) = p(a) Σ_c p(b|c)p(c|a) ≠ p(a)p(b)
a and b are conditionally independent given c:
p(a, b|c) = p(b|c)p(a|c)p(c) / p(c) = p(b|c)p(a|c)
c is head-to-tail wrt the path between a and b, as it is connected to the head of one arrow and to the tail of the other (a → c → b).

15 d-separation. Head-to-head. Joint distribution:
p(a, b, c) = p(c|a, b)p(a)p(b)
a and b are independent:
p(a, b) = Σ_c p(c|a, b)p(a)p(b) = p(a)p(b)
a and b are not conditionally independent given c:
p(a, b|c) = p(c|a, b)p(a)p(b) / p(c) ≠ p(a|c)p(b|c)
c is head-to-head wrt the path between a and b, as it is connected to the heads of the two arrows (a → c ← b).

16 d-separation General Head-to-head Let a descendant of a node x be any node which can be reached from x with a path following the direction of the arrows A head-to-head node c unblocks the dependency path between its parents if either itself or any of its descendants receives evidence

17 Example of head-to-head connection. A toy regulatory network: CPTs. Genes A and B have independent prior probabilities:
P(A = active) = 0.3, P(A = inactive) = 0.7
P(B = active) = 0.3, P(B = inactive) = 0.7
Gene C can be enhanced by both A and B: its CPT P(C|A, B) gives the probability of C being active or inactive for each of the four configurations of A and B.

18 Example of head-to-head connection. Probability of A active (1). Prior: P(A = 1) = 1 − P(A = 0) = 0.3. Posterior after observing active C:
P(A = 1|C = 1) = P(C = 1|A = 1)P(A = 1) / P(C = 1)
Note: the probability that A is active increases after observing that its regulated gene C is active.

19 Example of head-to-head connection. Derivation:
P(C = 1|A = 1) = Σ_{B∈{0,1}} P(C = 1, B|A = 1)
              = Σ_{B∈{0,1}} P(C = 1|B, A = 1)P(B|A = 1)
              = Σ_{B∈{0,1}} P(C = 1|B, A = 1)P(B)
P(C = 1) = Σ_{B∈{0,1}} Σ_{A∈{0,1}} P(C = 1, B, A)
         = Σ_{B∈{0,1}} Σ_{A∈{0,1}} P(C = 1|B, A)P(B)P(A)

20 Example of head-to-head connection. Probability of A active. Posterior after observing that B is also active:
P(A = 1|C = 1, B = 1) = P(C = 1|A = 1, B = 1)P(A = 1|B = 1) / P(C = 1|B = 1)
Note: the probability that A is active decreases after observing that B is also active. The active B explains away the observation that C is active. The probability is still greater than the prior one (0.3), because the observation that C is active still gives some evidence in favour of an active A.
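A small sketch of these computations: the priors (0.3/0.7) are from the slides, while the CPT values for P(C|A, B) are made-up placeholders, since the original table values did not survive transcription:

```python
import itertools

# Priors from the slides
P_A = {1: 0.3, 0: 0.7}
P_B = {1: 0.3, 0: 0.7}

# Hypothetical enhancer CPT: P(C = 1 | A, B), chosen only for illustration
P_C1_given_AB = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.6, (0, 0): 0.05}

def P_C1(a=None, b=None):
    """P(C=1 [, given evidence on A and/or B]) by summing out the unobserved parents."""
    total = 0.0
    for av, bv in itertools.product([0, 1], repeat=2):
        if a is not None and av != a:
            continue
        if b is not None and bv != b:
            continue
        w = (1.0 if a is not None else P_A[av]) * (1.0 if b is not None else P_B[bv])
        total += P_C1_given_AB[(av, bv)] * w
    return total

prior = P_A[1]
posterior_C = P_C1(a=1) * P_A[1] / P_C1()                    # P(A=1 | C=1)
posterior_CB = P_C1_given_AB[(1, 1)] * P_A[1] / P_C1(b=1)    # P(A=1 | C=1, B=1), A independent of B
print(prior, posterior_C, posterior_CB)
# posterior_C > prior, and posterior_CB < posterior_C (explaining away) but still > prior
```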

21 General d-separation criterion. d-separation definition: given a generic Bayesian network and A, B, C arbitrary non-intersecting sets of nodes, the sets A and B are d-separated by C if all paths from any node in A to any node in B are blocked. A path is blocked if it includes at least one node such that either: the arrows on the path meet tail-to-tail or head-to-tail at the node, and the node is in C; or the arrows on the path meet head-to-head at the node, and neither the node nor any of its descendants is in C. d-separation implies conditional independence: the sets A and B are independent given C (A ⊥ B | C) if they are d-separated by C.

22 Example of general d-separation. Nodes a and b are not d-separated by c: node f is tail-to-tail and not observed; node e is head-to-head and its child c is observed. Nodes a and b are d-separated by f: node f is tail-to-tail and observed.

23 Example of i.i.d. samples. Maximum-likelihood: we are given a set of instances D = {x_1, ..., x_N} drawn from a univariate Gaussian with unknown mean µ. All paths between x_i and x_j are blocked if we condition on µ, so the examples are independent of each other given µ:
p(D|µ) = Π_{i=1}^{N} p(x_i|µ)
A set of nodes with the same variable type and connections can be compactly represented using the plate notation.

24 Markov blanket (or boundary). Definition: given a directed graph with D nodes, the Markov blanket of node x_i is the minimal set of nodes making x_i independent of the rest of the graph:
p(x_i|x_{j≠i}) = p(x_1, ..., x_D) / p(x_{j≠i}) = p(x_1, ..., x_D) / ∫ p(x_1, ..., x_D) dx_i = Π_{k=1}^{D} p(x_k|pa_k) / ∫ Π_{k=1}^{D} p(x_k|pa_k) dx_i
All factors which do not include x_i cancel between numerator and denominator. The only remaining factors are: p(x_i|pa_i), the probability of x_i given its parents, and the factors p(x_j|pa_j) in which pa_j includes x_i, i.e. the children of x_i together with their co-parents.

25 Markov blanket (or boundary). d-separation: each parent x_j of x_i will be head-to-tail or tail-to-tail in the path between x_i and any of x_j's other neighbours → blocked. Each child x_j of x_i will be head-to-tail in the path between x_i and any of x_j's children → blocked. Each co-parent x_k of a child x_j of x_i will be head-to-tail or tail-to-tail in the path between x_j and any of x_k's other neighbours → blocked.

26 Markov Networks (MN) or Markov random fields. Undirected graphical models: they graphically model the joint probability of a set of variables encoding independency assumptions (as BNs do), but they do not assign a directionality to pairwise interactions, and do not assume that local interactions are proper conditional probabilities (more freedom in parametrizing them).

27 Markov Networks (MN). d-separation definition: given A, B, C arbitrary non-intersecting sets of nodes, the sets A and B are d-separated by C if all paths from any node in A to any node in B are blocked. A path is blocked if it includes at least one node from C (much simpler than in BNs). The sets A and B are independent given C (A ⊥ B | C) if they are d-separated by C.

28 Markov Networks (MN). Markov blanket: the Markov blanket of a variable is simply the set of its immediate neighbours.

29 Markov Networks (MN). Factorization properties: we need to decompose the joint probability into factors (like p(x_i|pa_i) in BNs) in a way consistent with the independency assumptions. Two nodes x_i and x_j not connected by a link are independent given the rest of the network (any path between them contains at least another node), so they should stay in separate factors. Factors are associated with cliques in the network (i.e. fully connected subnetworks). A possible option is to associate factors only with maximal cliques.

30 Markov Networks (MN). Joint probability (1):
p(x) = (1/Z) Π_C Ψ_C(x_C)
Ψ_C(x_C): Val(x_C) → IR+ is a positive function of the values of the variables in clique C (called clique potential). The joint probability is the (normalized) product of the clique potentials of all (maximal) cliques in the network. The partition function Z is a normalization constant assuring that the result is a probability (i.e. Σ_x p(x) = 1):
Z = Σ_x Π_C Ψ_C(x_C)

31 Markov Networks (MN). Joint probability (2): in order to guarantee that potential functions are positive, they are typically represented as exponentials:
Ψ_C(x_C) = exp(−E(x_C))
E(x_C) is called the energy function: low energy means a highly probable configuration. The joint probability becomes a normalized exponential of a sum of energies:
p(x) = (1/Z) exp(−Σ_C E(x_C))
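A brute-force sketch on a toy two-clique network with made-up energies, just to make the normalization explicit:

```python
import itertools
import math

# Toy Markov network over binary variables x1..x3 with maximal cliques {x1,x2} and {x2,x3}.
# Energies are arbitrary illustrative numbers; Psi_C = exp(-E_C).
def E12(x1, x2):
    return 0.0 if x1 == x2 else 1.0   # prefers agreeing neighbours

def E23(x2, x3):
    return 0.0 if x2 == x3 else 0.5

def unnormalized(x):
    x1, x2, x3 = x
    return math.exp(-(E12(x1, x2) + E23(x2, x3)))

# Partition function: sum of the unnormalized product over all configurations
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def p(x):
    return unnormalized(x) / Z

print(Z, sum(p(x) for x in itertools.product([0, 1], repeat=3)))  # second value is 1.0
```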

32 Markov Networks (MN). Comparison to BN. Advantage: more freedom in defining the clique potentials, as they don't need to represent a probability distribution. Problem: the partition function Z has to be computed summing over all possible values of the variables, which is intractable in general. Solutions: efficient algorithms exist for certain types of models (e.g. linear chains); otherwise Z is approximated.

33 MN Example: image de-noising. Description: we are given an image as a grid of binary pixels y_i ∈ {−1, +1}. The image is a noisy version of the original one, where each pixel was flipped with a certain (low) probability. The aim is to design a model capable of recovering the clean version.

34 MN Example: image de-noising. Markov network: we represent the true image as a grid of points x_i, and connect each point to its observed noisy version y_i. The MN has two maximal clique types: {x_i, y_i} for each true/observed pixel pair, and {x_i, x_j} for neighbouring pixels. The MN has one unobserved non-maximal clique type: the true pixel {x_i}.

35 MN Example: image de-noising. Energy functions: we assume low noise, so the true pixel should usually have the same sign as the observed one:
E(x_i, y_i) = −α x_i y_i,  α > 0
(same-sign pixels have lower energy, hence higher probability). We assume neighbouring pixels should usually have the same sign (locality principle, e.g. background colour):
E(x_i, x_j) = −β x_i x_j,  β > 0
We assume the image can have a preference for a certain pixel value (either positive or negative):
E(x_i) = γ x_i

36 MN Example: image de-noising. Joint probability: the joint probability becomes
p(x, y) = (1/Z) exp(−Σ_C E(x_C, y_C)) = (1/Z) exp(α Σ_i x_i y_i + β Σ_{i,j} x_i x_j − γ Σ_i x_i)
α, β, γ are parameters of the model.

37 MN Example: image de-noising. Note: all cliques of the same type share the same parameter (e.g. the same α for all cliques {x_i, y_i}). This parameter sharing (or tying) allows us to: reduce the number of parameters, simplifying learning; build a network template which will work on images of different sizes.

38 MN Example: image de-noising. Inference: assuming the parameters are given, the network can be used to produce a clean version of the image: (1) we replace y_i with its observed value for each i; (2) we compute the most probable configuration for the true pixels given the observed ones. The second computation is complex and can be done in approximate or exact ways (we'll see), with results of different quality.
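One simple approximate scheme for step 2 is iterated conditional modes (not detailed on this slide); a sketch under the energy model above, with assumed parameter values:

```python
import numpy as np

alpha, beta, gamma = 2.0, 1.0, 0.0   # assumed values for the model parameters

def local_energy(x, y, i, j, value):
    """Energy terms involving pixel x[i, j] set to `value` (+1 or -1)."""
    e = -alpha * value * y[i, j] + gamma * value
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < x.shape[0] and 0 <= nj < x.shape[1]:
            e -= beta * value * x[ni, nj]
    return e

def icm_denoise(y, sweeps=5):
    """Iterated conditional modes: greedily flip pixels that lower the energy."""
    x = y.copy()
    for _ in range(sweeps):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                if local_energy(x, y, i, j, -x[i, j]) < local_energy(x, y, i, j, x[i, j]):
                    x[i, j] = -x[i, j]
    return x

# Tiny demo: clean stripe image, add noise, then denoise
clean = np.where(np.arange(64) % 16 < 8, 1, -1)[None, :] * np.ones((64, 1), dtype=int)
noisy = clean * np.where(np.random.rand(*clean.shape) < 0.1, -1, 1)
restored = icm_denoise(noisy)
print((noisy != clean).mean(), (restored != clean).mean())  # error rate should drop
```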

39 MN Example: image de-noising (figure: de-noising results).

40 Relationship between directed and undirected graphs. Independency assumptions: directed and undirected graphs are different ways of representing conditional independency assumptions. Some assumptions can be perfectly represented in both formalisms, some in only one, some in neither of the two. However, there is a tight connection between the two: it is rather simple to convert a specific directed graphical model into an undirected one (but different directed graphs can be mapped to the same undirected graph). The converse is more complex because of normalization issues.

41 From a directed to an undirected model. Linear chain: the joint probability in the directed model (a) is:
p(x) = p(x_1)p(x_2|x_1)p(x_3|x_2) ... p(x_N|x_{N−1})
The joint probability in the undirected model (b) is:
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N−1,N}(x_{N−1}, x_N)
In order for the latter to represent the same distribution as the former, we set:
ψ_{1,2}(x_1, x_2) = p(x_1)p(x_2|x_1)
ψ_{2,3}(x_2, x_3) = p(x_3|x_2)
...
ψ_{N−1,N}(x_{N−1}, x_N) = p(x_N|x_{N−1})
Z = 1

42 From a directed to an undirected model. Multiple parents: if a node has multiple parents, we need to construct a clique containing all of them, in order for its potential to represent the conditional probability. We add an undirected link between each pair of parents of the node. The process is called moralization (marrying the parents, old fashioned...).

43 From a directed to an undirected model. Generic graph: (1) build the moral graph, connecting the parents in the directed graph and dropping all arrows; (2) initialize all clique potentials to 1; (3) for each conditional probability, choose a clique containing all its variables, and multiply its potential by the conditional probability; (4) set Z = 1.
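A minimal sketch of the moralization step on an adjacency-list representation (illustrative only):

```python
from itertools import combinations

def moralize(parents):
    """Given a dict node -> list of parents, return the undirected moral graph
    as a dict node -> set of neighbours."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    und = {n: set() for n in nodes}
    for child, ps in parents.items():
        # drop arrow directions: connect the child to each of its parents
        for p in ps:
            und[child].add(p)
            und[p].add(child)
        # marry the parents: connect every pair of parents of the same child
        for p1, p2 in combinations(ps, 2):
            und[p1].add(p2)
            und[p2].add(p1)
    return und

# Example: a v-structure x1 -> x3 <- x2, plus x3 -> x4
parents = {"x3": ["x1", "x2"], "x4": ["x3"]}
print(moralize(parents))  # x1 and x2 become linked by moralization
```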

44 Inference in graphical models. Description: assume we have evidence e on the state of a subset E of the variables in the model. Inference amounts to computing the posterior probability of a subset X of the non-observed variables given the observations: p(X|E = e). Note: when we need to distinguish between variables and their values, we will indicate random variables with uppercase letters and their values with lowercase ones.

45 Inference in graphical models. Efficiency: we can always compute the posterior probability as the ratio of two joint probabilities:
p(X|E = e) = p(X, E = e) / p(E = e)
The problem consists of estimating such joint probabilities when dealing with a large number of variables. Directly working on the full joint probability requires time exponential in the number of variables: for instance, if all N variables are discrete and take one of K possible values, a joint probability table has K^N entries. We would like to exploit the structure in graphical models to do inference more efficiently.

46 Inference in graphical models. Inference on a chain (1):
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ... ψ_{N−2,N−1}(x_{N−2}, x_{N−1}) ψ_{N−1,N}(x_{N−1}, x_N)
The marginal probability of an arbitrary x_n is:
p(x_n) = Σ_{x_1} ... Σ_{x_{n−1}} Σ_{x_{n+1}} ... Σ_{x_N} p(x)
Only ψ_{N−1,N}(x_{N−1}, x_N) is involved in the last summation, which can be computed first:
µ_β(x_{N−1}) = Σ_{x_N} ψ_{N−1,N}(x_{N−1}, x_N)
giving a function of x_{N−1} which can be used to compute the successive summation:
µ_β(x_{N−2}) = Σ_{x_{N−1}} ψ_{N−2,N−1}(x_{N−2}, x_{N−1}) µ_β(x_{N−1})

47 Inference in graphical models. Inference on a chain (2): the same procedure can be applied starting from the other end of the chain, giving
µ_α(x_2) = Σ_{x_1} ψ_{1,2}(x_1, x_2)
up to µ_α(x_n). The marginal probability is thus computed from the normalized contributions coming from both ends as:
p(x_n) = (1/Z) µ_α(x_n) µ_β(x_n)
The partition function Z can now be computed just by summing µ_α(x_n) µ_β(x_n) over all possible values of x_n.

48 Inference in graphical models. Inference as message passing: we can think of µ_α(x_n) as a message passed from x_{n−1} to x_n:
µ_α(x_n) = Σ_{x_{n−1}} ψ_{n−1,n}(x_{n−1}, x_n) µ_α(x_{n−1})
We can think of µ_β(x_n) as a message passed from x_{n+1} to x_n:
µ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) µ_β(x_{n+1})
Each outgoing message is obtained multiplying the incoming message by the clique potential, and summing over the values of the node.

49 Inference in graphical models. Full message passing: suppose we want to know the marginal probabilities of a number of different variables x_i: (1) we send a message from µ_α(x_1) up to µ_α(x_N); (2) we send a message from µ_β(x_N) down to µ_β(x_1). If all nodes store their messages, we can compute any marginal probability as p(x_i) = (1/Z) µ_α(x_i) µ_β(x_i) for any i, having sent just double the number of messages wrt a single marginal computation.
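A compact sketch of the two passes on a chain of discrete variables (random potentials), checked against brute-force marginalization:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 5                                        # K states per variable, N variables
psis = [rng.random((K, K)) for _ in range(N - 1)]  # psis[n][i, j] = psi(x_n = i, x_{n+1} = j)

# Forward messages mu_alpha and backward messages mu_beta
alpha = [np.ones(K)]
for n in range(N - 1):
    alpha.append(psis[n].T @ alpha[-1])       # sum over x_n of psi(x_n, x_{n+1}) * alpha(x_n)
beta = [np.ones(K)]
for n in reversed(range(N - 1)):
    beta.append(psis[n] @ beta[-1])           # sum over x_{n+1} of psi(x_n, x_{n+1}) * beta(x_{n+1})
beta = beta[::-1]

def marginal(n):
    unnorm = alpha[n] * beta[n]
    return unnorm / unnorm.sum()              # dividing by Z

# Brute-force check against the full joint table
joint = np.ones([K] * N)
for n, psi in enumerate(psis):
    shape = [1] * N
    shape[n], shape[n + 1] = K, K
    joint = joint * psi.reshape(shape)
joint /= joint.sum()
for n in range(N):
    axes = tuple(i for i in range(N) if i != n)
    assert np.allclose(marginal(n), joint.sum(axis=axes))
print(marginal(2))
```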

50 Inference in graphical models. Adding evidence: if some nodes x_e are observed, we simply use their observed values instead of summing over all their possible values when computing their messages. After normalization by the partition function Z, this gives the conditional probability given the evidence. Joint distribution for clique variables: it is easy to compute the joint probability of variables in the same clique:
p(x_{n−1}, x_n) = (1/Z) µ_α(x_{n−1}) ψ_{n−1,n}(x_{n−1}, x_n) µ_β(x_n)

51 Inference. Directed models: the procedure applies straightforwardly to a directed model, replacing clique potentials with conditional probabilities. The marginal p(x_n) does not need to be normalized, as it is already a correct distribution. When adding evidence, the message passing procedure computes the joint probability of the variable and the evidence, and it has to be normalized to obtain the conditional probability:
p(x_n|X_e = x_e) = p(x_n, X_e = x_e) / p(X_e = x_e) = p(x_n, X_e = x_e) / Σ_{x_n} p(x_n, X_e = x_e)

52 Inference. Inference on trees: efficient inference can be computed for the broader family of tree-structured models: (a) undirected trees, undirected graphs with a single path between each pair of nodes; (b) directed trees, directed graphs with a single node (the root) with no parents, and all other nodes with a single parent; (c) directed polytrees, directed graphs with multiple parents per node and multiple roots, but still a single (undirected) path between each pair of nodes.

53 Factor graphs. Description: efficient inference algorithms can be better explained using an alternative graphical representation called a factor graph. A factor graph is a graphical representation of a (directed or undirected) graphical model highlighting its factorization (conditional probabilities or clique potentials). The factor graph has one node for each node in the original graph, plus one additional node (of a different type) for each factor. A factor node has undirected links to each of the variable nodes appearing in the factor.

54 Factor graphs: examples. A BN over x_1, x_2, x_3 with joint p(x_3|x_1, x_2)p(x_1)p(x_2) can be represented with a single factor f(x_1, x_2, x_3) = p(x_3|x_1, x_2)p(x_1)p(x_2), or with separate factors f_c(x_1, x_2, x_3) = p(x_3|x_1, x_2), f_a(x_1) = p(x_1), f_b(x_2) = p(x_2).

55 Factor graphs: examples. Note: f is associated with the whole clique potential; f_a is associated with a maximal clique; f_b is associated with a non-maximal clique.

56 Inference. The sum-product algorithm: the sum-product algorithm is an efficient algorithm for exact inference on tree-structured graphs. It is a message passing algorithm, like its simpler version for chains. We will present it on factor graphs, thus unifying directed and undirected models. We will assume a tree-structured graph, giving rise to a factor graph which is a tree.

57 Inference. Computing marginals: we want to compute the marginal probability of x:
p(x) = Σ_{x∖x} p(x), i.e. summing the joint over all variables except x.
Generalizing the message passing scheme seen for chains, this can be computed as the product of the messages coming from all neighbouring factors f_s:
p(x) = (1/Z) Π_{s∈ne(x)} µ_{f_s→x}(x)

58 Inference. Factor messages: each factor message is the product of the messages coming from nodes other than x, times the factor itself, summed over all possible values of the factor variables other than x (x_1, ..., x_M):
µ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) Π_{m∈ne(f_s)∖x} µ_{x_m→f_s}(x_m)

59 Inference. Node messages: each message from a node x_m to a factor f_s is the product of the factor messages to x_m coming from factors other than f_s:
µ_{x_m→f_s}(x_m) = Π_{l∈ne(x_m)∖f_s} µ_{f_l→x_m}(x_m)

60 Inference. Initialization: message passing starts from the leaves, either factors or nodes. Messages from leaf factors are initialized to the factor itself (there is no x_m other than the destination to sum over):
µ_{f→x}(x) = f(x)
Messages from leaf nodes are initialized to 1:
µ_{x→f}(x) = 1

61 Inference. Message passing scheme: the node x whose marginal has to be computed is designated as root. Messages are sent from all leaves to their neighbours. Each internal node sends its message towards the root as soon as it has received messages from all its other neighbours. Once the root has collected all messages, the marginal can be computed as their product.

62 Inference. Full message passing scheme: in order to be able to compute marginals for any node, messages need to pass in all directions: (1) choose an arbitrary node as root; (2) collect messages for the root starting from the leaves; (3) send messages from the root down to the leaves. All messages are passed in all directions using only twice the number of computations needed for a single marginal.

63 Inference example. Consider the unnormalized distribution p(x) = f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4).

64 Inference example. Choose x_3 as root.

65 Inference example. Send the initial messages from the leaves: µ_{x_1→f_a}(x_1) = 1, µ_{x_4→f_c}(x_4) = 1.

66 Inference example. Send the messages from the factor nodes to x_2: µ_{f_a→x_2}(x_2) = Σ_{x_1} f_a(x_1, x_2), µ_{f_c→x_2}(x_2) = Σ_{x_4} f_c(x_2, x_4).

67 Inference example. Send the message from x_2 to the factor node f_b: µ_{x_2→f_b}(x_2) = µ_{f_a→x_2}(x_2) µ_{f_c→x_2}(x_2).

68 Inference example. Send the message from f_b to x_3: µ_{f_b→x_3}(x_3) = Σ_{x_2} f_b(x_2, x_3) µ_{x_2→f_b}(x_2).

69 Inference example. Send the message from the root x_3: µ_{x_3→f_b}(x_3) = 1.

70 Inference example. Send the message from f_b to x_2: µ_{f_b→x_2}(x_2) = Σ_{x_3} f_b(x_2, x_3).

71 Inference example. Send the messages from x_2 to the factor nodes: µ_{x_2→f_a}(x_2) = µ_{f_b→x_2}(x_2) µ_{f_c→x_2}(x_2), µ_{x_2→f_c}(x_2) = µ_{f_b→x_2}(x_2) µ_{f_a→x_2}(x_2).

72 Inference example. Send the messages from the factor nodes to the leaves: µ_{f_a→x_1}(x_1) = Σ_{x_2} f_a(x_1, x_2) µ_{x_2→f_a}(x_2), µ_{f_c→x_4}(x_4) = Σ_{x_2} f_c(x_2, x_4) µ_{x_2→f_c}(x_2).

73 Inference example. Compute, for instance, the (unnormalized) marginal for x_2:
p(x_2) = µ_{f_a→x_2}(x_2) µ_{f_b→x_2}(x_2) µ_{f_c→x_2}(x_2)
       = [Σ_{x_1} f_a(x_1, x_2)] [Σ_{x_3} f_b(x_2, x_3)] [Σ_{x_4} f_c(x_2, x_4)]
       = Σ_{x_1} Σ_{x_3} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4)
       = Σ_{x_1} Σ_{x_3} Σ_{x_4} p(x)
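The same schedule in code, with random tables for f_a, f_b, f_c (a sketch, not from the slides), checking the marginal for x_2 against brute force:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2
fa = rng.random((K, K))   # fa(x1, x2)
fb = rng.random((K, K))   # fb(x2, x3)
fc = rng.random((K, K))   # fc(x2, x4)

# Leaf messages
mu_x1_fa = np.ones(K)
mu_x4_fc = np.ones(K)
mu_x3_fb = np.ones(K)

# Messages towards x2
mu_fa_x2 = fa.T @ mu_x1_fa            # sum over x1
mu_fc_x2 = fc @ mu_x4_fc              # sum over x4
mu_fb_x2 = fb @ mu_x3_fb              # sum over x3

# Unnormalized marginal for x2, then normalization
p_x2 = mu_fa_x2 * mu_fb_x2 * mu_fc_x2
p_x2 = p_x2 / p_x2.sum()

# Brute-force check: p(x) proportional to fa(x1,x2) fb(x2,x3) fc(x2,x4)
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)
brute = joint.sum(axis=(0, 2, 3))
assert np.allclose(p_x2, brute / brute.sum())
print(p_x2)
```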

74 Inference. Normalization:
p(x) = (1/Z) Π_{s∈ne(x)} µ_{f_s→x}(x)
If the factor graph came from a directed model, a correct probability is already obtained as the product of the messages (Z = 1). If the factor graph came from an undirected model, we simply compute the normalization Z by summing the product of the messages over all possible values of the target variable:
Z = Σ_x Π_{s∈ne(x)} µ_{f_s→x}(x)

75 Inference. Factor marginals:
p(x_s) = (1/Z) f_s(x_s) Π_{m∈ne(f_s)} µ_{x_m→f_s}(x_m)
Factor marginals can be computed from the product of the messages of the neighbouring nodes and the factor itself. Adding evidence: if some nodes x_e are observed, we simply use their observed values instead of summing over all their possible values when computing their messages. After normalization by the partition function Z, this gives the conditional probability given the evidence.

76 Inference example. Directed model: take a directed graphical model over A, B, C, D with joint P(B|A, C)P(A)P(C|D)P(D), build a factor graph representing it (factors f_A = P(A), f_D = P(D), f_{C,D} = P(C|D), f_{A,B,C} = P(B|A, C)), and compute the marginal for a variable (e.g. B).

77 Inference example. Compute the marginal for B: the leaf factor nodes send their messages:
µ_{f_A→A}(A) = P(A)
µ_{f_D→D}(D) = P(D)

78 Inference example. Compute the marginal for B: A and D send their messages:
µ_{A→f_{A,B,C}}(A) = µ_{f_A→A}(A) = P(A)
µ_{D→f_{C,D}}(D) = µ_{f_D→D}(D) = P(D)

79 Inference example. Compute the marginal for B: f_{C,D} sends its message:
µ_{f_{C,D}→C}(C) = Σ_D P(C|D) µ_{D→f_{C,D}}(D) = Σ_D P(C|D)P(D)

80 Inference example. Compute the marginal for B: C sends its message:
µ_{C→f_{A,B,C}}(C) = µ_{f_{C,D}→C}(C) = Σ_D P(C|D)P(D)

81 Inference example. Compute the marginal for B: f_{A,B,C} sends its message:
µ_{f_{A,B,C}→B}(B) = Σ_A Σ_C P(B|A, C) µ_{C→f_{A,B,C}}(C) µ_{A→f_{A,B,C}}(A) = Σ_A Σ_C P(B|A, C)P(A) Σ_D P(C|D)P(D)

82 Inference example. Compute the marginal for B: the desired marginal is obtained:
P(B) = µ_{f_{A,B,C}→B}(B)
     = Σ_A Σ_C P(B|A, C)P(A) Σ_D P(C|D)P(D)
     = Σ_A Σ_C Σ_D P(B|A, C)P(A)P(C|D)P(D)
     = Σ_A Σ_C Σ_D P(A, B, C, D)

83 Inference example. Compute the joint for A, B, C: send the messages to the factor node f_{A,B,C}:
P(A, B, C) = f_{A,B,C}(A, B, C) µ_{A→f_{A,B,C}}(A) µ_{C→f_{A,B,C}}(C) µ_{B→f_{A,B,C}}(B)
           = P(B|A, C) · P(A) · Σ_D P(C|D)P(D) · 1
           = Σ_D P(B|A, C)P(A)P(C|D)P(D)
           = Σ_D P(A, B, C, D)

84 Inference. Finding the most probable configuration: given a joint probability distribution p(x), we wish to find the configuration of the variables x having the highest probability:
x^max = argmax_x p(x)
whose probability is:
p(x^max) = max_x p(x)
Note: we want the configuration which is jointly maximal for all variables; we cannot simply compute p(x_i) for each i (using the sum-product algorithm) and maximize it.

85 Inference. The max-product algorithm:
p(x^max) = max_x p(x) = max_{x_1} ... max_{x_M} p(x)
As for the sum-product algorithm, we can exploit the factorization of the distribution to compute the maximum efficiently: it suffices to replace sum with max in the sum-product algorithm. Linear chain:
max_x p(x) = max_{x_1} ... max_{x_N} (1/Z) ψ_{1,2}(x_1, x_2) ... ψ_{N−1,N}(x_{N−1}, x_N)
           = (1/Z) max_{x_1} [ψ_{1,2}(x_1, x_2) ... max_{x_N} ψ_{N−1,N}(x_{N−1}, x_N)]

86 Inference. Message passing: as for the sum-product algorithm, the max-product can be seen as message passing over the graph. The algorithm is thus easily applied to tree-structured graphs via their factor trees:
µ_{f→x}(x) = max_{x_1,...,x_M} f(x, x_1, ..., x_M) Π_{m∈ne(f)∖x} µ_{x_m→f}(x_m)
µ_{x→f}(x) = Π_{l∈ne(x)∖f} µ_{f_l→x}(x)

87 Inference. Recovering the maximal configuration: messages are passed from the leaves to an arbitrarily chosen root x_r. The probability of the maximal configuration is readily obtained as:
p(x^max) = max_{x_r} Π_{l∈ne(x_r)} µ_{f_l→x_r}(x_r)
The maximal configuration for the root is obtained as:
x_r^max = argmax_{x_r} Π_{l∈ne(x_r)} µ_{f_l→x_r}(x_r)
We still need to recover the maximal configuration for the other variables.

88 Inference. Recovering the maximal configuration: when sending a message towards x, each factor node should store the configuration of the other variables which gave the maximum:
φ_{f→x}(x) = argmax_{x_1,...,x_M} f(x, x_1, ..., x_M) Π_{m∈ne(f)∖x} µ_{x_m→f}(x_m)
When the maximal configuration for the root node x_r has been obtained, it can be used to retrieve the maximal configuration for the variables in the neighbouring factors from:
x_1^max, ..., x_M^max = φ_{f→x_r}(x_r^max)
The procedure can be repeated back-tracking to the leaves, retrieving the maximal values of all variables.

89 Recovering the maximal configuration. Example for a linear chain:
x_N^max = argmax_{x_N} µ_{f_{N−1,N}→x_N}(x_N)
x_{N−1}^max = φ_{f_{N−1,N}→x_N}(x_N^max)
x_{N−2}^max = φ_{f_{N−2,N−1}→x_{N−1}}(x_{N−1}^max)
...
x_1^max = φ_{f_{1,2}→x_2}(x_2^max)
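A sketch of max-product with back-tracking on a chain with random potentials (plain products are fine here for a short chain; the log-space variant is discussed two slides below):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
K, N = 3, 5
psis = [rng.random((K, K)) for _ in range(N - 1)]   # psis[n][i, j] = psi(x_n = i, x_{n+1} = j)

# Forward max-product pass, storing the argmax (phi) for back-tracking
mu = np.ones(K)
phi = []
for psi in psis:
    scores = psi * mu[:, None]          # scores[x_n, x_{n+1}]
    phi.append(scores.argmax(axis=0))   # best previous state for each value of x_{n+1}
    mu = scores.max(axis=0)

# Back-track from the best last state
x = [int(mu.argmax())]
for back in reversed(phi):
    x.append(int(back[x[-1]]))
x = x[::-1]

# Brute-force check over all K^N configurations
def score(cfg):
    return np.prod([psis[n][cfg[n], cfg[n + 1]] for n in range(N - 1)])
best = max(itertools.product(range(K), repeat=N), key=score)
assert tuple(x) == best
print(x)
```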

90 Recovering the maximal configuration. Trellis for a linear chain: a trellis (or lattice) diagram shows the K possible states of each variable x_n, one per row. For each state k of a variable x_n, φ_{f_{n−1,n}→x_n}(x_n) defines a unique (maximal) previous state, linked by an edge in the diagram. Once the maximal state for the last variable x_N is chosen, the maximal states of the other variables are recovered following the edges backward.

91 Inference. Underflow issues: the max-product algorithm relies on products (no summations). Products of many small probabilities can lead to underflow problems. This can be addressed by computing the logarithm of the probability instead. The logarithm is monotonic, so the proper maximal configuration is recovered:
log(max_x p(x)) = max_x log p(x)
The effect is replacing products with sums (of logs) in the max-product algorithm, giving the max-sum algorithm.

92 Inference Instances of inference algorithms The sum-product and max-product algorithms are generic algorithms for exact inference in tree-structured models. Inference algorithms for certain specific types of models can be seen as special cases of them We will see examples (forward-backward as sum-product and Viterbi as max-product) for Hidden Markov Models and Conditional Random Fields

93 Inference. Exact inference on general graphs: the sum-product and max-product algorithms can be applied to tree-structured graphs, but many practical applications require graphs with loops. An extension of these algorithms to generic graphs can be achieved with the junction tree algorithm. The algorithm does not work on factor graphs, but on junction trees: tree-structured graphs whose nodes contain clusters of variables of the original graph. A message passing scheme analogous to the sum-product and max-product algorithms is run on the junction tree. Problem: the complexity of the algorithm is exponential in the maximal number of variables in a cluster, making it intractable for large complex graphs.

94 Inference. Approximate inference: in cases in which exact inference is intractable, we resort to approximate inference techniques. A number of techniques exist: loopy belief propagation, message passing on the original graph even if it contains loops; variational methods, deterministic approximations assuming the posterior probability (given the evidence) factorizes in a particular way; sampling methods, where an approximate posterior is obtained by sampling from the network.

95 Inference. Loopy belief propagation: apply the sum-product algorithm even if it is not guaranteed to provide an exact solution. We assume all nodes are in condition to send messages (i.e. they have already received a constant-1 message from all their neighbours). A message passing schedule is chosen in order to decide which nodes start sending messages (e.g. flooding: all nodes send messages in all directions at each time step). Information flows many times around the graph (because of the loops); each message on a link replaces the previous one and is only based on the most recent messages received from the other neighbours. The algorithm may eventually converge (no more changes in the messages passing through any link), depending on the specific model over which it is applied.

96 Approximate inference. Sampling methods: given the joint probability distribution p(x), a sample from the distribution is an instantiation of all the variables X according to the probability p. Samples can be used to approximate the probability of a certain assignment Y = y for a subset Y ⊆ X of the variables: we simply count the fraction of samples which are consistent with the desired assignment. We usually need to sample from a posterior distribution given some evidence E = e.

97 Sampling methods Markov Chain Monte Carlo (MCMC) We usually cannot directly sample from the posterior distribution (too expensive) We can instead build a random process which gradually samples from distributions closer and closer to the posterior The state of the process is an instantiation of all the variables The process can be seen as randomly traversing the graph of states moving from one state to another with a certain probability. After enough time, the probability of being in any particular state is the desired posterior. The random process is a Markov Chain
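A minimal Gibbs-sampling sketch (one standard MCMC scheme), shown on the toy two-clique Markov network used earlier; all numbers are illustrative:

```python
import itertools
import math
import random

random.seed(0)

def E12(x1, x2): return 0.0 if x1 == x2 else 1.0
def E23(x2, x3): return 0.0 if x2 == x3 else 0.5

def unnormalized(x):
    return math.exp(-(E12(x[0], x[1]) + E23(x[1], x[2])))

def gibbs(n_samples, burn_in=500):
    x = [random.choice([0, 1]) for _ in range(3)]
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(3):               # resample each variable given all the others
            weights = []
            for v in (0, 1):
                x[i] = v
                weights.append(unnormalized(x))
            x[i] = 0 if random.random() < weights[0] / sum(weights) else 1
        if t >= burn_in:
            samples.append(tuple(x))
    return samples

samples = gibbs(20000)
# Compare the empirical frequency of each configuration with the exact distribution
Z = sum(unnormalized(c) for c in itertools.product([0, 1], repeat=3))
for cfg in itertools.product([0, 1], repeat=3):
    print(cfg, samples.count(cfg) / len(samples), unnormalized(cfg) / Z)
```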

98 Learning graphical models. Parameter estimation: we assume the structure of the model is given. We are given a dataset of examples D = {x(1), ..., x(N)}; each example x(i) is a configuration for all (complete data) or some (incomplete data) of the variables in the model. We need to estimate the parameters of the model (CPDs or weights of the clique potentials) from the data. The simplest approach consists of learning the parameters that maximize the likelihood of the data:
θ^max = argmax_θ p(D|θ) = argmax_θ L(D, θ)

99 Learning Bayesian Networks. Maximum likelihood estimation, complete data:
p(D|θ) = Π_{i=1}^{N} p(x(i)|θ)    (examples independent given θ)
       = Π_{i=1}^{N} Π_{j=1}^{m} p(x_j(i)|pa_j(i), θ)    (factorization for BN)

100 Learning Bayesian Networks. Maximum likelihood estimation, complete data:
p(D|θ) = Π_{i=1}^{N} Π_{j=1}^{m} p(x_j(i)|pa_j(i), θ)    (factorization for BN)
       = Π_{i=1}^{N} Π_{j=1}^{m} p(x_j(i)|pa_j(i), θ_{X_j|pa_j})    (disjoint CPD parameters)

101 Learning graphical models. Maximum likelihood estimation, complete data: the parameters of each CPD can be estimated independently:
θ^max_{X_j|Pa_j} = argmax_{θ_{X_j|Pa_j}} Π_{i=1}^{N} p(x_j(i)|pa_j(i), θ_{X_j|Pa_j}) = argmax L(θ_{X_j|Pa_j}, D)
A discrete CPD P(X|U) can be represented as a table, with: a number of rows equal to the number Val(X) of configurations of X; a number of columns equal to the number Val(U) of configurations of its parents U; each table entry θ_{x|u} indicating the probability of a specific configuration X = x given its parents U = u.

102 Learning graphical models. Maximum likelihood estimation, complete data: replacing p(x(i)|pa(i)) with θ_{x(i)|u(i)}, the local likelihood of a single CPD becomes:
L(θ_{X|Pa}, D) = Π_{i=1}^{N} p(x(i)|pa(i), θ_{X|Pa}) = Π_{i=1}^{N} θ_{x(i)|u(i)} = Π_{u∈Val(U)} Π_{x∈Val(X)} θ_{x|u}^{N_{u,x}}
where N_{u,x} is the number of times the specific configuration X = x, U = u was found in the data.

103 Learning graphical models. Maximum likelihood estimation, complete data: a column in the CPD table contains a multinomial distribution over the values of X for a certain configuration of the parents U. Thus each column should sum to one: Σ_x θ_{x|u} = 1. Parameters of different columns can be estimated independently. For each multinomial distribution, zeroing the gradient of the likelihood under the normalization constraint, we obtain:
θ^max_{x|u} = N_{u,x} / Σ_{x'} N_{u,x'}
The maximum likelihood parameters are simply the fractions of times in which the specific configurations were observed in the data.

104 Learning Bayesian Networks Example Training examples as (A, B, C) tuples: (act,act,act),(act,inact,act), (act,inact,act),(act,inact,inact), (inact,act,act),(inact,act,inact), (inact,inact,inact),(inact,inact,inact), (inact,inact,inact),(inact,inact,inact), (inact,inact,inact),(inact,inact,inact). Fill CPTs with counts

105 Learning Bayesian Networks. Example (same training tuples as above). Fill the CPTs with counts: A: active 4, inactive 8. B: active 3, inactive 9.

106 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise: A: active 4/12, inactive 8/12. B: active 3/12, inactive 9/12.

107 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise: A: active 0.33, inactive 0.67. B: active 0.25, inactive 0.75.

108 Learning Bayesian Networks. Example (same training tuples as above). Fill the CPT for C with counts, one column per configuration of A and B.

109 Learning Bayesian Networks. Example (same training tuples as above). Fill the CPT for C with counts:
A:          active             inactive
B:          active  inactive   active  inactive
C active:     1        2         1        0
C inactive:   0        1         1        6

110 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise:
A:          active             inactive
B:          active  inactive   active  inactive
C active:    1/1      2/3       1/2      0/6
C inactive:  0/1      1/3       1/2      6/6

111 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise:
A:          active             inactive
B:          active  inactive   active  inactive
C active:    1.0      0.67      0.5      0.0
C inactive:  0.0      0.33      0.5      1.0
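The same counting procedure in code, reproducing the tables above from the 12 tuples (a sketch):

```python
from collections import Counter
from itertools import product

data = [("act", "act", "act"), ("act", "inact", "act"), ("act", "inact", "act"),
        ("act", "inact", "inact"), ("inact", "act", "act"), ("inact", "act", "inact")] + \
       [("inact", "inact", "inact")] * 6

values = ["act", "inact"]

# Priors for A and B: fraction of samples with each value
for idx, name in ((0, "A"), (1, "B")):
    counts = Counter(t[idx] for t in data)
    print(name, {v: counts[v] / len(data) for v in values})

# CPT for C: one multinomial per parent configuration (column), normalized columnwise
for a, b in product(values, repeat=2):
    column = [t[2] for t in data if t[0] == a and t[1] == b]
    counts = Counter(column)
    total = len(column)
    print((a, b), {c: (counts[c] / total if total else None) for c in values})
```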

112 Learning graphical models. Adding priors: ML estimation tends to overfit the training set; configurations not appearing in the training set will receive zero probability. A common approach consists of combining ML with a prior probability on the parameters, achieving a maximum-a-posteriori estimate:
θ^max = argmax_θ p(D|θ)p(θ)

113 Learning graphical models. Dirichlet priors: the conjugate (read: natural) prior for a multinomial distribution is a Dirichlet distribution, with a parameter α_{x|u} for each possible value of x. The resulting maximum-a-posteriori estimate is:
θ^max_{x|u} = (N_{u,x} + α_{x|u}) / Σ_{x'} (N_{u,x'} + α_{x'|u})
The prior is like having observed α_{x|u} imaginary samples with configuration X = x, U = u.

114 Learning Bayesian Networks. Example (same training tuples as above). Fill the CPT for C with imaginary counts (priors), one column per configuration of A and B.

115 Learning Bayesian Networks. Example (same training tuples as above). Fill the CPT for C with imaginary counts (priors), one per entry:
A:          active             inactive
B:          active  inactive   active  inactive
C active:     1        1         1        1
C inactive:   1        1         1        1

116 Learning Bayesian Networks. Example (same training tuples as above). Add the counts from the training data:
A:          active             inactive
B:          active  inactive   active  inactive
C active:     2        3         2        1
C inactive:   1        2         2        7

117 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise:
A:          active             inactive
B:          active  inactive   active  inactive
C active:    2/3      3/5       2/4      1/8
C inactive:  1/3      2/5       2/4      7/8

118 Learning Bayesian Networks. Example (same training tuples as above). Normalize counts columnwise:
A:          active             inactive
B:          active  inactive   active  inactive
C active:    0.67     0.60      0.50     0.125
C inactive:  0.33     0.40      0.50     0.875
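The corresponding maximum-a-posteriori estimate in code, adding one imaginary count per entry (the pseudocount value of 1 is an assumption consistent with the fractions above):

```python
from collections import Counter
from itertools import product

data = [("act", "act", "act"), ("act", "inact", "act"), ("act", "inact", "act"),
        ("act", "inact", "inact"), ("inact", "act", "act"), ("inact", "act", "inact")] + \
       [("inact", "inact", "inact")] * 6

values = ["act", "inact"]
alpha = 1  # Dirichlet pseudocount for every (value, parent-configuration) entry

for a, b in product(values, repeat=2):
    column = [t[2] for t in data if t[0] == a and t[1] == b]
    counts = Counter(column)
    total = len(column) + alpha * len(values)
    print((a, b), {c: (counts[c] + alpha) / total for c in values})
```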

119 Learning graphical models Incomplete data With incomplete data, some of the examples miss evidence on some of the variables Counts of occurrences of different configurations cannot be computed if not all data are observed The full Bayesian approach of integrating over missing variables is often intractable in practice We need approximate methods to deal with the problem

120 Learning with missing data: Expectation-Maximization E-M for Bayesian nets in a nutshell Sufficient statistics (counts) cannot be computed (missing data) Fill-in missing data inferring them using current parameters (solve inference problem to get expected counts) Compute parameters maximizing likelihood (or posterior) of such expected counts Iterate the procedure to improve quality of parameters

121 Learning in Markov Networks. Maximum likelihood estimation: for general MNs, the likelihood function cannot be decomposed into independent pieces for each potential, because of the global normalization (Z):
L(E, D) = Π_{i=1}^{N} (1/Z) exp(Σ_C E_C(x_C(i)))
However, the likelihood is concave in the E_C, the energy functionals (here playing the role of log-potentials) whose parameters have to be estimated. The problem is an unconstrained concave maximization problem, solved by gradient ascent (or second order methods).

122 Learning in Markov Networks. Maximum likelihood estimation: for each configuration u_c ∈ Val(x_c) we have a parameter E_{c,u_c} ∈ IR (ignoring parameter tying). The partial derivative of the log-likelihood wrt E_{c,u_c} is:
∂ log L(E, D) / ∂E_{c,u_c} = Σ_{i=1}^{N} [δ(x_c(i), u_c) − p(x_c = u_c|E)] = N_{u_c} − N p(x_c = u_c|E)
The derivative is zero when the counts in the data correspond to the expected counts predicted by the model. In order to compute p(x_c = u_c|E), inference has to be performed on the Markov network, making learning quite expensive.
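A brute-force sketch of this gradient on a tiny network (two binary variables, one clique), computing the expected counts by exact enumeration, which is precisely what becomes expensive on larger graphs:

```python
import itertools
import math

VALS = [0, 1]
configs = list(itertools.product(VALS, repeat=2))

# One clique over (x1, x2); E[c] is the log-potential of configuration c
E = {c: 0.0 for c in configs}

# Toy dataset of (x1, x2) observations
data = [(0, 0), (0, 0), (1, 1), (0, 1), (0, 0)]

def Z(E):
    return sum(math.exp(E[c]) for c in configs)

def log_likelihood(E):
    return sum(E[x] - math.log(Z(E)) for x in data)

def gradient(E):
    z = Z(E)
    expected = {c: len(data) * math.exp(E[c]) / z for c in configs}   # N * p(x_c = u_c | E)
    counts = {c: sum(1 for x in data if x == c) for c in configs}     # N_{u_c}
    return {c: counts[c] - expected[c] for c in configs}

# A few steps of gradient ascent on the concave log-likelihood
lr = 0.1
for _ in range(200):
    g = gradient(E)
    E = {c: E[c] + lr * g[c] for c in configs}
print(log_likelihood(E), gradient(E))  # gradient approaches zero: counts match expectations
```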

123 Learning structure of graphical models Approaches constraint-based test conditional independencies on the data and construct a model satisfying them score-based assign a score to each possible structure, define a search procedure looking for the structure maximizing the score model-averaging assign a prior probability to each structure, and average prediction over all possible structures weighted by their probabilities (full Bayesian, intractable)


More information

Review: Directed Models (Bayes Nets)

Review: Directed Models (Bayes Nets) X Review: Directed Models (Bayes Nets) Lecture 3: Undirected Graphical Models Sam Roweis January 2, 24 Semantics: x y z if z d-separates x and y d-separation: z d-separates x from y if along every undirected

More information

CMPSCI 240: Reasoning about Uncertainty

CMPSCI 240: Reasoning about Uncertainty CMPSCI 240: Reasoning about Uncertainty Lecture 17: Representing Joint PMFs and Bayesian Networks Andrew McGregor University of Massachusetts Last Compiled: April 7, 2017 Warm Up: Joint distributions Recall

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Graphical Models Adrian Weller MLSALT4 Lecture Feb 26, 2016 With thanks to David Sontag (NYU) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Lecture 6: Graphical Models

Lecture 6: Graphical Models Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:

More information

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Machine Learning 0-70/5 70/5-78, 78, Spring 00 Probability 0 Aarti Singh Lecture, January 3, 00 f(x) µ x Reading: Bishop: Chap, Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Announcements Homework

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

4 : Exact Inference: Variable Elimination

4 : Exact Inference: Variable Elimination 10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference

More information

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1 Bayes Networks CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 59 Outline Joint Probability: great for inference, terrible to obtain

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Variable Elimination: Algorithm

Variable Elimination: Algorithm Variable Elimination: Algorithm Sargur srihari@cedar.buffalo.edu 1 Topics 1. Types of Inference Algorithms 2. Variable Elimination: the Basic ideas 3. Variable Elimination Sum-Product VE Algorithm Sum-Product

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Learning from Sensor Data: Set II. Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University

Learning from Sensor Data: Set II. Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University Learning from Sensor Data: Set II Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University 1 6. Data Representation The approach for learning from data Probabilistic

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Graphical Models - Part II

Graphical Models - Part II Graphical Models - Part II Bishop PRML Ch. 8 Alireza Ghane Outline Probabilistic Models Bayesian Networks Markov Random Fields Inference Graphical Models Alireza Ghane / Greg Mori 1 Outline Probabilistic

More information

Markov Networks.

Markov Networks. Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

The Monte Carlo Method: Bayesian Networks

The Monte Carlo Method: Bayesian Networks The Method: Bayesian Networks Dieter W. Heermann Methods 2009 Dieter W. Heermann ( Methods)The Method: Bayesian Networks 2009 1 / 18 Outline 1 Bayesian Networks 2 Gene Expression Data 3 Bayesian Networks

More information

Outline. Spring It Introduction Representation. Markov Random Field. Conclusion. Conditional Independence Inference: Variable elimination

Outline. Spring It Introduction Representation. Markov Random Field. Conclusion. Conditional Independence Inference: Variable elimination Probabilistic Graphical Models COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Outline It Introduction ti Representation Bayesian network Conditional Independence Inference:

More information

An Introduction to Probabilistic Graphical Models

An Introduction to Probabilistic Graphical Models An Introduction to Probabilistic Graphical Models Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Reference material D. Koller

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Introduction to Discrete Probability Theory and Bayesian Networks

Introduction to Discrete Probability Theory and Bayesian Networks Introduction to Discrete Probability Theory and Bayesian Networks Dr Michael Ashcroft October 10, 2012 This document remains the property of Inatas. Reproduction in whole or in part without the written

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

Alternative Parameterizations of Markov Networks. Sargur Srihari

Alternative Parameterizations of Markov Networks. Sargur Srihari Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models with Energy functions

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Networks: Representation, Variable Elimination

Bayesian Networks: Representation, Variable Elimination Bayesian Networks: Representation, Variable Elimination CS 6375: Machine Learning Class Notes Instructor: Vibhav Gogate The University of Texas at Dallas We can view a Bayesian network as a compact representation

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

Bayesian networks: approximate inference

Bayesian networks: approximate inference Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

More information

Introduction to Bayes Nets. CS 486/686: Introduction to Artificial Intelligence Fall 2013

Introduction to Bayes Nets. CS 486/686: Introduction to Artificial Intelligence Fall 2013 Introduction to Bayes Nets CS 486/686: Introduction to Artificial Intelligence Fall 2013 1 Introduction Review probabilistic inference, independence and conditional independence Bayesian Networks - - What

More information

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics EECS 281A / STAT 241A Statistical Learning Theory Solutions to Problem Set 1 Fall 2011 Issued: Thurs, September

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

Bayesian Networks. Motivation

Bayesian Networks. Motivation Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information