Graphical Models (Lecture 1 - Introduction)
1 Graphical Models Lecture 1 - Introduction. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009. Tibério Caetano: Graphical Models Lecture 1 - Introduction
2 Material on Graphical Models. Many good books: Chris Bishop's book Pattern Recognition and Machine Learning (the Graphical Models chapter is available from his webpage in pdf format, as well as all the figures, many used here in these slides!); Judea Pearl's Probabilistic Reasoning in Intelligent Systems; Steffen Lauritzen's Graphical Models. Unpublished material: Michael Jordan's unpublished book An Introduction to Probabilistic Graphical Models; Koller and Friedman's unpublished book Structured Probabilistic Models. Videos: Sam Roweis' videos on videolectures.net. Excellent!
# results in Google Scholar for each query (in quotes): Kalman Filter >0000; EM algorithm >; Hidden Markov Models >; Bayesian Networks >600; Markov Random Fields >5000; Particle Filters >4000; Mixture Models >4000; Conditional Random Fields >500; Markov Chain Monte Carlo >; Gibbs Sampling >8000.
4 Introduction. Graphical Models have been applied to: Image Processing, Speech Processing, Natural Language Processing, Document Processing, Pattern Recognition, Bioinformatics, Computer Vision, Economics, Physics, Social Sciences.
5 Physics
6 Biology
7 Music
8 Computer Vision: Matching
9 Applications in Vision and PR: Matching
10 Image Processing
11 Image Processing
12 Introduction. Technically, Graphical Models are multivariate probabilistic models which are structured in terms of conditional independence statements. Informally, they are models that represent a system by its parts and the possible relations among them, in a probabilistic way.
13 Introduction. Questions we want to ask about these models: estimating the parameters of the model given data; obtaining data samples from the model; computing probabilities of particular outcomes; finding the most likely outcome.
14 Univariate Example. p(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)). Estimate µ and σ given X = {x_1, ..., x_n}. Sample from p(x). Compute P(µ−σ ≤ x ≤ µ+σ) := ∫ from µ−σ to µ+σ of p(x) dx. Find argmax_x p(x).
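The four operations on this slide can be sketched in a few lines for the Gaussian; this is an illustrative sketch (function names are mine, not from the slides): ML estimation of µ and σ², sampling, density evaluation, and argmax_x p(x), which for a Gaussian is just x = µ.

```python
import math
import random

def fit_gaussian(xs):
    """ML estimates (mu, sigma^2); note ML divides by n, not n - 1."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

random.seed(0)
sample = [random.gauss(5.0, 2.0) for _ in range(100_000)]  # sampling step
mu_hat, var_hat = fit_gaussian(sample)                     # estimation step
mode = mu_hat                                              # argmax_x p(x) = mu
```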
15 Multivariate Case. We want to do the same things for multivariate distributions structured according to CI, efficiently and accurately. For example, given p(x_1, ..., x_n; θ): estimate θ given a sample X and a criterion (e.g. ML); compute p(x_A), A ⊆ {1, ..., n} (marginal distributions); find argmax over (x_1, ..., x_n) of p(x_1, ..., x_n; θ) (MAP assignment).
16 Naively... When trying to answer the relevant questions: p(x_1) = Σ_{x_2} ... Σ_{x_N} p(x_1, x_2, ..., x_N) costs O(|X_2| × ... × |X_N|), and similarly for other questions: NOT GOOD. We need compact representations of multivariate distributions.
17 When to Use Graphical Models. When the compactness of the model arises from conditional independence statements involving its random variables. CAUTION: Graphical Models are useful in such cases. If the probability space is structured in different ways, Graphical Models may not, and in principle should not, be the right framework to represent and deal with the probability distributions involved.
18 Graphical Models Lecture 2 - Basics. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009.
19 Notation. Basic definitions involving random quantities: X - a random variable. X = (X_1, ..., X_N) - a random vector. x = (x_1, ..., x_N) - a particular realization of X. 𝒳 - set of all realizations (sample space). X_A - a random vector of variables indexed by A ⊆ {1, ..., N}; x_A for realizations. X_Ã - the random vector comprised of all variables other than those in X_A, Ã = {1, ..., N} \ A; x_Ã for realizations. p(X_A := x_A) ≡ p(x_A) := probability that X_A assumes realization x_A. Basic properties of probabilities: p(x) ≥ 0 and Σ_{x ∈ 𝒳} p(x) = 1.
20 Conditioning and Marginalization. The two rules you will always need. Conditioning: p(x_A | x_B) = p(x_A, x_B) / p(x_B), for p(x_B) > 0. Marginalization: p(x_A) = Σ_{x_Ã} p(x_A, x_Ã).
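A minimal sketch of these two rules on an explicit joint table (the table entries and helper names are illustrative, not from the slides):

```python
# Joint distribution over (A, B), each binary, as a dict {(a, b): prob}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def marginal(p, axis):
    """Marginalization: sum out every variable except `axis`."""
    out = {}
    for assignment, prob in p.items():
        out[assignment[axis]] = out.get(assignment[axis], 0.0) + prob
    return out

def conditional(p, axis_given, value):
    """Conditioning: p(rest | X_axis = value) = p(rest, value) / p(value)."""
    restricted = {k: v for k, v in p.items() if k[axis_given] == value}
    z = sum(restricted.values())  # p(x_B); must be > 0
    return {k: v / z for k, v in restricted.items()}

pA = marginal(joint, 0)                 # p(a): {0: 0.4, 1: 0.6}
p_given_B1 = conditional(joint, 1, 1)   # p(a, b=1) / p(b=1)
```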
21 Independence and Conditional Indep. Independence: p(x_A, x_B) = p(x_A) p(x_B). Conditional independence: p(x_A, x_B | x_C) = p(x_A | x_C) p(x_B | x_C), or equivalently p(x_A | x_B, x_C) = p(x_A | x_C), or equivalently p(x_B | x_A, x_C) = p(x_B | x_C). Notation: X_A ⊥ X_B | X_C.
22 Conditional Independence Examples. Weather tomorrow ⊥ weather yesterday | weather today. My genome ⊥ my grandparents' genome | my parents' genome. My mood ⊥ my wife's boss's mood | my wife's mood. A pixel's color ⊥ color of far away pixels | color of surrounding pixels.
23 Conditional Independence. The KEY fact is: p(x_A, x_B | x_C), a function of |A| + |B| + |C| variables, = p(x_A | x_C) p(x_B | x_C), functions of |A| + |C| and |B| + |C| variables. p factors as functions over proper subsets of variables.
24 Conditional Independence. Therefore, p(x_A, x_B | x_C) = p(x_A | x_C) p(x_B | x_C) means p(x_A, x_B, x_C) cannot assume arbitrary values for arbitrary x_A, x_B, x_C. If you vary x_A and x_B for fixed x_C, you can only realize probabilities that satisfy the above condition.
25 What is a Graphical Model? Given a set of conditional independence statements for the random vector X = (X_1, ..., X_N): {X_{A_i} ⊥ X_{B_i} | X_{C_i}}, our object of study will be the family of probability distributions p(x_1, ..., x_N) where these statements hold.
26 Questions to be addressed. Typical questions when we have a probabilistic model: estimate parameters of the model given data; compute probabilities of particular outcomes; find particularly interesting realizations (e.g. the MAP assignment). In order to manipulate the probabilistic model, we need to know the mathematical structure of p(x), and we need to find ways of computing efficiently and accurately in such structure.
27 An Exercise. What does p(x) look like? Example: p(x_1, x_2, x_3) = p(x_3 | x_2, x_1) p(x_2 | x_1) p(x_1); if X_3 ⊥ X_1 | X_2 then p(x_3 | x_2, x_1) = p(x_3 | x_2), so p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_2).
28 An Exercise. However, p(x_1) p(x_2 | x_1) p(x_3 | x_2) is also = p(x_2) p(x_1 | x_2) p(x_3 | x_2), since p(x_1) p(x_2 | x_1) = p(x_1, x_2) = p(x_2) p(x_1 | x_2).
29 Cond. Indep. and Factorization. So CI seems to generate factorization of p(x)! Is this useful for the questions we want to ask? Let's see an example of how expensive it is to compute p(x_3). Without factorization: p(x_3) = Σ_{x_1} Σ_{x_2} p(x_1, x_2, x_3), which is O(|X_1| × |X_2| × |X_3|). With factorization: p(x_3) = Σ_{x_2} p(x_3 | x_2) Σ_{x_1} p(x_2 | x_1) p(x_1), which is O(|X_1| × |X_2| + |X_2| × |X_3|).
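A quick numerical check of the two computations above (the distributions are random and the helper names mine): both orders of summation give the same p(x_3), but the factorized one touches O(K²) terms instead of O(K³).

```python
import itertools
import random

random.seed(1)
K = 4  # each variable takes values 0..K-1

def random_dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p1 = random_dist(K)                         # p(x1)
p2_1 = [random_dist(K) for _ in range(K)]   # p(x2 | x1), row-indexed by x1
p3_2 = [random_dist(K) for _ in range(K)]   # p(x3 | x2), row-indexed by x2

def joint(x1, x2, x3):
    return p1[x1] * p2_1[x1][x2] * p3_2[x2][x3]

# Naive: sum the full joint table, O(K^3) terms.
naive = [sum(joint(a, b, c) for a, b in itertools.product(range(K), repeat=2))
         for c in range(K)]

# Distributive law: push the sums inside, O(K^2) + O(K^2) terms.
px2 = [sum(p1[a] * p2_1[a][b] for a in range(K)) for b in range(K)]
fact = [sum(p3_2[b][c] * px2[b] for b in range(K)) for c in range(K)]
```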
30 Cond. Indep. and Factorization. Therefore, conditional independence seems to induce a structure in p(x) that allows us to exploit the distributive law in order to make computations more tractable. However, what about the general case p(x_1, ..., x_N)? What is the form that p will take in general, given a set of conditional independence statements? Will we be able to exploit the distributive law in this general case as well?
31 Re-Writing the Joint Distribution. A little exercise: p(x_1, ..., x_N) = p(x_N | x_{N−1}, ..., x_1) p(x_{N−1}, ..., x_1) = p(x_N | x_{N−1}, ..., x_1) p(x_{N−1} | x_{N−2}, ..., x_1) ... p(x_1) = ∏_{i=1}^N p(x_i | x_{<i}), where x_{<i} := {x_j : 1 ≤ j < i}. Now denote by π a permutation of the labels {1, ..., N}, and write x_{<π_i} for the variables preceding x_{π_i} in the permuted ordering. Above we have π = identity, i.e. π_i = i. So we can write p(x) = ∏_{i=1}^N p(x_{π_i} | x_{<π_i}).
32 Re-Writing the Joint Distribution. So any p(x) can be written as p(x) = ∏_{i=1}^N p(x_{π_i} | x_{<π_i}). Now assume that the following CI statements hold: p(x_{π_i} | x_{<π_i}) = p(x_{π_i} | x_{a_{π_i}}) ∀i, where a_{π_i} ⊆ <π_i. Then we immediately get p(x) = ∏_{i=1}^N p(x_{π_i} | x_{a_{π_i}}).
33 Types of Graphical Models. Algebra is boring, so let's draw this. Let's represent variables as circles, and let's draw an arrow from j to i if j ∈ a_i. The resulting drawing will be a directed graph. Moreover, it will be acyclic (no directed cycles). Exercise: why? Directed Graphical Models: [Figure: a directed acyclic graph on X_1, ..., X_6.]
34 Computing in Directed Graphical Models. [Figure: the DAG on X_1, ..., X_6.] p(x) = ? (Exercise)
35 Bayesian Networks. This is why the name Graphical Models. Such Graphical Models with arrows are called Bayesian Networks, Bayes Nets, Bayes Belief Nets, Belief Networks, or, more descriptively: Directed Graphical Models.
36 Bayesian Networks. A Bayesian Network associated to a DAG is a set of probability distributions where each element can be written as p(x) = ∏_i p(x_i | x_{a_i}), where random variable x_i is represented as a node in the DAG and a_i = {j : arrow j → i in the DAG}. "a" is for parents. Colloquially we say the BN is the DAG.
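A sketch of this definition on a small invented BN (structure and numbers made up for illustration): the joint is the product of one conditional per node given its parents, and summing it over all assignments gives 1, which is the normalization the later refreshing exercise asks you to prove.

```python
import itertools

# Toy BN on binary variables: x1 -> x2, x1 -> x3, (x2, x3) -> x4.
# parents[i] lists a_i; cpt[i] maps a parent assignment to p(x_i = 1 | parents).
parents = {1: (), 2: (1,), 3: (1,), 4: (2, 3)}
cpt = {
    1: {(): 0.6},
    2: {(0,): 0.2, (1,): 0.7},
    3: {(0,): 0.5, (1,): 0.1},
    4: {(0, 0): 0.01, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.99},
}

def joint(x):
    """p(x) = prod_i p(x_i | x_{a_i}) for a full assignment dict x."""
    prob = 1.0
    for i in parents:
        p_one = cpt[i][tuple(x[j] for j in parents[i])]
        prob *= p_one if x[i] == 1 else 1.0 - p_one
    return prob

total = sum(joint(dict(zip((1, 2, 3, 4), bits)))
            for bits in itertools.product((0, 1), repeat=4))
```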
37 Topological Sorts. A permutation π of the node labels which, for every node, makes each of its parents have a smaller index than that of the node is called a topological sort of the nodes in the DAG. Theorem: every DAG has at least one topological sort. Exercise: prove.
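One topological sort can be computed with Kahn's algorithm (a standard method, sketched here with illustrative names): repeatedly output a node all of whose parents have already been output. That this never gets stuck on a DAG is exactly the theorem above.

```python
from collections import deque

def topological_sort(parents):
    """One topological sort of a DAG given as {node: (parents...)}."""
    children = {v: [] for v in parents}
    indeg = {v: len(ps) for v, ps in parents.items()}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    queue = deque(v for v, d in indeg.items() if d == 0)  # parentless nodes
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:  # v is placed, so each child loses one blocker
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    if len(order) != len(parents):
        raise ValueError("directed cycle: no topological sort exists")
    return order

order = topological_sort({1: (), 2: (1,), 3: (1,), 4: (2, 3)})
```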
38 A Little Exercise Revisited. Remember? p(x_1, ..., x_N) = p(x_N | x_{N−1}, ..., x_1) p(x_{N−1}, ..., x_1) = p(x_N | x_{N−1}, ..., x_1) p(x_{N−1} | x_{N−2}, ..., x_1) ... p(x_1) = ∏_{i=1}^N p(x_i | x_{<i}), where x_{<i} := {x_j : 1 ≤ j < i}. Now denote by π a permutation of the labels {1, ..., N}; above we have π = identity. So we can write p(x) = ∏_{i=1}^N p(x_{π_i} | x_{<π_i}).
39 Exercises. How many topological sorts has a BN where no CI statements hold? How many topological sorts has a BN where all CI statements hold?
40 Graphical Models Lecture 3 - Bayesian Networks. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009.
41 Some Key Elements of Lecture 2. We can always write p(x) = ∏_i p(x_i | x_{<i}). Create a DAG with arrow j → i whenever j ∈ <i. Impose CI statements by removing some arrows. The result will be p(x) = ∏_i p(x_i | x_{a_i}). Now there will be permutations π other than the identity such that p(x) = ∏_i p(x_{π_i} | x_{a_{π_i}}), i.e. other orderings that still list each node after all of its parents.
42 Refreshing Exercise. Exercise: prove that the factorized form for the probability distribution of a Bayesian Network is indeed normalized to 1.
43 Hidden CI Statements? We have obtained a BN by introducing very convenient CI statements, namely those that shrink the factors of the expansion p(x) = ∏_{i=1}^N p(x_{π_i} | x_{<π_i}). By doing so, have we induced other CI statements? The answer is YES.
44 Head-to-Tail Nodes: Independence. [Graph: a → c → b.] Are a and b independent, i.e. does a ⊥ b hold? Check whether p(a, b) = p(a) p(b): p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(c | a) p(b | c) = p(a) p(b | a), which in general ≠ p(a) p(b).
45 Head-to-Tail Nodes: Cond. Indep. Factorization ⇒ CI? [Graph: a → c → b.] Does p(a, b, c) = p(a) p(c | a) p(b | c) imply a ⊥ b | c? Assume p(a, b, c) = p(a) p(c | a) p(b | c) holds. Then p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(c) p(a | c) p(b | c) / p(c) = p(a | c) p(b | c). Yes.
46 Head-to-Tail Nodes: Cond. Indep. CI ⇒ Factorization? [Graph: a → c → b.] Does a ⊥ b | c imply p(a, b, c) = p(a) p(c | a) p(b | c)? Assume a ⊥ b | c, i.e. p(a, b | c) = p(a | c) p(b | c). Then p(a, b, c) = p(a, b | c) p(c) = p(a | c) p(b | c) p(c), which by Bayes = p(a) p(c | a) p(b | c). Yes.
47 Tail-to-Tail Nodes: Independence. [Graph: a ← c → b.] Does a ⊥ b hold? Check whether p(a, b) = p(a) p(b): p(a, b) = Σ_c p(a, b, c) = Σ_c p(c) p(a | c) p(b | c), which in general ≠ p(a) p(b).
48 Tail-to-Tail Nodes: Cond. Indep. Factorization ⇒ CI? [Graph: a ← c → b.] Does p(a, b, c) = p(c) p(a | c) p(b | c) imply a ⊥ b | c? Assume p(a, b, c) = p(c) p(a | c) p(b | c). Then p(a, b | c) = p(a, b, c) / p(c) = p(c) p(a | c) p(b | c) / p(c) = p(a | c) p(b | c). Yes.
49 Tail-to-Tail Nodes: Cond. Indep. CI ⇒ Factorization? [Graph: a ← c → b.] Does a ⊥ b | c imply p(a, b, c) = p(c) p(a | c) p(b | c)? Assume a ⊥ b | c holds, i.e. p(a, b | c) = p(a | c) p(b | c). Then p(a, b, c) = p(a, b | c) p(c) = p(a | c) p(b | c) p(c). Yes.
50 Head-to-Head Nodes: Independence. [Graph: a → c ← b.] Does a ⊥ b hold? Check whether p(a, b) = p(a) p(b): p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(b) p(c | a, b) = p(a) p(b). So a ⊥ b does hold.
51 Head-to-Head Nodes: Cond. Indep. Factorization ⇒ CI? [Graph: a → c ← b.] Does p(a, b, c) = p(a) p(b) p(c | a, b) imply a ⊥ b | c? Assume p(a, b, c) = p(a) p(b) p(c | a, b) holds. Then p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c), which in general ≠ p(a | c) p(b | c).
52 CI ⇔ Factorization in 3-Node BNs. Therefore we conclude that conditional independence and factorization are equivalent for the atomic Bayesian Networks with only 3 nodes. Question: are they equivalent for any Bayesian Network? To answer, we need to characterize which conditional independence statements hold for an arbitrary factorization, and check whether a distribution that satisfies those statements will have such factorization.
53 Blocked Paths. We start by defining a blocked path, which is one containing: an observed TT (tail-to-tail) or HT (head-to-tail) node, or an HH (head-to-head) node which is not observed and none of whose descendants is observed. [Figure: two example graphs on nodes a, b, c, e, f illustrating blocked and unblocked paths.]
54 D-Separation. A set of nodes A is said to be d-separated from a set of nodes B by a set of nodes C if every path from A to B is blocked when C is in the conditioning set. [Figure: the DAG on X_1, ..., X_6.] Exercise: is X_1 d-separated from X_6 when the conditioning set is {X, X_5}?
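D-separation can be checked mechanically from the blocking rules on the previous slide. This is a brute-force sketch for small DAGs (names are mine; for large graphs one would use something like the Bayes-ball algorithm instead): enumerate every simple undirected path and test each intermediate node against the three cases.

```python
def d_separated(edges, a, b, given):
    """True iff node a is d-separated from node b by the node set `given`
    in the DAG whose directed edges (parent, child) are listed in `edges`."""
    children, nbrs = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)

    def descendants(v):
        seen, stack = set(), [v]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    def paths(path):  # all simple undirected paths from a to b
        if path[-1] == b:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from paths(path + [n])

    for path in paths([a]):
        blocked = False
        for prev, v, nxt in zip(path, path[1:], path[2:]):
            head_to_head = (v in children.get(prev, ())) and (v in children.get(nxt, ()))
            if head_to_head:
                # HH blocks unless v or one of its descendants is observed.
                if v not in given and not (descendants(v) & set(given)):
                    blocked = True
                    break
            elif v in given:  # observed HT or TT node blocks the path
                blocked = True
                break
        if not blocked:
            return False  # found an unblocked path
    return True

# Head-to-tail chain a -> c -> b: observing c blocks the only path.
blocked = d_separated([("a", "c"), ("c", "b")], "a", "b", {"c"})
```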
55 CI ⇔ Factorization for BNs. Theorem (Factorization ⇒ CI): if a probability distribution factorizes according to a directed acyclic graph, and if A, B and C are disjoint subsets of nodes such that A is d-separated from B by C in the graph, then the distribution satisfies A ⊥ B | C. Theorem (CI ⇒ Factorization): if a probability distribution satisfies the conditional independence statements implied by d-separation over a particular directed graph, then it also factorizes according to the graph.
56 Factorization ⇒ CI for BNs. Proof strategy: DF ⇒ d-sep. d-sep: d-separation property. DF: directed factorization property.
57 CI ⇒ Factorization for BNs. Proof strategy: d-sep ⇒ DL ⇒ DF. DL: directed local Markov property: α ⊥ nd(α) | a(α), i.e. each node is independent of its non-descendants given its parents. Thus we obtain DF ⇔ d-sep ⇔ DL.
58 Relevance of CI ⇒ Factorization for BNs. Has local, wants global: CI statements are usually what is known by the expert; the expert needs the model p(x) in order to compute things; the CI ⇒ Factorization part gives p(x) from what is known (CI statements).
59 Graphical Models Lecture 4 - Markov Random Fields. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009.
60 Changing the Class of CI Statements. We obtained BNs by assuming p(x_{π_i} | x_{<π_i}) = p(x_{π_i} | x_{a_{π_i}}) ∀i, where a_{π_i} ⊆ <π_i. We saw that in general such types of CI statements would produce others, and that in general all CI statements can be read as d-separation in a DAG. However, there are sets of CI statements which cannot be satisfied by any BN.
61 Markov Random Fields. Ideally we would like to have more freedom. There is another class of Graphical Models called Markov Random Fields (MRFs). MRFs allow for the specification of a different class of CI statements. The class of CI statements for MRFs can be easily defined by graphical means, in undirected graphs.
62 Graph Separation. Definition of graph separation: in an undirected graph G, with A, B and C disjoint subsets of nodes, if every path from A to B includes at least one node from C, then C is said to separate A from B in G. [Figure: node sets A, C, B, with C separating A from B.]
63 Markov Random Fields. Definition of Markov Random Field: an MRF is a set of probability distributions {p(x) : p(x) > 0} such that there exists an undirected graph G with disjoint subsets of nodes A, B, C in which, whenever C separates A from B in G, A ⊥ B | C holds in p(x). Colloquially we say that the MRF is such undirected graph. But in reality it is the set of all probability distributions whose conditional independence statements are precisely those given by graph separation in the graph.
64 Cliques and Maximal Cliques. Definitions concerning undirected graphs: a clique of a graph is a complete subgraph of it, i.e. a subgraph where every pair of nodes is connected by an edge. A maximal clique of a graph is a clique which is not a proper subset of another clique. [Figure: undirected graph on X_1, ..., X_4.] {X_1, X_2} form a clique, and {X_2, X_3, X_4} a maximal clique.
65 Factorization Property. Definition of factorization w.r.t. an undirected graph: a probability distribution is said to factorize with respect to a given undirected graph if it can be written as p(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c), where C is the set of maximal cliques, c is a maximal clique, x_c is the domain of x restricted to c, and ψ_c(x_c) is an arbitrary non-negative real-valued function. Z ensures Σ_x p(x) = 1.
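A sketch of this definition on a tiny pairwise MRF (the potential tables are invented for illustration; here the maximal cliques are just the two edges): unnormalized clique products, the partition function Z, and the resulting normalized p(x).

```python
import itertools

# Pairwise MRF on the chain x1 - x2 - x3 (binary variables); the maximal
# cliques are the two edges, each with a non-negative potential table.
psi = {
    (0, 1): [[1.0, 0.5], [0.5, 2.0]],   # psi(x1, x2)
    (1, 2): [[2.0, 1.0], [1.0, 2.0]],   # psi(x2, x3)
}

def unnormalized(x):
    prod = 1.0
    for (i, j), table in psi.items():
        prod *= table[x[i]][x[j]]
    return prod

# Z couples all factors: it sums the product over every joint assignment.
Z = sum(unnormalized(x) for x in itertools.product((0, 1), repeat=3))

def p(x):
    return unnormalized(x) / Z
```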
66 CI ⇔ Factorization for Positive MRFs. Theorem (Factorization ⇒ CI): if a probability distribution factorizes according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B in the graph, then the distribution satisfies A ⊥ B | C. Theorem (CI ⇒ Factorization, Hammersley-Clifford): if a strictly positive probability distribution (p(x) > 0) satisfies the conditional independence statements implied by graph separation over a particular undirected graph, then it also factorizes according to the graph.
67 Factorization ⇒ CI for MRFs. Proof: ...
68 CI ⇒ Factorization for +MRFs (H-C Thm). Möbius inversion: for C ⊆ B ⊆ A ⊆ S and F : P(S) → R, F(A) = Σ_{B : B ⊆ A} Σ_{C : C ⊆ B} (−1)^{|B \ C|} F(C). Define F = φ = log p and compute the inner sum φ_B := Σ_{C ⊆ B} (−1)^{|B \ C|} φ(x_C) for the case where B is not a clique, i.e. X_i, X_j ∈ B not connected. Then the CI statement φ(x_i, x_C, x_j) + φ(x_C) = φ(x_C, x_i) + φ(x_C, x_j) holds, and splitting the sum over the subsets C ⊆ B with i, j ∉ C into the four groups (neither i nor j, i only, j only, both), the terms pair up as Σ_{C ⊆ B; i, j ∉ C} (−1)^{|B \ C|} [φ(x_i, x_C, x_j) + φ(x_C) − φ(x_C, x_i) − φ(x_C, x_j)] = 0. So φ_B = 0 whenever B is not a clique, and log p(x) = Σ over cliques B of φ_B(x_B), i.e. p factorizes according to the graph.
69 Relevance of CI ⇒ Factorization for MRFs. Relevance is analogous to the BN case: CI statements are usually what is known by the expert; the expert needs the model p(x) in order to compute things; the CI ⇒ Factorization part gives p(x) from what is known (CI statements).
70 Comparison BNs vs. MRFs. In both types of Graphical Models: a relationship between the CI statements satisfied by a distribution and the associated simplified algebraic structure of the distribution is made in terms of graphical objects. The CI statements are related to concepts of separation between variables in the graph. The simplified algebraic structure (factorization of p(x), in this case) is related to local pieces of the graph: a child + its parents in BNs, cliques in MRFs.
71 Comparison BNs vs. MRFs. Differences: the set of probability distributions that can be represented as MRFs is different from the set that can be represented as BNs. Although both MRFs and BNs are expressed as a factorization of local functions on the graph, the MRF has a normalization constant Z = Σ_x ∏_{c ∈ C} ψ_c(x_c) that couples all factors, whereas the BN has not. The local pieces of the BN are probability distributions themselves, whereas in MRFs they need only be non-negative functions, i.e. they may not have range [0, 1] as probabilities do.
72 Comparison BNs vs. MRFs. Exercises: when are the CI statements of a BN and an MRF precisely the same? A graph has 3 nodes A, B and C. We know that A ⊥ B holds, but C ⊥ A and C ⊥ B both do not hold. Can this represent a BN? An MRF?
73 I-Maps, D-Maps and P-Maps. A graph is said to be a D-map (for "dependence map") of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph. A graph is said to be an I-map (for "independence map") of a distribution if every conditional independence statement implied by the graph is satisfied in the distribution. A graph is said to be a P-map (for "perfect map") of a distribution if it is both a D-map and an I-map for the distribution.
74 I-Maps, D-Maps and P-Maps. [Figure: Venn diagram with D and U as overlapping subsets of P.] D: set of distributions on n variables that can be represented as a perfect map by a DAG. U: set of distributions on n variables that can be represented as a perfect map by an undirected graph. P: set of all distributions on n variables.
75 Markov Blankets. The Markov blanket of a node X_i, in either a BN or an MRF, is the smallest set of nodes A such that p(x_i | x_ĩ) = p(x_i | x_A), where x_ĩ denotes all variables other than x_i. BN: parents, children and co-parents of the node. MRF: neighbors of the node.
76 Exercises. Show that the Markov blanket of a node i in a BN is given by its children, parents and co-parents. Show that the Markov blanket of a node i in an MRF is given by its neighbors.
77 Graphical Models Lecture 5 - Inference. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009.
78 Factorized Distributions. Our p(x) has a factorized form: for BNs we have p(x) = ∏_i p(x_i | x_{a_i}); for MRFs we have p(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c). Will this enable us to answer the relevant questions in practice?
79 Key Concept: Distributive Law. ab + ac (3 operations) = a(b + c) (2 operations). I.e., if the same constant factor (a here) is present in every term, we can gain by pulling it out. Consider computing the marginal p(x_1) for the MRF with factorization p(x) = (1/Z) ∏_{i=1}^{N−1} ψ(x_i, x_{i+1}). Exercise: which graph is this? p(x_1) = Σ_{x_2, ..., x_N} (1/Z) ∏_{i=1}^{N−1} ψ(x_i, x_{i+1}) = (1/Z) Σ_{x_2} ψ(x_1, x_2) Σ_{x_3} ψ(x_2, x_3) ... Σ_{x_N} ψ(x_{N−1}, x_N): O(∏_{i=1}^N |X_i|) vs. O(Σ_{i=1}^{N−1} |X_i| × |X_{i+1}|).
80 Elimination Algorithm. The distributive law (DL) is the key to efficient inference in GMs. The simplest algorithm using the DL is the elimination algorithm. This algorithm is appropriate when we have a single query, just like in the previous example of computing p(x_1) in a given MRF. This algorithm can be seen as successive elimination of nodes in the graph.
81 Elimination Algorithm. [Figure: successive elimination of nodes in the graph on X_1, ..., X_6.] Compute p(x_1) with an elimination order over x_6, x_5, x_4, x_3, x_2: p(x_1) = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} Σ_{x_5} Σ_{x_6} ∏_c ψ_c(x_c); pushing each sum as far right as possible produces intermediate factors ("messages") m_6, m_5, m_4, m_3, m_2, each over the neighbors of the eliminated node, until only a function of x_1 remains.
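The elimination idea above can be sketched generically: keep a list of factors, and to eliminate a variable, multiply together the factors that mention it and sum it out, leaving a new intermediate factor (a "message"). This is an illustrative implementation (names and numbers mine), checked here on the chain p(x1) p(x2|x1) p(x3|x2).

```python
import itertools

def eliminate(factors, order, domain=(0, 1)):
    """Sum-product elimination. A factor is (vars, table), where `table`
    maps assignments of `vars` (a tuple of variable names) to values.
    Eliminating v multiplies all factors containing v and sums v out."""
    factors = list(factors)
    for v in order:
        involved = [f for f in factors if v in f[0]]
        factors = [f for f in factors if v not in f[0]]
        new_vars = tuple(sorted({u for vs, _ in involved for u in vs} - {v}))
        table = {}
        for assign in itertools.product(domain, repeat=len(new_vars)):
            ctx = dict(zip(new_vars, assign))
            total = 0.0
            for val in domain:
                ctx[v] = val
                prod = 1.0
                for vs, t in involved:
                    prod *= t[tuple(ctx[u] for u in vs)]
                total += prod
            table[assign] = total       # the message produced by removing v
        factors.append((new_vars, table))
    return factors

# Chain p(x1) p(x2|x1) p(x3|x2); query p(x3) by eliminating 1, then 2.
factors = [
    ((1,), {(0,): 0.4, (1,): 0.6}),
    ((1, 2), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}),
    ((2, 3), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.25, (1, 1): 0.75}),
]
(remaining_vars, m3), = eliminate(factors, order=[1, 2])
p_x3 = {v: m3[(v,)] for v in (0, 1)}
```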
82 Belief Propagation. The belief propagation algorithm, also called probability propagation or the sum-product algorithm: does not repeat computations, and is specifically targeted at tree-structured graphs.
83 Belief Propagation in a Chain. [Figure: chain x_1, ..., x_{n−1}, x_n, x_{n+1}, ..., x_N with forward messages µ_α and backward messages µ_β.] p(x_n) = Σ_{x_{<n}} Σ_{x_{>n}} p(x) = (1/Z) [Σ_{x_{<n}} ∏_{i=1}^{n−1} ψ(x_i, x_{i+1})] [Σ_{x_{>n}} ∏_{i=n}^{N−1} ψ(x_i, x_{i+1})] = (1/Z) µ_α(x_n) µ_β(x_n), where µ_α(x_n) costs O(Σ_{i=1}^{n−1} |X_i| × |X_{i+1}|) and µ_β(x_n) costs O(Σ_{i=n}^{N−1} |X_i| × |X_{i+1}|).
84 Belief Propagation in a Chain. So, in order to compute p(x_n), we only need the incoming messages to n. But n is arbitrary, so in order to answer an arbitrary query we need an arbitrary pair of incoming messages; so we need all messages. To compute a message to the right (left), we need all previous messages coming from the left (right). So the protocol should be: start from the leaves, go up to n, then go back towards the leaves. Chain with N nodes: 2(N − 1) messages to be computed.
85 Computing Messages. Define trivial messages m_{0→1} := 1 and m_{N+1→N} := 1. For i = 2 to N, compute m_{i−1→i}(x_i) = Σ_{x_{i−1}} ψ(x_{i−1}, x_i) m_{i−2→i−1}(x_{i−1}). For i = N − 1 back to 1, compute m_{i+1→i}(x_i) = Σ_{x_{i+1}} ψ(x_i, x_{i+1}) m_{i+2→i+1}(x_{i+1}).
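The chain recursions can be sketched with pairwise potentials (potentials random, names mine) and checked against brute-force marginalization of the full joint:

```python
import itertools
import random

random.seed(2)
N, K = 5, 3  # chain length and number of states per variable
# psi[i][a][b] couples x_i = a and x_{i+1} = b, for i = 0..N-2.
psi = [[[random.random() + 0.1 for _ in range(K)] for _ in range(K)]
       for _ in range(N - 1)]

# Forward messages alpha[i] ~ mu_alpha(x_i), backward beta[i] ~ mu_beta(x_i).
alpha = [[1.0] * K for _ in range(N)]
for i in range(1, N):
    alpha[i] = [sum(alpha[i - 1][a] * psi[i - 1][a][b] for a in range(K))
                for b in range(K)]
beta = [[1.0] * K for _ in range(N)]
for i in range(N - 2, -1, -1):
    beta[i] = [sum(beta[i + 1][b] * psi[i][a][b] for b in range(K))
               for a in range(K)]

def marginal(i):
    """p(x_i) proportional to mu_alpha(x_i) * mu_beta(x_i)."""
    unnorm = [alpha[i][v] * beta[i][v] for v in range(K)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def brute_marginal(i):
    """Same marginal by summing the full joint, for checking."""
    scores = [0.0] * K
    for x in itertools.product(range(K), repeat=N):
        w = 1.0
        for j in range(N - 1):
            w *= psi[j][x[j]][x[j + 1]]
        scores[x[i]] += w
    z = sum(scores)
    return [s / z for s in scores]
```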
86 Belief Propagation in a Tree. The reason why things are so nice in the chain is that every node can be seen as a leaf after it has received the message from one side, i.e. after the nodes from which the message comes have been eliminated. The original leaves give us the right place to start the computations, and from there the adjacent nodes become leaves as well. However, this property also holds in a tree.
87 Belief Propagation in a Tree. Message passing equation: m_{j→i}(x_i) = Σ_{x_j} ψ(x_j, x_i) ∏_{k : k∼j, k≠i} m_{k→j}(x_j), where ∏_{k : k∼j, k≠i} m_{k→j}(x_j) := 1 whenever j is a leaf. Computing marginals: p(x_i) ∝ ∏_{j : j∼i} m_{j→i}(x_i).
88 Max-Product Algorithm. There are important queries other than computing marginals. For example, we may want to compute the most likely assignment x* = argmax_x p(x), as well as its probability p(x*). One possibility would be to compute p(x_i) for all i, then x̃_i = argmax_{x_i} p(x_i), and then simply set x̃ = (x̃_1, ..., x̃_N). What's the problem with this?
89 Exercises. Exercise: construct p(x_1, x_2) with x_i ∈ {0, 1} such that p(x̃_1, x̃_2) = 0, where x̃_i = argmax_{x_i} p(x_i).
90 Max-Product and Max-Sum Algorithms. Instead, we need to compute directly x* = argmax over (x_1, ..., x_N) of p(x_1, ..., x_N). We can use the distributive law again, since max(ab, ac) = a max(b, c) for a > 0. Exactly the same algorithm applies with max instead of Σ: the max-product algorithm. To avoid underflow we compute via log: argmax_x p(x) = argmax_x log p(x) = argmax_x Σ_s log f_s(x_s), since log is a monotonic function. We can still use the distributive law since (max, +) is also a commutative semiring, i.e. max(a + b, a + c) = a + max(b, c).
91 A Detail in Max-Sum. After computing the max-marginal for the root x_i and its maximizer x*_i = argmax_{x_i}, it's not a good idea to simply pass back the messages to the leaves and then terminate. Why? (There may be ties: several configurations can attain the maximum, and mixing maximizers obtained independently at different nodes need not yield a global maximizer.) In such cases it is safer to store the maximizing configurations of previous variables with respect to the next variables, and then simply backtrack to restore the maximizing path. In the particular case of a chain this is called the Viterbi algorithm, an instance of dynamic programming.
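A sketch of max-sum with backpointers on a chain (the Viterbi recursion; potentials random, names mine), which recovers a maximizing path by backtracking rather than by combining per-node maximizers:

```python
import itertools
import math
import random

random.seed(3)
N, K = 6, 3
psi = [[[random.random() + 0.1 for _ in range(K)] for _ in range(K)]
       for _ in range(N - 1)]

# Max-sum with backpointers: score[b] is the best log-weight of any
# partial path ending with the current node in state b.
score = [0.0] * K
back = []
for j in range(N - 1):
    new, bp = [], []
    for b in range(K):
        best_a = max(range(K), key=lambda a: score[a] + math.log(psi[j][a][b]))
        new.append(score[best_a] + math.log(psi[j][best_a][b]))
        bp.append(best_a)  # remember the maximizing predecessor state
    score, back = new, back + [bp]

# Backtrack from the best final state to restore the maximizing path.
path = [max(range(K), key=lambda b: score[b])]
for bp in reversed(back):
    path.append(bp[path[-1]])
path.reverse()

def weight(x):
    w = 1.0
    for j in range(N - 1):
        w *= psi[j][x[j]][x[j + 1]]
    return w
```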
92 Arbitrary Graphs. [Figure: the graph on X_1, ..., X_6.] The elimination algorithm is needed to compute marginals.
93 A Problem with the Elimination Algorithm. Elimination example: how to compute p(x_1 | x_6)? [Figure: the graph on X_1, ..., X_6.]
94 A Problem with the Elimination Algorithm. [Derivation: attaching an evidence potential δ(x_6, x̄_6) to x_6 and eliminating the remaining nodes in turn produces intermediate messages m_i(·) and the unnormalized p(x_1, x̄_6).]
95 A Problem with the Elimination Algorithm. [Derivation: normalizing, p(x_1 | x̄_6) = p(x_1, x̄_6) / Σ_{x_1} p(x_1, x̄_6) = m(x_1) / Σ_{x_1} m(x_1).]
96 A Problem with the Elimination Algorithm. Elimination example: what if now we want to compute a different query, conditioning again on x_6? [Figure: the graph on X_1, ..., X_6.]
97 A Problem with the Elimination Algorithm. [Derivation: the elimination steps are carried out again for the new query, producing intermediate messages m_i as before.]
98 A Problem with the Elimination Algorithm. [Derivation: placing the two runs side by side, most intermediate messages coincide.] Repeated computations!! How to avoid that?
99 The Junction Tree Algorithm. The junction tree algorithm is a generalization of the belief propagation algorithm for arbitrary graphs. In theory it can be applied to any graph, DAG or undirected. However, it will be efficient only for certain classes of graphs.
100 Chordal Graphs. Chordal graphs are also called triangulated graphs; the JT algorithm runs on chordal graphs. A chord in a cycle is an edge connecting two nodes in the cycle but which does not belong to the cycle, i.e. a shortcut in the cycle. A graph is chordal if every cycle of length greater than 3 has a chord. The first step in the algorithm is to triangulate the graph by proper insertion of edges. [Figure: two graphs on nodes A, B, C, D, E, F; one not chordal, one chordal.]
101 Chordal Graphs. What if a graph is not chordal? Add edges until it becomes chordal. This will change the graph. Exercise: why is this not a problem?
102 Triangulation Step. Triangulate the graph if it's not triangulated. [Figure: triangulation of the graph on X_1, ..., X_6.]
103 Junction Tree Construction. The junction tree algorithm: create a junction tree. [Figure: cliques of the triangulated graph on X_1, ..., X_6 arranged in a tree, with separators between adjacent cliques.]
104 Initialization. The junction tree algorithm: initialize clique potentials and separator potentials. Clique potentials Ψ_c are directly introduced (from the model factors); separator potentials Φ_s are initialized to ones. [Figure: junction tree for X_1, ..., X_6 with clique potentials Ψ and separator potentials Φ.]
105 Propagation. The junction tree algorithm: message passing. To pass a message from clique c to an adjacent clique d through their separator s: Φ*_s = Σ over the variables of c not in s of Ψ_c, then Ψ*_d = Ψ_d Φ*_s / Φ_s. Messages are passed inward to a root clique and then back out towards the leaves; after propagation the clique and separator potentials are consistent marginals. [Figure: message passing on the junction tree for X_1, ..., X_6.]
106 Graphical Models Lecture 6 - Learning. Tibério Caetano, tiberiocaetano.com. Statistical Machine Learning Group, NICTA Canberra. LLSS Canberra 2009.
107 Learning. We saw that, given p(x; θ), probabilistic inference consists of computing marginals of p(x; θ), conditional distributions, MAP configurations, etc. However, what is p(x; θ) in the first place? Finding p(x; θ) from data is called learning, or estimation, or statistical inference.
108 Maximum Likelihood Estimation. In the case of Graphical Models, we've seen that p(x; θ) = (1/Z) ∏_{s ∈ S} f_s(x_s; θ_s), where {x_s} are subsets of random variables and {f_s} are non-negative real-valued functions. We can re-write that as p(x; θ) = e^(Σ_{s ∈ S} log f_s(x_s; θ_s) − g(θ)), where g(θ) = log Σ_x e^(Σ_{s ∈ S} log f_s(x_s; θ_s)).
109 Data: IID Assumption. We observe data X = {x^1, ..., x^m}. We assume every x^i is a sample from the same unknown distribution p(x; θ*) (the "identical" assumption). We assume x^i and x^j, i ≠ j, to be drawn independently from p(x; θ*) (the "independence" assumption). This is the iid setting: independently and identically distributed.
110 Maximum Likelihood Estimation. The joint probability of observing the data X is thus
p(X; θ) = ∏_i p(x^i; θ) = ∏_i exp( Σ_s log f_s(x^i_s; θ_s) − g(θ) )
Seen as a function of θ, this is the likelihood function. The negative log-likelihood is
−log p(X; θ) = m g(θ) − Σ_{i=1}^m Σ_s log f_s(x^i_s; θ_s)
111 Maximum Likelihood Estimation. Maximum likelihood estimation consists of finding the θ that maximizes the likelihood function, or equivalently minimizes the negative log-likelihood:
θ* = argmin_θ [ m g(θ) − Σ_{i=1}^m Σ_s log f_s(x^i_s; θ_s) ] =: l(θ; X)
In order to minimize it we must have ∇_θ l(θ; X) = 0, and therefore each ∂_{θ_s} l(θ; X) = 0, for all s. What happens for BNs and for MRFs?
112 ML Estimation in BNs. For BNs, g(θ) = 0 (Exercise) and Σ_{x_s} f_s(x_s; θ_s) = 1, so setting the gradient of the Lagrangian to zero,
∂_{θ_s} [ m g(θ) − Σ_{i=1}^m Σ_s log f_s(x^i_s; θ_s) + Σ_s λ_s ( Σ_{x_s} f_s(x_s; θ_s) − 1 ) ] = 0
gives, for each s,
Σ_{i=1}^m ∂_{θ_s} log f_s(x^i_s; θ_s) = λ_s Σ_{x_s} ∂_{θ_s} f_s(x_s; θ_s)
Therefore we have |S| pairs of equations, where every pair can be solved independently for θ_s and λ_s. So the ML estimation problem decouples into local ML estimation problems involving only the variables in each individual set s.
113 ML Estimation in MRFs. For MRFs, g(θ) ≠ 0, so
∂_{θ_s} [ m g(θ) − Σ_{i=1}^m Σ_s log f_s(x^i_s; θ_s) ] = 0
m ∂_{θ_s} g(θ) − Σ_{i=1}^m ∂_{θ_s} log f_s(x^i_s; θ_s) = 0, for all s
Therefore we have |S| equations that cannot be solved independently, since g(θ) involves all θ_s. This may give rise to a complex non-linear system of equations. So learning in MRFs is more difficult than in BNs.
114 Exponential Families. Consider the parameterized family of distributions
p(x; θ) = exp( Φ(x)·θ − g(θ) )
Such a family of distributions is called an Exponential Family. Φ(x) is the sufficient statistics, θ is the natural parameter, and g(θ) = log Σ_x exp( Φ(x)·θ ) is the log-partition function. This is the form of several distributions of interest, like the Gaussian, binomial, multinomial, Poisson, gamma, Rayleigh, beta, etc.
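To make the definition concrete, a short sketch (illustrative, not from the slides) writes the Bernoulli distribution in natural form with sufficient statistic Φ(x) = x and log-partition g(θ) = log(1 + e^θ), and checks numerically that g′(θ) equals the mean of the sufficient statistic, the property used on the following slides.

```python
import numpy as np

# Bernoulli in exponential-family form: p(x; theta) = exp(x*theta - g(theta)),
# x in {0, 1}, with g(theta) = log(1 + e^theta)
theta = 0.7
g = np.log(1.0 + np.exp(theta))
p = np.exp(np.array([0.0, 1.0]) * theta - g)   # [p(x=0), p(x=1)]

# Central finite difference for g'(theta)
eps = 1e-6
g_prime = (np.log(1 + np.exp(theta + eps))
           - np.log(1 + np.exp(theta - eps))) / (2 * eps)

# E[Phi(x)] = 0*p(0) + 1*p(1) = p(x=1)
mean = p[1]
```

The match g′(θ) = E[Φ(x)] is not a coincidence of the Bernoulli: it holds for every exponential family, which is what makes the moment-matching conditions on the next slides work.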
115 Exponential Families. If we assume that our p(x; θ) is an exponential family, the learning problem becomes particularly convenient because it becomes convex (Why?). Recall the form of p(x; θ) for a graphical model:
p(x; θ) = exp( Σ_{s∈S} log f_s(x_s; θ_s) − g(θ) )
For it to be an exponential family we need
Σ_{s∈S} log f_s(x_s; θ_s) = Φ(x)·θ = Σ_s Φ_s(x_s)·θ_s
116 Exponential Families for MRFs. For MRFs the negative log-likelihood now becomes
−log p(X; θ) = m g(θ) − Σ_{i=1}^m Σ_s Φ_s(x^i_s)·θ_s = m g(θ) − m Σ_s μ_s(x_s)·θ_s
where we defined μ_s(x_s) := ( Σ_{i=1}^m Φ_s(x^i_s) ) / m. Taking the gradient and setting it to zero we have
∂_{θ_s} [ m g(θ) − m Σ_s μ_s(x_s)·θ_s ] = 0
∂_{θ_s} g(θ) = μ_s(x_s)
but ∂_{θ_s} g(θ) = E_{p(x;θ)}[Φ_s(x_s)] (why?), so
E_{p(x;θ)}[Φ_s(x_s)] = μ_s(x_s)
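The stationarity condition above suggests a simple learning procedure: ascend the log-likelihood, whose gradient with respect to θ_s is μ_s − E_{p(x;θ)}[Φ_s]. The sketch below (an illustration under my own parameterization, not the lecture's code: indicator sufficient statistics, one natural parameter per clique configuration) fits a small two-clique binary MRF this way and checks that the model moments converge to the empirical ones.

```python
import numpy as np
from itertools import product

states = list(product([0, 1], repeat=3))
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 3))   # synthetic observations (x1, x2, x3)

# One natural parameter per configuration of each clique {x1,x2} and {x2,x3}
theta = {"12": np.zeros((2, 2)), "23": np.zeros((2, 2))}
mu = {"12": np.zeros((2, 2)), "23": np.zeros((2, 2))}
for x1, x2, x3 in data:                    # empirical moments mu_s
    mu["12"][x1, x2] += 1.0 / len(data)
    mu["23"][x2, x3] += 1.0 / len(data)

def model_moments(theta):
    # Brute-force E_{p(x;theta)}[Phi_s] by enumerating all 8 states
    logp = np.array([theta["12"][a, b] + theta["23"][b, c] for a, b, c in states])
    p = np.exp(logp - logp.max())
    p /= p.sum()
    m = {"12": np.zeros((2, 2)), "23": np.zeros((2, 2))}
    for pk, (a, b, c) in zip(p, states):
        m["12"][a, b] += pk
        m["23"][b, c] += pk
    return m

for _ in range(3000):                      # gradient ascent on the log-likelihood
    m = model_moments(theta)
    for s in theta:
        theta[s] += 0.5 * (mu[s] - m[s])   # gradient is mu_s - E_p[Phi_s]
```

Because the log-likelihood is concave in θ, plain gradient ascent suffices; at convergence the fitted moments match the sample averages, exactly the condition derived on this slide.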
117 Exponential Families for MRFs. In other words: the ML estimate θ* must be such that the expected value of the sufficient statistics under p(x; θ*), for every clique, matches the sample average for that clique. Why is the problem convex? (Exercise)
118 Exponential Families for BNs. For BNs the negative log-likelihood now becomes
−log p(X; θ) = − Σ_{i=1}^m Σ_s Φ_s(x^i_s)·θ_s = −m Σ_s μ_s(x_s)·θ_s
where we also defined μ_s(x_s) := ( Σ_{i=1}^m Φ_s(x^i_s) ) / m. Constructing the Lagrangian corresponding to the constraints Σ_{x_s} exp( Φ_s(x_s)·θ_s ) = 1 and setting the gradient equal to zero, we have
m μ_s(x_s) = λ_s E_{p(x_s; θ_s)}[Φ_s(x_s)], for all s
which can be solved for θ_s and λ_s using Σ_{x_s} exp( Φ_s(x_s)·θ_s ) = 1
119 Example: Discrete BNs. Multinomial random variables; tabular representation. For (x_v, x_pa(v)) define φ_v := (x_v, x_pa(v)). One parameter θ_{v,φ_v} is associated to each (x_v, x_pa(v)), i.e. θ_{v,φ_v} := p(x_v | x_pa(v); θ_v). Note that there are no constraints beyond the normalization constraint. The joint is p(x_V | θ) = ∏_v θ_{v,φ_v}. The likelihood is then log p(X; θ) = log ∏_n p(x_{V,n} | θ). Continuing...
120 Example: Discrete BNs.
log p(X; θ) = log ∏_n p(x_{V,n} | θ)
= log ∏_n ∏_{x_V} p(x_V; θ)^{δ(x_V, x_{V,n})}
= Σ_n Σ_{x_V} δ(x_V, x_{V,n}) log p(x_V; θ)
= Σ_{x_V} m(x_V) log p(x_V; θ)
= Σ_{x_V} m(x_V) log ∏_v θ_{v,φ_v}
= Σ_{x_V} m(x_V) Σ_v log θ_{v,φ_v}
= Σ_v Σ_{φ_v} Σ_{x_{V\φ_v}} m(x_V) log θ_{v,φ_v}
= Σ_v Σ_{φ_v} m(φ_v) log θ_{v,φ_v}
where m(·) denotes the count of a configuration in the data.
121 Example: Discrete BNs. The Lagrangian is
L(θ, λ) = Σ_v Σ_{φ_v} m(φ_v) log θ_{v,φ_v} + Σ_v λ_v ( 1 − Σ_{x_v} θ_{v,φ_v} )
and
∂_{θ_{v,φ_v}} L(θ, λ) = m(φ_v)/θ_{v,φ_v} − λ_v = 0  ⇒  λ_v θ_{v,φ_v} = m(φ_v)
but since Σ_{x_v} θ_{v,φ_v} = 1, we get λ_v = m(x_pa(v)), so
θ_{v,φ_v} = m(φ_v) / m(x_pa(v))
Matches intuition.
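The closed-form estimate above says: count how often each family configuration occurs and divide by the count of the parent configuration. A minimal sketch (illustrative, not from the slides; data and names are made up) for a two-node network x_pa → x_v:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data for a two-node network x_pa -> x_v, both binary;
# columns are (x_pa, x_v)
data = rng.integers(0, 2, size=(500, 2))

# theta_hat(x_v | x_pa) = m(x_v, x_pa) / m(x_pa): family counts divided by
# parent counts
counts = np.zeros((2, 2))             # counts[x_pa, x_v] = m(x_v, x_pa)
for x_pa, x_v in data:
    counts[x_pa, x_v] += 1
theta_hat = counts / counts.sum(axis=1, keepdims=True)
```

Each row of `theta_hat` is a conditional distribution p(x_v | x_pa) and sums to one by construction, which is exactly the normalization constraint enforced by the Lagrange multiplier λ_v.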
122 Learning the Potentials: ML Estimation. First: how to learn the potential functions when we have observed data for all variables in the model? Second: how to learn the potential functions when there are latent (hidden) variables, i.e. we do not observe data for them?
123 Learning the Potentials: Completely Observed GMs. Consider the chain X1 − X2 − X3 with
p(x_V) = (1/Z) Ψ(x1, x2) Ψ(x2, x3)
Assume we observe N instances of this model. For IID sampling the sufficient statistics are the empirical marginals ~p(x1, x2) and ~p(x2, x3). How do we estimate Ψ(x1, x2) and Ψ(x2, x3) from the sufficient statistics?
124 Learning the Potentials: Completely Observed GMs. Let's make a guess:
Ψ̂_ML(x1, x2) = ~p(x1, x2),   Ψ̂_ML(x2, x3) = ~p(x2, x3) / ~p(x2)
so that
p̂_ML(x) = ~p(x1, x2) ~p(x2, x3) / ~p(x2)
We can verify that our guess is good because its marginals match the empirical ones:
Σ_{x3} p̂_ML(x) = ~p(x1, x2),   Σ_{x1} p̂_ML(x) = ~p(x2, x3)
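The verification above can be reproduced numerically. The following sketch (illustrative, with made-up data; not the lecture's code) builds the empirical pair marginals for the chain X1 − X2 − X3, forms p̂_ML = ~p(x1,x2) ~p(x2,x3) / ~p(x2), and checks that its pair marginals match the empirical ones exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.integers(0, 2, size=(300, 3))   # samples of (x1, x2, x3)

# Empirical marginals: the sufficient statistics for the chain X1 - X2 - X3
p12 = np.zeros((2, 2))
p23 = np.zeros((2, 2))
for x1, x2, x3 in data:
    p12[x1, x2] += 1 / len(data)
    p23[x2, x3] += 1 / len(data)
p2 = p12.sum(axis=0)                       # empirical marginal of x2

# ML guess: Psi(x1,x2) = ~p(x1,x2), Psi(x2,x3) = ~p(x2,x3)/~p(x2), so Z = 1
p_hat = p12[:, :, None] * (p23 / p2[:, None])[None, :, :]
```

Note that p̂_ML is already normalized (Z = 1), which is one of the conveniences of the decomposable case.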
125 Learning the Potentials: Completely Observed GMs. The general recipe is: for every maximal clique C, set the clique potential to its empirical marginal; for every intersection S between maximal cliques, associate an empirical marginal with that intersection and divide it into the potential of ONE of the cliques that form the intersection. This will give ML estimates for decomposable graphical models.
126 Decomposable Graphs. A cycle is chordless if vertices that are not adjacent in the cycle are not neighbors, that is, any {v_a, v_b} with non-consecutive a, b is not in E: there is no chord (or shortcut) for the cycle. A graph is chordal (also called triangulated) if it contains no chordless cycles of length greater than 3. A graph is complete if E contains all pairs of distinct elements of V. A graph G = (V, E) is decomposable if either (1) G is complete, or (2) we can express V as V = A ∪ B ∪ C where (a) A, B and C are disjoint, (b) A and C are non-empty, (c) B is complete, (d) B separates A and C in G, and (e) A ∪ B and B ∪ C are decomposable. A path is a sequence v_1, ..., v_n of distinct vertices for which all {v_i, v_{i+1}} ∈ E. A tree is an undirected graph for which every pair of vertices is connected by precisely one path. A clique of a graph G is a subset of vertices that are completely connected. A maximal clique of G is a clique for which every superset of vertices of G is not a clique. A clique tree for a graph G = (V, E) is a tree T = (V_T, E_T) where V_T is a set of cliques of G that contains all maximal cliques of G. For decomposable graphs, the derivative of the log-partition function g(θ) decouples over the cliques (Exercise), which makes MRF learning easy.
127 Learning the Potentials: Completely Observed GMs. Non-decomposable graphical models: an iterative procedure must be used, Iterative Proportional Fitting (IPF):
Ψ^{(t+1)}_C(x_C) = Ψ^{(t)}_C(x_C) ~p(x_C) / p^{(t)}(x_C)
where it can be shown that
p^{(t+1)}(x_C) = ~p(x_C)
128 EM Algorithm: Hidden Variables. How to estimate the potentials when there are unobserved variables? Answer: the EM algorithm.
129 EM Algorithm: Hidden Variables. Denote the observed variables by X and the hidden variables by Z. If we knew Z, the problem would reduce to maximizing the complete log-likelihood:
l_c(θ; x, z) = log p(x, z | θ)
130 EM Algorithm: Hidden Variables. However, we don't observe Z, so the probability of the data X is
l(θ; x) = log p(x | θ) = log Σ_z p(x, z | θ)
which is the incomplete log-likelihood. This is the quantity we really want to maximize. Note that now the logarithm cannot transform the product into a sum, since it is blocked by the sum over Z, and the optimization does not decouple.
131 EM Algorithm: Hidden Variables. The basic idea of the EM algorithm is: given that Z is not observed, we may try to optimize an averaged version (over all possible values of Z) of the complete log-likelihood. We do that through an averaging distribution q:
⟨l_c(θ; x, z)⟩_q = Σ_z q(z | x, θ) log p(x, z | θ)
and obtain the expected complete log-likelihood. The hope then is that maximizing this should at least improve the current estimate for the parameters, so that iteration would eventually maximize the log-likelihood.
132 EM Algorithm: Hidden Variables. In order to present the algorithm, we first note that
l(θ; x) = log p(x | θ) = log Σ_z p(x, z | θ) = log Σ_z q(z | x) p(x, z | θ) / q(z | x) ≥ Σ_z q(z | x) log [ p(x, z | θ) / q(z | x) ] =: L(q, θ)
where the inequality follows from Jensen's inequality and L is the auxiliary function. The EM algorithm is coordinate ascent on L.
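The lower bound L(q, θ) ≤ l(θ; x), with equality at q(z) = p(z | x, θ), can be checked numerically. The sketch below (an illustration with a made-up joint, not from the slides) evaluates both sides for a random q and for the posterior.

```python
import numpy as np

rng = np.random.default_rng(6)
# A tiny joint p(x, z | theta) for one observed x and a hidden z with 3 states;
# the remaining probability mass belongs to other values of x
p_xz = rng.random(3)
p_xz /= (p_xz.sum() + 1.0)

l_incomplete = np.log(p_xz.sum())   # l(theta; x) = log sum_z p(x, z | theta)

def L(q):
    # Auxiliary function L(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_rand = rng.random(3)
q_rand /= q_rand.sum()              # an arbitrary averaging distribution
q_post = p_xz / p_xz.sum()          # the posterior q(z) = p(z | x, theta)
```

For the random q, L(q, θ) falls strictly below l(θ; x); for the posterior, the bound is tight, which is the fact exploited by the E-step two slides later.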
133 EM Algorithm: Hidden Variables. The EM algorithm:
E-step: q^{(t+1)} = argmax_q L(q, θ^{(t)})
M-step: θ^{(t+1)} = argmax_θ L(q^{(t+1)}, θ)
134 EM Algorithm: Hidden Variables. Note that the M-step is equivalent to maximizing the expected complete log-likelihood:
L(q, θ) = Σ_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
= Σ_z q(z | x) log p(x, z | θ) − Σ_z q(z | x) log q(z | x)
= ⟨l_c(θ; x, z)⟩_q − Σ_z q(z | x) log q(z | x)
because the second term does not depend on θ.
135 EM Algorithm. The general solution to the E-step turns out to be
q^{(t+1)}(z | x) = p(z | x, θ^{(t)})
because
L( p(z | x, θ^{(t)}), θ^{(t)} ) = Σ_z p(z | x, θ^{(t)}) log [ p(x, z | θ^{(t)}) / p(z | x, θ^{(t)}) ] = Σ_z p(z | x, θ^{(t)}) log p(x | θ^{(t)}) = log p(x | θ^{(t)}) = l(θ^{(t)}; x)
i.e. this choice makes the bound tight at θ^{(t)}.
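Putting the E-step and M-step together, a classic worked example (illustrative, not from the slides; the data and names are made up) is a mixture of two biased coins, where the coin identity Z is hidden. The E-step computes the posterior responsibilities q(z | x) = p(z | x, θ^t); the M-step re-estimates the mixing weights and biases from the weighted data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Data: each of 200 rows is 20 flips of one of two coins; Z (which coin) is hidden
true_p = np.array([0.2, 0.8])
z = rng.integers(0, 2, size=200)
heads = rng.binomial(20, np.take(true_p, z))   # number of heads per row

pi = np.array([0.5, 0.5])   # initial mixing weights
p = np.array([0.3, 0.6])    # initial coin biases (must break the symmetry)
for _ in range(100):
    # E-step: q(z|x) = p(z | x, theta^t), computed in log space for stability
    logw = (np.log(pi)[None, :]
            + heads[:, None] * np.log(p)[None, :]
            + (20 - heads)[:, None] * np.log(1 - p)[None, :])
    q = np.exp(logw - logw.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete log-likelihood
    pi = q.mean(axis=0)
    p = (q * heads[:, None]).sum(axis=0) / (q.sum(axis=0) * 20)

p = np.sort(p)   # fix label order for comparison with true_p
```

Each iteration cannot decrease the incomplete log-likelihood, and the recovered biases land close to the true values 0.2 and 0.8 despite never observing which coin produced each row.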
Introduction to Probability for Graphical Models
Introduction to Probability for Grahical Models CSC 4 Kaustav Kundu Thursday January 4, 06 *Most slides based on Kevin Swersky s slides, Inmar Givoni s slides, Danny Tarlow s slides, Jaser Snoek s slides,
More informationBayesian Networks Practice
Bayesian Networks Practice Part 2 2016-03-17 Byoung-Hee Kim, Seong-Ho Son Biointelligence Lab, CSE, Seoul National University Agenda Probabilistic Inference in Bayesian networks Probability basics D-searation
More informationBayesian networks. 鮑興國 Ph.D. National Taiwan University of Science and Technology
Bayesian networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Introduction to Bayesian networks Bayesian networks: definition, d-searation, equivalence of networks Eamles: causal
More informationBayesian Networks Practice
ayesian Networks Practice Part 2 2016-03-17 young-hee Kim Seong-Ho Son iointelligence ab CSE Seoul National University Agenda Probabilistic Inference in ayesian networks Probability basics D-searation
More informationPart I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS
Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a
More informationp L yi z n m x N n xi
y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen
More informationApproximating min-max k-clustering
Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost
More informationMachine Learning Lecture 14
Many slides adapted from B. Schiele, S. Roth, Z. Gharahmani Machine Learning Lecture 14 Undirected Graphical Models & Inference 23.06.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de
More informationOutline. Markov Chains and Markov Models. Outline. Markov Chains. Markov Chains Definitions Huizhen Yu
and Markov Models Huizhen Yu janey.yu@cs.helsinki.fi Det. Comuter Science, Univ. of Helsinki Some Proerties of Probabilistic Models, Sring, 200 Huizhen Yu (U.H.) and Markov Models Jan. 2 / 32 Huizhen Yu
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More informationChapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population
Chater 7 and s Selecting a Samle Point Estimation Introduction to s of Proerties of Point Estimators Other Methods Introduction An element is the entity on which data are collected. A oulation is a collection
More informationCSC 412 (Lecture 4): Undirected Graphical Models
CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:
More informationProbabilistic Graphical Models (I)
Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random
More informationUniform Law on the Unit Sphere of a Banach Space
Uniform Law on the Unit Shere of a Banach Sace by Bernard Beauzamy Société de Calcul Mathématique SA Faubourg Saint Honoré 75008 Paris France Setember 008 Abstract We investigate the construction of a
More informationParticle-Based Approximate Inference on Graphical Model
article-based Approimate Inference on Graphical Model Reference: robabilistic Graphical Model Ch. 2 Koller & Friedman CMU, 0-708, Fall 2009 robabilistic Graphical Models Lectures 8,9 Eric ing attern Recognition
More informationStatics and dynamics: some elementary concepts
1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationCS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016
CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationLecture 4 October 18th
Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations
More informationEmpirical Bayesian EM-based Motion Segmentation
Emirical Bayesian EM-based Motion Segmentation Nuno Vasconcelos Andrew Liman MIT Media Laboratory 0 Ames St, E5-0M, Cambridge, MA 09 fnuno,lig@media.mit.edu Abstract A recent trend in motion-based segmentation
More informationLecture 7: Linear Classification Methods
Homeork Homeork Lecture 7: Linear lassification Methods Final rojects? Grous oics Proosal eek 5 Lecture is oster session, Jacobs Hall Lobby, snacks Final reort 5 June. What is linear classification? lassification
More informationFeedback-error control
Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller
More informationConvex Analysis and Economic Theory Winter 2018
Division of the Humanities and Social Sciences Ec 181 KC Border Conve Analysis and Economic Theory Winter 2018 Toic 16: Fenchel conjugates 16.1 Conjugate functions Recall from Proosition 14.1.1 that is
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationMATH 2710: NOTES FOR ANALYSIS
MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite
More informationCSE 599d - Quantum Computing When Quantum Computers Fall Apart
CSE 599d - Quantum Comuting When Quantum Comuters Fall Aart Dave Bacon Deartment of Comuter Science & Engineering, University of Washington In this lecture we are going to begin discussing what haens to
More informationECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4
ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian
More informationTopic 7: Using identity types
Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules
More informationLearning Sequence Motif Models Using Gibbs Sampling
Learning Sequence Motif Models Using Gibbs Samling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Sring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides excluding third-arty material are licensed under
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationBiointelligence Lab School of Computer Sci. & Eng. Seoul National University
Artificial Intelligence Chater 19 easoning with Uncertain Information Biointelligence Lab School of Comuter Sci. & Eng. Seoul National University Outline l eview of Probability Theory l Probabilistic Inference
More informationRepresentation of undirected GM. Kayhan Batmanghelich
Representation of undirected GM Kayhan Batmanghelich Review Review: Directed Graphical Model Represent distribution of the form ny p(x 1,,X n = p(x i (X i i=1 Factorizes in terms of local conditional probabilities
More informationNamed Entity Recognition using Maximum Entropy Model SEEM5680
Named Entity Recognition using Maximum Entroy Model SEEM5680 Named Entity Recognition System Named Entity Recognition (NER): Identifying certain hrases/word sequences in a free text. Generally it involves
More informationPart III. for energy minimization
ICCV 2007 tutorial Part III Message-assing algorithms for energy minimization Vladimir Kolmogorov University College London Message assing ( E ( (,, Iteratively ass messages between nodes... Message udate
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationBiitlli Biointelligence Laboratory Lb School of Computer Science and Engineering. Seoul National University
Monte Carlo Samling Chater 4 2009 Course on Probabilistic Grahical Models Artificial Neural Networs, Studies in Artificial i lintelligence and Cognitive i Process Biitlli Biointelligence Laboratory Lb
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 9 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9)
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture Notes Fall 2009 November, 2009 Byoung-Ta Zhang School of Computer Science and Engineering & Cognitive Science, Brain Science, and Bioinformatics Seoul National University
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More information1 Entropy 1. 3 Extensivity 4. 5 Convexity 5
Contents CONEX FUNCIONS AND HERMODYNAMIC POENIALS 1 Entroy 1 2 Energy Reresentation 2 3 Etensivity 4 4 Fundamental Equations 4 5 Conveity 5 6 Legendre transforms 6 7 Reservoirs and Legendre transforms
More informationOn parameter estimation in deformable models
Downloaded from orbitdtudk on: Dec 7, 07 On arameter estimation in deformable models Fisker, Rune; Carstensen, Jens Michael Published in: Proceedings of the 4th International Conference on Pattern Recognition
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning
More informationOutline. Spring It Introduction Representation. Markov Random Field. Conclusion. Conditional Independence Inference: Variable elimination
Probabilistic Graphical Models COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Outline It Introduction ti Representation Bayesian network Conditional Independence Inference:
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationB8.1 Martingales Through Measure Theory. Concept of independence
B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.
More informationMODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL
Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 16 Undirected Graphs Undirected Separation Inferring Marginals & Conditionals Moralization Junction Trees Triangulation Undirected Graphs Separation
More informationChapter 1: PROBABILITY BASICS
Charles Boncelet, obability, Statistics, and Random Signals," Oxford University ess, 0. ISBN: 978-0-9-0005-0 Chater : PROBABILITY BASICS Sections. What Is obability?. Exeriments, Outcomes, and Events.
More informationBayesian Networks for Modeling and Managing Risks of Natural Hazards
[National Telford Institute and Scottish Informatics and Comuter Science Alliance, Glasgow University, Set 8, 200 ] Bayesian Networks for Modeling and Managing Risks of Natural Hazards Daniel Straub Engineering
More informationNotes on Markov Networks
Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationCHAPTER 5 STATISTICAL INFERENCE. 1.0 Hypothesis Testing. 2.0 Decision Errors. 3.0 How a Hypothesis is Tested. 4.0 Test for Goodness of Fit
Chater 5 Statistical Inference 69 CHAPTER 5 STATISTICAL INFERENCE.0 Hyothesis Testing.0 Decision Errors 3.0 How a Hyothesis is Tested 4.0 Test for Goodness of Fit 5.0 Inferences about Two Means It ain't
More informationMAP Estimation Algorithms in Computer Vision - Part II
MAP Estimation Algorithms in Comuter Vision - Part II M. Pawan Kumar, University of Oford Pushmeet Kohli, Microsoft Research Eamle: Image Segmentation E() = c i i + c ij i (1- j ) i i,j E: {0,1} n R 0
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationElements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley
Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the
More informationChapter 10. Supplemental Text Material
Chater 1. Sulemental Tet Material S1-1. The Covariance Matri of the Regression Coefficients In Section 1-3 of the tetbook, we show that the least squares estimator of β in the linear regression model y=
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationBERNOULLI TRIALS and RELATED PROBABILITY DISTRIBUTIONS
BERNOULLI TRIALS and RELATED PROBABILITY DISTRIBUTIONS A BERNOULLI TRIALS Consider tossing a coin several times It is generally agreed that the following aly here ) Each time the coin is tossed there are
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 10 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due this Wednesday (Nov 4) in class Project milestones due next Monday (Nov 9) About half
More information1 Undirected Graphical Models. 2 Markov Random Fields (MRFs)
Machine Learning (ML, F16) Lecture#07 (Thursday Nov. 3rd) Lecturer: Byron Boots Undirected Graphical Models 1 Undirected Graphical Models In the previous lecture, we discussed directed graphical models.
More informationRadial Basis Function Networks: Algorithms
Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.
More informationStatistics II Logistic Regression. So far... Two-way repeated measures ANOVA: an example. RM-ANOVA example: the data after log transform
Statistics II Logistic Regression Çağrı Çöltekin Exam date & time: June 21, 10:00 13:00 (The same day/time lanned at the beginning of the semester) University of Groningen, Det of Information Science May
More informationResearch Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI **
Iranian Journal of Science & Technology, Transaction A, Vol 3, No A3 Printed in The Islamic Reublic of Iran, 26 Shiraz University Research Note REGRESSION ANALYSIS IN MARKOV HAIN * A Y ALAMUTI AND M R
More informationSystem Reliability Estimation and Confidence Regions from Subsystem and Full System Tests
009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract
More informationMonte Carlo Studies. Monte Carlo Studies. Sampling Distribution
Monte Carlo Studies Do not let yourself be intimidated by the material in this lecture This lecture involves more theory but is meant to imrove your understanding of: Samling distributions and tests of
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationAn Introduction to Bayesian Machine Learning
1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems
More informationLecture 6: Graphical Models
Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:
More informationAn Introduction to Probabilistic Graphical Models
An Introduction to Probabilistic Graphical Models Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Reference material D. Koller
More informationAI*IA 2003 Fusion of Multiple Pattern Classifiers PART III
AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers
More informationGraphical models. Sunita Sarawagi IIT Bombay
1 Graphical models Sunita Sarawagi IIT Bombay http://www.cse.iitb.ac.in/~sunita 2 Probabilistic modeling Given: several variables: x 1,... x n, n is large. Task: build a joint distribution function Pr(x
More informationRANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES
RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete
More informationLecture 21: Quantum Communication
CS 880: Quantum Information Processing 0/6/00 Lecture : Quantum Communication Instructor: Dieter van Melkebeek Scribe: Mark Wellons Last lecture, we introduced the EPR airs which we will use in this lecture
More informationarxiv: v1 [physics.data-an] 26 Oct 2012
Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch
A Social Welfare Optimal Sequential Allocation Procedure
A Social Welfare Optimal Sequential Allocation Procedure Thomas Kalinowski Universität Rostock, Germany Nina Narodytska and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simple sequential
CS Lecture 4. Markov Random Fields
CS 6347 Lecture 4 Markov Random Fields Recap Announcements First homework is available on elearning Reminder: Office hours Tuesday from 10am-11am Last Time Bayesian networks Today Markov random fields
Generating Swerling Random Sequences
Generating Swerling Random Sequences Mark A. Richards January 1, 1999 Revised December 15, 8 1 BACKGROUND The Swerling models are models of the probability function (pdf) and time correlation properties of
Short course A vademecum of statistical pattern recognition techniques with applications to image and video analysis. Agenda
Short course A vademecum of statistical pattern recognition techniques with applications to image and video analysis Lecture 6 The Kalman filter. Particle filters Massimo Piccardi University of Technology,
4. Score normalization technical details We now discuss the technical details of the score normalization method.
SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament. We begin by giving an overview of the changes to scoring and a non-technical description of the scoring rules
Based on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
Chapter 7: Special Distributions
This chapter first presents some important distributions, and then develops the large-sample distribution theory which is crucial in estimation and statistical inference. Discrete distributions: The Bernoulli
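The snippet above breaks off at the Bernoulli distribution, the simplest of the discrete distributions it introduces. As a minimal illustration (not from the chapter itself), the Bernoulli pmf and its standard moments E[X] = p and Var[X] = p(1 - p):

```python
# Illustrative sketch: the Bernoulli(p) distribution over {0, 1}.

def bernoulli_pmf(k, p):
    # Pr(X = k) for X ~ Bernoulli(p); zero off the support {0, 1}.
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p

def bernoulli_mean_var(p):
    # E[X] = p and Var[X] = p * (1 - p).
    return p, p * (1.0 - p)

mean, var = bernoulli_mean_var(0.3)
```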
4 : Exact Inference: Variable Elimination
10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. Xing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference
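The variable-elimination lecture referenced above is about summing variables out of a factorized joint. The smallest possible instance of that idea, with made-up numbers for illustration: eliminating A from a two-node chain A -> B to get the marginal P(B) = sum_a P(a) P(b | a).

```python
# Minimal variable-elimination step on a two-variable chain A -> B.
# The probability tables below are invented for the example.

p_a = {0: 0.6, 1: 0.4}                  # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},     # P(B | A=0)
               1: {0: 0.2, 1: 0.8}}     # P(B | A=1)

def eliminate_a(p_a, p_b_given_a):
    # Sum out A: P(b) = sum over a of P(a) * P(b | a).
    p_b = {}
    for b in (0, 1):
        p_b[b] = sum(p_a[a] * p_b_given_a[a][b] for a in p_a)
    return p_b

p_b = eliminate_a(p_a, p_b_given_a)
```

Here P(B=0) = 0.6·0.9 + 0.4·0.2 = 0.62, and full variable elimination just repeats this sum-product step once per eliminated variable.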
Inference and Representation
Inference and Representation David Sontag New York University Lecture 5, Sept. 30, 2014 David Sontag (NYU) Inference and Representation Lecture 5, Sept. 30, 2014 1 / 16 Today s lecture 1 Running-time of
Random variables. Lecture 5 - Discrete Distributions. Discrete Probability distributions. Example - Discrete probability model
Random Variables Random variables Lecture 5 - Discrete Distributions Sta102 / BME102 Colin Rundel September 8, 2014 A random variable is a numeric quantity whose value depends on the outcome of a random event
Statistical Learning
Statistical Learning Lecture 5: Bayesian Networks and Graphical Models Mário A. T. Figueiredo Instituto Superior Técnico & Instituto de Telecomunicações University of Lisbon, Portugal May 2018 Mário A.
Directed Graphical Models
CS 2750: Machine Learning Directed Graphical Models Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 Graphical Models If no assumption of independence is made, must estimate an exponential
Lecture 8: Bayesian Networks
Lecture 8: Bayesian Networks. Topics: Bayesian Networks; Inference in Bayesian Networks. COMP-652 and ECSE 608, Lecture 8 - January 31, 2017. Bayes nets: P(E=1) = 0.005, P(E=0) = 0.995; P(B=1) = 0.01, P(B=0) = 0.99; E=0 E=1
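The prior tables quoted in the entry above, P(E=1) = 0.005 and P(B=1) = 0.01, are the earthquake and burglary roots of the classic alarm network. Root nodes with no shared parents are marginally independent, so their joint is just the product of the priors, a sketch (not from the lecture itself):

```python
# Sketch: joint over the two root nodes of the burglary/earthquake
# network, using the prior tables quoted in the snippet.

p_e, p_b = 0.005, 0.01  # P(Earthquake=1), P(Burglary=1)

def joint_roots(e, b):
    # Roots are marginally independent, so P(E=e, B=b) = P(E=e) * P(B=b).
    pe = p_e if e else 1 - p_e
    pb = p_b if b else 1 - p_b
    return pe * pb

# Probability that at least one of earthquake/burglary occurs:
p_either = 1 - joint_roots(0, 0)
```

This gives p_either = 1 - 0.995 * 0.99 = 0.01495, slightly less than the naive sum 0.005 + 0.01 because both can occur together.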
CMSC 425: Lecture 4 Geometry and Geometric Programming
CMSC 425: Lecture 4 Geometry and Geometric Programming. Geometry for Game Programming and Graphics: For the next few lectures, we will discuss some of the basic elements of geometry. There are many areas
Network Configuration Control Via Connectivity Graph Processes
Network Configuration Control Via Connectivity Graph Processes Abubakr Muhammad Department of Electrical and Systems Engineering University of Pennsylvania Philadelphia, PA 90 abubakr@seas.upenn.edu Magnus
Universal Finite Memory Coding of Binary Sequences
Department of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv
Robust hamiltonicity of random directed graphs
Robust hamiltonicity of random directed graphs Asaf Ferber Rajko Nenadov Andreas Noever asaf.ferber@inf.ethz.ch rnenadov@inf.ethz.ch anoever@inf.ethz.ch Ueli Peter upeter@inf.ethz.ch Nemanja Škorić nskoric@inf.ethz.ch
Numerical Linear Algebra
Numerical Linear Algebra Numerous applications in statistics, particularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a_ij, where i indexes the rows and
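The entry above flags the fitting of linear models as the main statistical application of numerical linear algebra. A minimal illustration of that connection (not from the notes themselves): simple least squares y ≈ a + b·x, solved via the 2×2 normal equations in pure Python.

```python
# Sketch: least-squares fit of y = a + b*x via the normal equations,
# solved directly by Cramer's rule for the 2x2 system.

def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[n, sx], [sx, sxx]] @ [a, b] = [sy, sxy]
    det = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
```

In practice one solves the normal equations (or, better, a QR factorization) with a linear-algebra library rather than by hand; the 2×2 case just makes the structure visible.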
1 Gambler's Ruin Problem
Copyright c 2017 by Karl Sigman 1 Gambler's Ruin Problem Let N ≥ 2 be an integer and let 1 ≤ i ≤ N − 1. Consider a gambler who starts with an initial fortune of $i and then on each successive gamble either wins
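For the fair-coin version of the gambler's ruin set up above (win or lose $1 with probability 1/2 each), the probability of reaching fortune N before going broke, starting from i, has the well-known closed form i/N. A small numerical sketch checking this via the boundary-value recursion p[k] = (p[k-1] + p[k+1]) / 2 with p[0] = 0 and p[N] = 1:

```python
# Sketch: fair gambler's ruin. The win probability p[k] satisfies
# p[k] = 0.5 * (p[k-1] + p[k+1]) with p[0] = 0, p[N] = 1, solved here
# by Gauss-Seidel relaxation purely to illustrate; the exact answer is k/N.

def ruin_win_prob(i, N, sweeps=5000):
    p = [0.0] * (N + 1)
    p[N] = 1.0
    for _ in range(sweeps):
        for k in range(1, N):
            p[k] = 0.5 * (p[k - 1] + p[k + 1])
    return p[i]
```

The relaxation converges to the linear profile p[k] = k/N, matching the closed form; a direct linear solve would give the same answer without iteration.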
4.1 Notation and probability review
Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall