Alternative Parameterizations of Markov Networks. Sargur Srihari

Sargur Srihari, srihari@cedar.buffalo.edu

Topics. Three types of parameterization: 1. Gibbs parameterization; 2. Factor graphs; 3. Log-linear models, with energy functions and with features (Ising and Boltzmann models). Also: overparameterization, the canonical parameterization, and eliminating redundancy.

Gibbs Parameterization. A distribution P_Φ is a Gibbs distribution parameterized by a set of factors Φ = {φ_1(D_1), …, φ_K(D_K)} if it is defined as P_Φ(X_1, …, X_n) = (1/Z) P̃(X_1, …, X_n), where P̃(X_1, …, X_n) = ∏_i φ_i(D_i) is an unnormalized measure and Z = Σ_{X_1,…,X_n} P̃(X_1, …, X_n). The D_i are sets of random variables. A factor φ is a function from Val(D) to R, where Val(D) is the set of values that D can take; the factor returns a potential. The factors do not necessarily represent the marginal distributions P(D_i) of the variables in their scopes.

Gibbs Parameters with Pairwise Factors. Example with four people: Alice (A) and Debbie (D) study together, Alice and Bob (B) are friends, Bob and Charles (C) study together, and Debbie and Charles agree/disagree. Note that factors are non-negative. P(a, b, c, d) = (1/Z) φ_1(a, b) φ_2(b, c) φ_3(c, d) φ_4(d, a), where Z = Σ_{a,b,c,d} φ_1(a, b) φ_2(b, c) φ_3(c, d) φ_4(d, a). (A small code sketch follows.)
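Below is a minimal sketch, not from the slides, of how such a pairwise Gibbs parameterization can be evaluated; the factor tables and their values are made up purely for illustration.

```python
import itertools

# Hypothetical pairwise factor tables over binary variables A, B, C, D
# (the values are illustrative, not taken from the slides).
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi_1(a, b)
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi_2(b, c)
phi3 = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}  # phi_3(c, d)
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi_4(d, a)

def unnormalized(a, b, c, d):
    """Unnormalized measure: the product of the four pairwise potentials."""
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# Partition function Z: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(a, b, c, d)
        for a, b, c, d in itertools.product([0, 1], repeat=4))

def P(a, b, c, d):
    """Normalized Gibbs distribution P(a, b, c, d)."""
    return unnormalized(a, b, c, d) / Z

print(Z, P(0, 0, 0, 0))
```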

Two Gibbs parameterizations, same MN structure. Consider a Gibbs distribution P over a fully connected graph with four binary variables a, b, c, d.
1. Clique potential parameterization: the entire graph is one clique, P(a, b, c, d) = (1/Z) φ(a, b, c, d), where Z = Σ_{a,b,c,d} φ(a, b, c, d). The number of parameters is exponential in the number of variables: 2^n − 1.
2. Pairwise parameterization: one factor for each pair of variables X, Y ∈ χ, P(a, b, c, d) = (1/Z) φ_1(a, b) φ_2(b, c) φ_3(c, d) φ_4(d, a) φ_5(a, c) φ_6(b, d), where Z is the corresponding sum over a, b, c, d. This gives a quadratic number of parameters: 4 × C(n, 2) for binary variables.
The independencies are the same in both, but there is a significant difference in the number of parameters. (A parameter-counting sketch follows.)
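A small illustrative comparison of the two parameter counts; the helper functions below are my own naming, and the counts follow the slide's accounting (2^n − 1 for one joint potential, k^2 entries per pairwise factor).

```python
from math import comb

def clique_potential_params(n, k=2):
    """One potential over all n k-ary variables: k**n entries, minus 1 for normalization."""
    return k ** n - 1

def pairwise_params(n, k=2):
    """One k-by-k potential per pair of variables: C(n, 2) factors of k**2 entries each."""
    return comb(n, 2) * k ** 2

for n in (4, 8, 16):
    print(n, clique_potential_params(n), pairwise_params(n))
```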

Shortcoming of Gibbs Parameterization. The network structure doesn't reveal the parameterization: one cannot tell whether the factors are over maximal cliques or over subsets of them. An example follows.

Factor Graphs. The Markov network structure does not reveal all the structure in a Gibbs parameterization: one cannot tell from the graph whether the factors involve maximal cliques or their subsets. A factor graph makes the parameterization explicit.

Factor Graph. An undirected graph with two types of nodes: variable nodes (drawn as ovals) and factor nodes (drawn as squares). It contains edges only between variable nodes and factor nodes. Example: a Markov network over three variables x_1, x_2, x_3. With a single clique potential, P(x_1, x_2, x_3) = (1/Z) f(x_1, x_2, x_3), where Z = Σ_{x_1,x_2,x_3} f(x_1, x_2, x_3). With two clique potentials, P(x_1, x_2, x_3) = (1/Z) f_a(x_1, x_2, x_3) f_b(x_2, x_3), where Z = Σ_{x_1,x_2,x_3} f_a(x_1, x_2, x_3) f_b(x_2, x_3).

Parameterization of Factor Graphs. The MN is parameterized by a set of factors. Each factor node V_φ is associated with exactly one factor φ, whose scope is the set of variables that are neighbors of V_φ. A distribution P factorizes over a factor graph F if it can be represented as a set of factors in this form.

Multiple factor graphs for the same graph. Factor graphs are specific about the factorization. For a fully connected undirected graph over x_1, x_2, x_3, the joint distribution can be written in two forms: in general form, p(x) = f(x_1, x_2, x_3), or as a specific factorization, p(x) = f_a(x_1, x_2) f_b(x_1, x_3) f_c(x_2, x_3).

Factor graphs for the same network. (a) A single factor over all of A, B, C. (b) Three pairwise factors. (c) The induced Markov network, which for both is a clique over A, B, C. Factor graphs (a) and (b) imply the same Markov network (c); the factor graphs make the difference in factorization explicit. (A small sketch of this follows.)
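A minimal sketch, using my own plain-Python representation rather than any particular library, showing that both factorizations induce the same Markov network:

```python
import itertools

# A factor graph represented simply as a list of factor scopes
# (an illustrative data structure, not a library API).
fg_single = [("A", "B", "C")]                       # (a) one factor over all variables
fg_pairwise = [("A", "B"), ("B", "C"), ("A", "C")]  # (b) three pairwise factors

def induced_markov_network(scopes):
    """The induced MN connects every pair of variables that co-occur in some scope."""
    edges = set()
    for scope in scopes:
        for u, v in itertools.combinations(sorted(scope), 2):
            edges.add((u, v))
    return edges

# Both factor graphs induce the same Markov network: a clique over A, B, C.
print(induced_markov_network(fg_single) == induced_markov_network(fg_pairwise))  # True
```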

Social Network Example. A factor graph with pairwise and higher-order factors.

Factor graph properties. Factor graphs are bipartite, since (1) there are two types of nodes and (2) all links go between nodes of opposite type. They can therefore be drawn as two rows of nodes, with variables on top and factor nodes at the bottom. Other, more intuitive layouts are used when factor graphs are derived from directed or undirected graphs.

Deriving factor graphs from graphical models: from an undirected graph (MN) and from a directed graph (BN).

Conversion of MN to Factor Graph. Steps in converting a distribution expressed as an undirected graph: 1. Create variable nodes corresponding to the nodes in the original graph. 2. Create factor nodes for the maximal cliques x_s. 3. Set the factors f_s(x_s) equal to the clique potentials. Several different factor graphs are possible for the same distribution: a single clique potential ψ(x_1, x_2, x_3) can be represented with one factor f(x_1, x_2, x_3) = ψ(x_1, x_2, x_3), or with factors such that f_a(x_1, x_2, x_3) f_b(x_2, x_3) = ψ(x_1, x_2, x_3).

Conversion of BN to Factor Graph. Steps: 1. Create variable nodes corresponding to the nodes of the BN. 2. Create factor nodes corresponding to the conditional distributions. Multiple factor graphs are possible for the same graph. For the factorization p(x_1) p(x_2) p(x_3 | x_1, x_2), one can use a single factor f(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2), or three factors f_a(x_1) = p(x_1), f_b(x_2) = p(x_2), f_c(x_1, x_2, x_3) = p(x_3 | x_1, x_2). (A small code sketch follows.)
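A minimal sketch of the two factorizations above; the CPD numbers are made up for illustration, and the check simply confirms that both factor sets define the same joint.

```python
import itertools

# Hypothetical CPDs for binary x1, x2, x3 (values are illustrative only).
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3_given = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
              (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.3, 1: 0.7}}

# Option 1: a single factor over (x1, x2, x3).
f_single = {(x1, x2, x3): p_x1[x1] * p_x2[x2] * p_x3_given[(x1, x2)][x3]
            for x1, x2, x3 in itertools.product([0, 1], repeat=3)}

# Option 2: three factors, one per conditional distribution.
f_a, f_b = p_x1, p_x2
f_c = {(x1, x2, x3): p_x3_given[(x1, x2)][x3]
       for x1, x2, x3 in itertools.product([0, 1], repeat=3)}

# Both factorizations define the same joint distribution.
assert all(abs(f_single[k] - f_a[k[0]] * f_b[k[1]] * f_c[k]) < 1e-12 for k in f_single)
```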

Tree to Factor Graph. The conversion of a directed or undirected tree to a factor graph is again a tree: there are no loops, and only one path between any two nodes. In the case of a directed polytree, conversion to an undirected graph introduces loops due to moralization, but conversion to a factor graph again results in a tree. (Figure: a directed polytree, its conversion to an undirected graph with loops, and the corresponding factor graphs.)

Removal of local cycles. Local cycles in a directed graph, arising from links connecting parents, can be removed on conversion to a factor graph by defining a suitable factor function, e.g. f(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2), giving a factor graph with tree structure.

Log-linear Models. As in the Gibbs parameterization, factor graphs still encode each factor as a table over its scope, e.g. P(x_1, x_2, x_3) = (1/Z) f_a(x_1, x_2, x_3) f_b(x_2, x_3), where Z = Σ_{x_1,x_2,x_3} f_a(x_1, x_2, x_3) f_b(x_2, x_3). Factors can also exhibit context-specific structure (as in BNs); such patterns are more readily seen in log-space.

Conversion to log-space. A factor φ(D) is a function from Val(D) to R, where Val(D) is the set of values that D can take. Rewrite the factor as φ(D) = exp(−ε(D)), where ε(D) is the energy function, defined as ε(D) = −ln φ(D). (Note that if ln a = −b then a = exp(−b); with a = φ(D) = exp(−ε(D)), we get b = −ln a = ε(D).) Thus the factor value φ(D), an unnormalized probability, is the negative exponential of the energy value ε(D).

Energy: terminology from physics. Higher-energy states have lower probability. D is a set of atoms, its values are states, and ε(D) is its energy, a scalar. The probability φ(D) of a physical state depends inversely on its energy: φ(D) = exp(−ε(D)) = 1 / exp(ε(D)). "Log-linear" is the term used in statistics for models of the logarithms of cell frequencies; here ε(D) = −ln φ(D).

Probability in logarithmic representation. P_Φ(X_1, …, X_n) = (1/Z) P̃(X_1, …, X_n), where P̃(X_1, …, X_n) = ∏_{i=1}^m φ_i(D_i) is an unnormalized measure and Z = Σ_{X_1,…,X_n} P̃(X_1, …, X_n). Since φ_i(D_i) = exp(−ε_i(D_i)), where ε_i(D_i) = −ln φ_i(D_i), we have P(X_1, …, X_n) ∝ exp(−Σ_{i=1}^m ε_i(D_i)). Taking the logarithm requires that φ(D) be positive (note that probability is positive). The log-linear parameters ε(D) can take any value along the real line, not just non-negative values as with factors. Any Markov network parameterized using positive factors can be converted into a log-linear representation.

Partition Function in Physics. P(X_1, …, X_n) = (1/Z) exp(−Σ_{i=1}^m ε_i(D_i)), where Z = Σ_{X_1,…,X_n} exp(−Σ_{i=1}^m ε_i(D_i)). Z describes the statistical properties of a system in thermodynamic equilibrium; it is a function of thermodynamic state variables such as temperature and volume, and it encodes how probabilities are partitioned among the different microstates based on their individual energies. Z stands for Zustandssumme, "sum over states".

Log-linear Parameterization. To convert the factors of a Gibbs parameterization to log-linear form, take the negative natural logarithm of each potential; this requires the potentials to be positive. For example, a potential of 30 becomes an energy of −ln 30 ≈ −3.4. Thus ε(D) = −ln φ(D) and φ(D) = exp(−ε(D)): energy is negative log (unnormalized) probability. (A small code sketch follows.)
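A minimal sketch, with made-up potential values, of converting potential tables to energy tables and checking that a product of potentials equals the exponential of the negated sum of energies:

```python
import math
import itertools

# Hypothetical pairwise potential over binary A, B (values illustrative only).
phi = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}

# Energy function: eps(d) = -ln phi(d); requires phi(d) > 0.
eps = {d: -math.log(v) for d, v in phi.items()}
print(eps[(0, 0)])  # approximately -3.4

# Converting back: phi(d) = exp(-eps(d)).
assert all(abs(math.exp(-eps[d]) - phi[d]) < 1e-9 for d in phi)

# With several factors, a product of potentials becomes exp of a negated sum of energies.
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}
eps2 = {d: -math.log(v) for d, v in phi2.items()}
for a, b, c in itertools.product([0, 1], repeat=3):
    product_of_potentials = phi[(a, b)] * phi2[(b, c)]
    via_energies = math.exp(-(eps[(a, b)] + eps2[(b, c)]))
    assert abs(product_of_potentials - via_energies) < 1e-9
```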

Example. P(X_1, …, X_n) ∝ exp(−Σ_{i=1}^m ε_i(D_i)); for the four-variable network, P(A, B, C, D) ∝ exp[−ε_1(A, B) − ε_2(B, C) − ε_3(C, D) − ε_4(D, A)]. The partition function Z is the sum of the right-hand side over all values of A, B, C, D. A product of factors becomes a sum inside the exponential, e.g. φ(A, B) = exp(−ε(A, B)).

From Potentials to Energy Functions. In the misconception example the factors are edge potentials, with a^0 = has misconception, a^1 = no misconception, and b^0 = has misconception, b^1 = no misconception. Taking the negative natural logarithm of each potential converts the factors into energy functions, giving P(A, B, C, D) ∝ exp(−Σ_{i=1,2,3,4} ε_i(D_i)).

The log-linear form makes structure in the potentials apparent. In the example, ε_2(B, C) and ε_4(D, A) take on values that are constant multiples (−4.61) of 1 and 0 for agree/disagree. Such structure is captured by the general framework of features, defined next.

Features in a Markov Network. If D is a subset of variables, a feature f(D) is a function from Val(D) to R (a real value); a feature is a factor without the non-negativity requirement. Given a set of k features {f_1(D_1), …, f_k(D_k)}, P(X_1, …, X_n) = (1/Z) exp(−Σ_{i=1}^k w_i f_i(D_i)), where w_i f_i(D_i) is an entry in the energy-function table, since P(X_1, …, X_n) = (1/Z) exp(−Σ_{i=1}^m ε_i(D_i)). There can be several features over the same scope (k ≠ m), so this can represent a standard set of table potentials. (An evaluation sketch follows.)
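A minimal sketch, with made-up features and weights, of evaluating a log-linear model over binary variables using the energy-style sign convention above:

```python
import math
import itertools

# Features over pairs of binary variables; the scopes, features, and weights
# below are illustrative only.  f1 fires when A == B, f2 fires when B == C.
features = [
    (("A", "B"), lambda a, b: 1.0 if a == b else 0.0),
    (("B", "C"), lambda b, c: 1.0 if b == c else 0.0),
]
weights = [-2.0, 0.5]
variables = ["A", "B", "C"]

def unnormalized(assignment):
    """exp(-sum_i w_i f_i(D_i)) for one full assignment (a dict: variable -> value)."""
    total = 0.0
    for (scope, f), w in zip(features, weights):
        total += w * f(*(assignment[v] for v in scope))
    return math.exp(-total)

assignments = [dict(zip(variables, values))
               for values in itertools.product([0, 1], repeat=len(variables))]
Z = sum(unnormalized(x) for x in assignments)

for x in assignments:
    print(x, unnormalized(x) / Z)
```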

Example of Features and Empirical Feature Counts. Pairwise Markov network over binary variables A, B, C with three clusters: C_1 = {A, B}, C_2 = {B, C}, C_3 = {C, A}. Log-linear model with features f_00(x, y) = 1 if x = 0, y = 0 and 0 otherwise, and f_11(x, y) = 1 if x = 1, y = 1 and 0 otherwise, for (x, y) an instance of a cluster C_i. Given three data instances of (A, B, C): (0, 0, 0), (0, 1, 0), (1, 0, 0), the empirical feature counts are E_P̂[f_00] = (3 + 1 + 1) / 3 = 5/3 and E_P̂[f_11] = (0 + 0 + 0) / 3 = 0. (A verification sketch follows.)
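A short check, in plain Python, that reproduces the two empirical feature counts above:

```python
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]   # instances of (A, B, C)
clusters = [(0, 1), (1, 2), (2, 0)]        # index pairs for {A,B}, {B,C}, {C,A}

def f00(x, y):
    return 1 if (x, y) == (0, 0) else 0

def f11(x, y):
    return 1 if (x, y) == (1, 1) else 0

def empirical_count(feature):
    """Average over data instances of the feature summed over all clusters."""
    total = sum(feature(inst[i], inst[j]) for inst in data for i, j in clusters)
    return total / len(data)

print(empirical_count(f00))  # 5/3, about 1.667
print(empirical_count(f11))  # 0.0
```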

Definition of a log-linear model with features. A distribution P is a log-linear model over a Markov network H if there is a set of k features F = {f_1(D_1), …, f_k(D_k)}, where each D_i is a complete subgraph of H, and a set of weights w_1, …, w_k, such that P(X_1, …, X_n) = (1/Z) exp(−Σ_{i=1}^k w_i f_i(D_i)). Note that k is the number of features, not the number of subgraphs.

Determining Parameters: BN vs. MN. Classification problem: features x = {x_1, …, x_d} and a two-class label y.
BN, naive Bayes (generative), with CPD parameters. Joint probability: P(y, x) = P(y) ∏_{i=1}^d P(x_i | y); the factors are P(y) and P(x_i | y). From the joint we get the required conditional P(y | x). If each x_i is discrete with k values, the parameters are estimated independently, d(k−1) of them (and d(k−1)(C−1) for a C-class problem), but the independence assumption is false; for sparse data the generative model is better.
MN, logistic regression (discriminative), with parameters w_i. Log-linear factors with scopes D_i = {x_i, y} and features f_i(D_i) = x_i I(y): φ_i(x_i, y) = exp{w_i x_i I{y = 1}} and φ_0(y) = exp{w_0 I{y = 1}}, where I has value 1 when y = 1 and 0 otherwise. Unnormalized conditional: P̃(y = 1 | x) = exp{w_0 + Σ_{i=1}^d w_i x_i}, and P̃(y = 0 | x) = exp{0} = 1, so the normalizer Z has a term 1. Normalized: P(y = 1 | x) = sigmoid(w_0 + Σ_{i=1}^d w_i x_i), where sigmoid(z) = e^z / (1 + e^z) = 1 / (1 + e^{−z}): logistic regression. The d parameters are jointly optimized; estimation is high-dimensional but correlations are accounted for, and much richer features can be used (edges, image patches sharing the same pixels). For C classes, p(y_c | x) = exp(w_c^T x) / Σ_{j=1}^C exp(w_j^T x), with C × d parameters. (A small code sketch follows.)
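A minimal sketch of the discriminative side, with made-up weights, showing the two-class sigmoid and the C-class softmax form:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y1_given_x(x, w, w0):
    """Two-class logistic regression: P(y=1 | x) = sigmoid(w0 + sum_i w_i x_i)."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

def class_probs(x, W):
    """C-class case: p(y=c | x) proportional to exp(w_c . x)."""
    scores = [math.exp(sum(wci * xi for wci, xi in zip(wc, x))) for wc in W]
    total = sum(scores)
    return [s / total for s in scores]

# Illustrative (made-up) weights and feature vector.
x = [1.0, 0.0, 2.0]
print(p_y1_given_x(x, w=[0.5, -1.0, 0.25], w0=-0.2))
print(class_probs(x, W=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.4, 0.0, -0.3]]))
```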

Example of binary features. P(X_1, …, X_n; θ) = (1/Z(θ)) exp(Σ_{i=1}^k θ_i f_i(D_i)). For the diamond network over A, B, C, D with all four variables binary-valued, the features corresponding to the network are sixteen indicator functions, four for each assignment of the variables in the four pairwise clusters, e.g. f_{a^0,b^0}(a, b) = I{a = a^0} I{b = b^0}. With this representation θ_{a^0 b^0} = ln φ_1(a^0, b^0). For example, with Val(A) = {a^0, a^1} and Val(B) = {b^0, b^1}, the table factor φ_1(A, B), with entries φ_{a^0,b^0}, φ_{a^0,b^1}, φ_{a^1,b^0}, φ_{a^1,b^1}, is captured by the four features f_{a^0,b^0}, f_{a^0,b^1}, f_{a^1,b^0}, f_{a^1,b^1}, where f_{a^0,b^0} = 1 if a = a^0, b = b^0 and 0 otherwise, etc.

Compaction using Features. Consider D = {A_1, A_2}, where each variable has l values a^1, …, a^l. Suppose we prefer situations in which A_1 = A_2 but have no preference among the others; the energy function is ε(A_1, A_2) = −3 if A_1 = A_2 (lower energy, hence higher probability) and 0 otherwise. Represented as a full factor, the clique potential (or energy table) would need l^2 values. As a log-linear function it needs a single term: f(A_1, A_2) is an indicator function for the event A_1 = A_2, and the energy ε is −3 times this feature. (A small sketch follows.)
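A minimal sketch of the same point, with an illustrative domain size, comparing the full l × l energy table with the single-feature representation (assuming the −3 preference value used above):

```python
import math

l = 10  # each of A1, A2 takes l values (illustrative size)

# Full table representation: l*l energy values.
full_table = {(a1, a2): (-3.0 if a1 == a2 else 0.0)
              for a1 in range(l) for a2 in range(l)}
print(len(full_table))  # l**2 = 100 entries

# Feature-based representation: one indicator feature and one weight.
def f_equal(a1, a2):
    return 1.0 if a1 == a2 else 0.0

w = -3.0  # the energy is w times the feature

# Both representations define the same energies, hence the same potentials.
assert all(abs(full_table[(a1, a2)] - w * f_equal(a1, a2)) < 1e-12
           for a1 in range(l) for a2 in range(l))
print(math.exp(-full_table[(0, 0)]))  # potential when A1 == A2: exp(3), about 20.1
```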

Indicator Feature. A type of feature of particular interest: it takes on value 1 for some values y ∈ Val(D) and 0 for the others. Example: D = {A_1, A_2}, where each variable has l values a^1, …, a^l, and f(A_1, A_2) is an indicator function for the event A_1 = A_2 (e.g., two superpixels having the same greyscale).

Use of Features in Text Analysis. Features are compact for variables with large domains. Example (named-entity recognition): Y are the target variables (entity labels) and X are the known variables (the words of the text). For "Mrs. Green spoke today in New York. Green chairs the finance committee." the labels are B-PER I-PER OTH OTH OTH B-LOC I-LOC / B-PER OTH OTH OTH OTH, where B-PER = begin person, I-PER = within person, OTH = not an entity, and similarly for locations. Factors for word t: Φ^1_t(Y_t, Y_{t+1}) and Φ^2_t(Y_t, X_1, …, X_T). There can be hundreds of thousands of features: the word itself (capitalized, in a list of common names), aggregate features of the sequence (e.g., more than two sports-related words), and features that make Y_t depend on several words in a window around t. A skip-chain CRF adds connections between adjacent words and between multiple occurrences of the same word. (A small feature-function sketch follows.)
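An illustrative sketch of the kind of token-level feature functions such a model might use; the names, the feature set, and the word list are my own, not from the slides:

```python
# Hypothetical token-level feature functions for a named-entity CRF.
COMMON_NAMES = {"green", "smith"}

def features_for_token(words, t, label, prev_label):
    """Return a dict of binary features for position t with a candidate label pair."""
    word = words[t]
    return {
        "word=" + word.lower(): 1.0,
        "is_capitalized": 1.0 if word[:1].isupper() else 0.0,
        "in_common_names": 1.0 if word.lower() in COMMON_NAMES else 0.0,
        "prev_word=" + (words[t - 1].lower() if t > 0 else "<s>"): 1.0,
        "label_pair=" + prev_label + "->" + label: 1.0,   # transition feature
    }

words = "Mrs. Green spoke today in New York".split()
print(features_for_token(words, 1, "I-PER", "B-PER"))
```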

Examples of feature parameterizations of MNs: text analysis, the Ising model, the Boltzmann model, and metric MRFs.

Summary of three MN parameterizations (each finer-grained than the previous):
1. Markov network: a product of potentials on cliques, P_Φ(X_1, …, X_n) = (1/Z) P̃(X_1, …, X_n), where P̃(X_1, …, X_n) = ∏_{i=1}^m φ_i(D_i) is an unnormalized measure and Z = Σ_{X_1,…,X_n} P̃(X_1, …, X_n). Good for discussing independence queries.
2. Factor graphs: a product of factors describing the Gibbs distribution. Useful for inference.
3. Features: a product of features, P(X_1, …, X_n) = (1/Z) exp(−Σ_{i=1}^k w_i f_i(D_i)), which can describe all entries in each factor. Useful both for hand-coded models and for learning.