Markov Networks


Markov Networks
- Like Bayes Nets: a graphical model that describes a joint probability distribution using tables (AKA potentials)
- Nodes are random variables
- Labels are outcomes over the variables

Markov Networks
- Unlike Bayes Nets:
  - Undirected graph
  - No requirement that the tables be conditional distributions
  - Each table is defined over a complete subgraph (clique)

More on Potentials
- Values are typically nonnegative
- Values need not be probabilities
- Generally, one table is associated with each clique

[Figure: a network over nodes A, B, C, D, E with two example potential tables]

Potential over (A, C):      Potential over (C, D):
        c    ~c                     c    ~c
   a    3     1                d    1     2
  ~a    2     5               ~d    3     4

Calculating the Full Joint Probability
The full joint probability is the normalized product of the potentials:

P(x) = (1/Z) Π_c Φ_c(x_c)

where Z is the normalization constant, each Φ_c is one potential, and x_c is the part of the full assignment (feature vector) x that falls in clique c.

Calculating the Normalization Constant Z

Z = Σ_x Π_c Φ_c(x_c)

(the sum, over all joint assignments x, of the product of the potentials)

Using the Potentials
Get the probability of A=1, B=0, C=1, D=0, E=0:
- Only the potential tables are needed
- Multiply the entries consistent with this setting: Φ(a, c) = 3 and Φ(c, ~d) = 3, so the unnormalized probability is 3 x 3 = 9
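
To make this concrete, here is a minimal Python sketch (mine, not the slides') that scores an assignment by multiplying the two example potential entries and normalizes by Z, computed by brute-force enumeration:

    from itertools import product

    # Example potentials from the slides: one over (A, C), one over (C, D).
    # Keys are (first_var, second_var) with 1 = true, 0 = false.
    phi_AC = {(1, 1): 3, (1, 0): 1, (0, 1): 2, (0, 0): 5}
    phi_CD = {(1, 1): 1, (0, 1): 2, (1, 0): 3, (0, 0): 4}

    def score(a, b, c, d, e):
        # Unnormalized probability: product of the potential entries
        # consistent with the assignment (B and E appear in no table here).
        return phi_AC[(a, c)] * phi_CD[(c, d)]

    # Partition function: sum of scores over all 2^5 joint assignments.
    Z = sum(score(*x) for x in product([0, 1], repeat=5))

    # P(A=1, B=0, C=1, D=0, E=0): score is 3 * 3 = 9, as on the slide.
    print(score(1, 0, 1, 0, 0) / Z)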

Markov Nets versus Bayes Nets
Disadvantages of Markov Nets:
- Computing the full joint density is computationally intensive (it requires the partition function Z); with a Bayes Net it is easy
- Markov Net parameters are hard to learn in a straightforward way

Markov Nets versus Bayes Nets
Advantages of Markov Nets:
- Easy to talk about conditional independence: no need to think about d-separation; conditional independence holds if all paths between two nodes are cut by the conditioning set
- The Markov blanket is easy to calculate: just the node's neighbors
- No need to select an arbitrary, potentially misleading direction for a dependency in cases where the direction is unclear

Constructing Markov Nets
- Just as in Bayes Nets, the decision of which tables to represent is based on background knowledge
- Although the model can be built from the data, it is often easier for people to leverage domain knowledge
- Although the model is undirected, it can still be helpful to think of directionality when constructing the Markov Net

Scale Invariance
Multiplying every entry of one potential by a constant will not affect the joint probability distribution. For example, changing the (C, D) table from entries (1, 2, 3, 4) to (10, 20, 30, 40) leaves P unchanged, because the normalization constant Z scales by the same factor.
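
A quick numeric check of scale invariance, as a self-contained sketch (mine): scale the (C, D) table by 10 and verify that the normalized distribution over (A, C, D) is unchanged.

    from itertools import product

    phi_AC = {(1, 1): 3, (1, 0): 1, (0, 1): 2, (0, 0): 5}
    phi_CD = {(1, 1): 1, (0, 1): 2, (1, 0): 3, (0, 0): 4}

    def dist(phi_cd):
        # Normalized joint over (A, C, D) for a given (C, D) table.
        scores = {x: phi_AC[(x[0], x[1])] * phi_cd[(x[1], x[2])]
                  for x in product([0, 1], repeat=3)}
        Z = sum(scores.values())
        return {x: s / Z for x, s in scores.items()}

    d1, d2 = dist(phi_CD), dist({k: 10 * v for k, v in phi_CD.items()})
    assert all(abs(d1[x] - d2[x]) < 1e-12 for x in d1)  # identical distributions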

Inference
Almost the same as in Bayes Nets (this is somewhat surprising considering all the other differences!). Possible approaches:
- Gibbs sampling
- Variable elimination
- Belief propagation

Learning: Recall the Bayes Net Approach
- In Bayes Nets, we go through each variable one at a time, row by row in the CPT, adjusting the entries
- One way to think of this approach: we look at each setting, ask what the probability of that setting is based on what we see in the data, and then adjust the CPT to be consistent with the data

Can We Use This Approach on Markov Nets?
No! Consider changing a single table value. This changes the partition function, Z. Thus, a local change to one table affects every other table; local changes have global effects!

Markov Net Learning
- We want the gradient of the likelihood function; we can then incrementally move each weight in the direction of the gradient, scaled by a learning rate η
- This approach amounts to taking the difference between the expected occurrences (under the current model) and the observed occurrences of each feature, computed as on the next slide

Markov Net Learning, continued
- Assume the dataset is composed of M datapoints, and consider computing the expected and observed occurrences of the feature A ^ B:
  - Expected occurrences under the current model: M x Pr(A ^ B)
  - Observed occurrences: the number of datapoints in which A and B both hold
- Using this approach, it can be shown that gradient ascent converges
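
As a worked example (numbers mine, purely illustrative): if M = 100, the current model gives Pr(A ^ B) = 0.3, and 40 datapoints have both A and B true, then the gradient component for that feature is 40 - 100 x 0.3 = 10, so its weight increases by 10η on this step.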

Log-Linear Models
- Equivalent to Markov Nets (though they look different)
- Take the natural log of each potential entry: a table over (A, B) with entries p_1, p_2, p_3, p_4 becomes a table with entries ln p_1, ln p_2, ln p_3, ln p_4

[Figure: the same network over A, B, C, D, E, with the logged table shown over (A, B)]

Log-Linear Models, continued
This change allows us to write the probability density function as:

P(x) = (1/Z) exp( Σ_i w_i f_i(x) ),   where exp(x) = e^x

- The weights w_i are the log potential values
- The features f_i are logical statements that evaluate to 1 or 0, also known as indicator functions
- For example, f_1 = a ^ b, f_2 = a ^ ~b

Analyzing the Log-Linear Form
- In this formulation, the w's are just weights and the f's are just features
- As such, we can throw the graph away if we want: everything we need is in the w_i's and f_i's
- In this view, parameter learning is just weight learning
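
A minimal Python sketch of this view (mine, not the slides'): the model is nothing but a weight vector and a list of indicator features; the graph never appears. Feature definitions and weights are illustrative.

    import math
    from itertools import product

    features = [lambda a, b: int(a and b),       # f1 = a ^ b
                lambda a, b: int(a and not b)]   # f2 = a ^ ~b
    weights = [math.log(3), math.log(1)]         # w_i = ln(potential entry)

    def unnorm(x):
        # exp of the weighted feature sum for assignment x
        return math.exp(sum(w * f(*x) for w, f in zip(weights, features)))

    Z = sum(unnorm(x) for x in product([0, 1], repeat=2))
    print({x: unnorm(x) / Z for x in product([0, 1], repeat=2)})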

Statistical Relational Learning (SRL)
- For the most part, up until now, we have assumed feature vectors as our data representation; in many cases, a database model is more likely
- Limitations of ILP (Inductive Logic Programming):
  - ILP learned rules and was somewhat robust to noise, but still used a closed-world model
  - There is little that is unconditionally true in the real world
- SRL addresses these limitations

Markov Logic
- Allows one to make statements that are usually true
- Attach a weight to each rule (a weight of 0 makes the rule vacuous); the probability of a setting that violates a rule drops off exponentially with the weight of the rule

Markov Logic, continued
- Variables need not be universally quantified, but we assume so here to ease notation
- Rules are mapped into a Markov Network
- Syntactically, we are dealing with predicate calculus (in our example, the constants are people)
- Semantically, we are dealing with a joint probability distribution over all facts (ground atomic formulas)
- A world is a truth assignment; each world has a probability based on the weights

Translating Markov Logic to a Markov Net
- Create ground instances of each rule through substitution
- Create a node for each fact (ground atom)
- Create an edge between two nodes if both appear in the same ground instance

Example Translated to a Markov Network
Rule: Friends(x,y) ^ Smokes(x) -> Smokes(y)
Facts (one node each): friends(joe,mary), friends(mary,joe), smokes(joe), smokes(mary), plus other facts like cancer(joe) and cancer(mary)
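
An illustrative Python sketch of the translation procedure from the previous slide, applied to this example (the rule representation and helper names are mine, not the slides'):

    from itertools import product

    constants = ["joe", "mary"]
    # Rule: Friends(x,y) ^ Smokes(x) -> Smokes(y), represented as the list of
    # atom templates appearing in it; to build the graph we only need to know
    # which atoms co-occur in a ground instance.
    rule_atoms = [("friends", ("x", "y")), ("smokes", ("x",)), ("smokes", ("y",))]

    nodes, edges = set(), set()
    for x, y in product(constants, repeat=2):
        sub = {"x": x, "y": y}
        # Ground each atom template under the substitution.
        ground = [f"{p}({','.join(sub[v] for v in args)})" for p, args in rule_atoms]
        nodes.update(ground)
        # Edge between every pair of atoms in the same ground instance.
        edges.update((a, b) for a in ground for b in ground if a < b)

    print(sorted(nodes))
    print(sorted(edges))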

Computing Weights
Consider the effect of one ground instance of the rule (call it r for convenience), with weight 1.1:

friends(mary,joe) ^ smokes(mary) -> smokes(joe)

Over the 8 truth assignments to friends(mary,joe), smokes(mary), and smokes(joe), exactly one assignment does not satisfy r (antecedent true, smokes(joe) false). That assignment is given value 1, while each of the other seven is given value exp(weight(r)) = e^1.1.
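
To check this numerically, a small sketch (mine) that assigns each of the 8 truth assignments its factor under this single ground clause:

    import math
    from itertools import product

    w = 1.1
    def satisfied(friends, smokes_mary, smokes_joe):
        # friends(mary,joe) ^ smokes(mary) -> smokes(joe)
        return (not (friends and smokes_mary)) or smokes_joe

    for state in product([False, True], repeat=3):
        factor = math.exp(w) if satisfied(*state) else 1.0
        print(state, round(factor, 3))  # e^1.1 ~ 3.004 for 7 states, 1.0 for one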

Markov Networks
- Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough)
- Potential functions defined over cliques:

P(x) = (1/Z) Π_c Φ_c(x_c),   Z = Σ_x Π_c Φ_c(x_c)

Example potential:

Smoking  Cancer  Φ(S,C)
False    False   4.5
False    True    4.5
True     False   2.7
True     True    4.5

Markov Networks
- Undirected graphical models (same example)
- Log-linear model:

P(x) = (1/Z) exp( Σ_i w_i f_i(x) )

where w_i is the weight of feature i and f_i is feature i, e.g.:

f_1(Smoking, Cancer) = 1 if ~Smoking v Cancer, 0 otherwise;   w_1 = 1.5

Hammersley-Clifford Theorem
If the distribution is strictly positive (P(x) > 0) and the graph encodes its conditional independences, then the distribution is a product of potentials over the cliques of the graph. The converse is also true. (Markov network = Gibbs distribution)

Markov Nets vs. Bayes Nets

Property          Markov Nets         Bayes Nets
Form              Prod. potentials    Prod. potentials
Potentials        Arbitrary           Cond. probabilities
Cycles            Allowed             Forbidden
Partition func.   Z = ?               Z = 1
Indep. check      Graph separation    D-separation
Indep. props.     Some                Some
Inference         MCMC, BP, etc.      Convert to Markov

Computing Probabilities
- Goal: compute marginals & conditionals of

P(X) = (1/Z) exp( Σ_i w_i f_i(X) ),   Z = Σ_X exp( Σ_i w_i f_i(X) )

- Exact inference is #P-complete
- Approximate inference:
  - Monte Carlo methods
  - Belief propagation
  - Variational approximations

Markov Chain Monte Carlo
- General algorithm: Metropolis-Hastings
  - Sample the next state given the current one according to a transition probability
  - Reject the new state with some probability to maintain detailed balance
- Simplest (and most popular) algorithm: Gibbs sampling
  - Sample one variable at a time given the rest:

P(x | MB(x)) = exp( Σ_i w_i f_i(x) ) / ( exp( Σ_i w_i f_i(x=0) ) + exp( Σ_i w_i f_i(x=1) ) )

where MB(x) is the Markov blanket of x and f_i(x=v) is feature i evaluated with x set to v and the rest of the state held fixed.

Learning Markov Networks
- Learning parameters (weights)
  - Generatively
  - Discriminatively
- Learning structure (features)
- In this lecture: assume complete data (if not: EM versions of the algorithms)

Generative Weight Learning
- Maximize likelihood or posterior probability
- Numerical optimization (gradient or 2nd order); no local maxima

∂/∂w_i log P_w(x) = n_i(x) - E_w[n_i(x)]

where n_i(x) is the number of times feature i is true in the data and E_w[n_i(x)] is the expected number of times feature i is true according to the model.
- Requires inference at each step (slow!)
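
A toy Python sketch of this gradient ascent (mine; it assumes a model small enough that the expected counts can be computed by brute-force enumeration, where real implementations use inference):

    import math
    from itertools import product

    features = [lambda a, b: int(a and b), lambda a, b: int(a or b)]  # illustrative
    data = [(1, 1), (1, 0), (1, 1), (0, 0)]                           # toy dataset
    w = [0.0, 0.0]
    eta = 0.1                                                         # learning rate

    def unnorm(x):
        return math.exp(sum(wi * f(*x) for wi, f in zip(w, features)))

    states = list(product([0, 1], repeat=2))
    for _ in range(500):
        Z = sum(unnorm(x) for x in states)
        grads = []
        for f in features:
            observed = sum(f(*x) for x in data)                  # n_i in the data
            expected = len(data) * sum(f(*x) * unnorm(x) / Z for x in states)
            grads.append(observed - expected)
        for i, g in enumerate(grads):
            w[i] += eta * g                                      # gradient ascent step

    print(w)  # expected feature counts now (approximately) match the observed ones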

Pseudo-Likelihood

PL(x) ≡ Π_i P(x_i | neighbors(x_i))

- Likelihood of each variable given its neighbors in the data
- Does not require inference at each step
- Consistent estimator
- Widely used in vision, spatial statistics, etc.
- But PL parameters may not work well for long inference chains
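
A sketch of the pseudo-log-likelihood for the same toy log-linear model (mine; note that each conditional only requires comparing a state against the state with that one variable flipped, so no partition function is needed):

    import math

    features = [lambda a, b: int(a and b), lambda a, b: int(a or b)]  # illustrative
    w = [0.5, -0.2]                                                   # example weights

    def weighted_sum(x):
        return sum(wi * f(*x) for wi, f in zip(w, features))

    def pseudo_log_likelihood(data):
        # Sum over datapoints and variables of log P(x_i | rest).
        total = 0.0
        for x in data:
            for i in range(len(x)):
                flipped = list(x)
                flipped[i] = 1 - flipped[i]
                s, s_flip = weighted_sum(x), weighted_sum(flipped)
                # P(x_i = current | rest) = e^s / (e^s + e^s_flip)
                total += s - math.log(math.exp(s) + math.exp(s_flip))
        return total

    print(pseudo_log_likelihood([(1, 1), (1, 0), (0, 0)]))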

Structure Learning
- Start with atomic features
- Greedily conjoin features to improve the score
- Problem: need to re-estimate the weights for each new candidate
- Approximation: keep the weights of previous features constant

Inference in Markov Networks
- Goal: compute marginals & conditionals of

P(X) = (1/Z) exp( Σ_i w_i f_i(X) ),   Z = Σ_X exp( Σ_i w_i f_i(X) )

- Exact inference is #P-complete
- Conditioning on the Markov blanket of a proposition x is easy, because you only have to consider the cliques (formulas) that involve x:

P(x | MB(x)) = exp( Σ_i w_i f_i(x) ) / ( exp( Σ_i w_i f_i(x=0) ) + exp( Σ_i w_i f_i(x=1) ) )

- Gibbs sampling exploits this

MCMC: Gibbs Sampling

state <- random truth assignment
for i <- 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state <- state with new value of x
P(F) <- fraction of sampled states in which F is true
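
The pseudocode above made concrete as a Python sketch for the toy two-feature model used earlier (mine; P(x_i | rest) is computed exactly as in the Markov-blanket formula, by comparing the state with x_i = 1 against x_i = 0):

    import math, random

    features = [lambda a, b: int(a and b), lambda a, b: int(a or b)]  # illustrative
    w = [math.log(4), math.log(0.5)]                                  # example weights

    def weighted_sum(x):
        return sum(wi * f(*x) for wi, f in zip(w, features))

    def gibbs(num_samples, query):
        random.seed(0)
        state = [random.randint(0, 1) for _ in range(2)]
        hits = 0
        for _ in range(num_samples):
            for i in range(len(state)):
                on, off = list(state), list(state)
                on[i], off[i] = 1, 0
                p1 = math.exp(weighted_sum(on))
                p1 = p1 / (p1 + math.exp(weighted_sum(off)))  # P(x_i = 1 | rest)
                state[i] = 1 if random.random() < p1 else 0
            hits += query(*state)
        return hits / num_samples

    # Estimate P(a ^ b); for these weights the exact value is 0.5.
    print(gibbs(100000, lambda a, b: int(a and b)))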

Discriminative Weight Learning
- Maximize the conditional likelihood of the query (y) given the evidence (x):

∂/∂w_i log P_w(y | x) = n_i(x, y) - E_w[n_i(x, y)]

where n_i(x, y) is the number of true groundings of clause i in the data and E_w[n_i(x, y)] is the expected number of true groundings according to the model.
- Approximate the expected counts by the counts in the MAP state of y given x

Rule Induction
- Given: a set of positive and negative examples of some concept
  - Example: (x_1, x_2, ..., x_n, y)
  - y: the concept (Boolean)
  - x_1, x_2, ..., x_n: attributes (assume Boolean)
- Goal: induce a set of rules that cover all positive examples and no negative ones
- Rule: x_a ^ x_b ^ ... -> y (x_a: a literal, i.e., x_i or its negation)
  - Same as a Horn clause: Body -> Head
  - Rule r covers example x iff x satisfies the body of r
- Eval(r): accuracy, info. gain, coverage, support, etc.

Learning a Single Rule

head <- y
body <- Ø
repeat
    for each literal x
        r_x <- r with x added to the body
        compute Eval(r_x)
    body <- body ^ best x
until no x improves Eval(r)
return r

Learning a Set of Rules

R <- Ø
S <- examples
repeat
    learn a single rule r
    R <- R U { r }
    S <- S - positive examples covered by r
until S contains no positive examples
return R
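
The two procedures above sketched together in Python (an illustrative implementation of sequential covering using accuracy as Eval; all names and the toy data are mine):

    def covers(body, x):
        # body is a set of literals (attr_index, required_value)
        return all(x[i] == val for i, val in body)

    def eval_rule(body, examples):
        # Accuracy of the rule on the examples it covers
        covered = [y for x, y in examples if covers(body, x)]
        return sum(covered) / len(covered) if covered else 0.0

    def learn_single_rule(examples, n_attrs):
        body = set()
        while True:
            candidates = [(i, v) for i in range(n_attrs) for v in (0, 1)
                          if (i, v) not in body]
            best = max(candidates, key=lambda lit: eval_rule(body | {lit}, examples))
            if eval_rule(body | {best}, examples) <= eval_rule(body, examples):
                return body   # no literal improves Eval
            body.add(best)

    def learn_rule_set(examples, n_attrs):
        rules, S = [], list(examples)
        while any(y for _, y in S):               # positive examples remain
            r = learn_single_rule(S, n_attrs)
            rules.append(r)
            S = [(x, y) for x, y in S if not (y and covers(r, x))]
        return rules

    data = [((1, 1, 0), 1), ((1, 0, 0), 1), ((0, 1, 1), 0), ((0, 0, 0), 0)]
    print(learn_rule_set(data, 3))   # e.g. [{(0, 1)}], i.e. "x_0 = 1 -> y"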

First-Order Rule Induction
- y and the x_i are now predicates with arguments
  - E.g.: y is Ancestor(x,y), x_i is Parent(x,y)
- Literals to add are predicates or their negations
- A literal to add must include at least one variable already appearing in the rule
- Adding a literal changes the number of groundings of the rule
  - E.g.: Ancestor(x,z) ^ Parent(z,y) -> Ancestor(x,y)
- Eval(r) must take this into account
  - E.g.: multiply by the number of positive groundings of the rule still covered after adding the literal