Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery: Graphical Models
Zoubin Ghahramani & Teddy Seidenfeld
zoubin@cs.cmu.edu & teddy@stat.cmu.edu
CALD / CS / Statistics / Philosophy, Carnegie Mellon University
Spring 2002

Some Examples

[Figures: latent variables X_1, ..., X_K with observations Y_1, ..., Y_D and loading matrix Λ for factor analysis, probabilistic PCA, and ICA; a state chain X_1, ..., X_T emitting Y_1, ..., Y_T for hidden Markov models and linear dynamical systems; chains of switch variables S_t and states U_t with X_t and Y_t for switching state-space models.]

Markov Networks (Undirected Graphical Models)

[Figure: an undirected graph on nodes A, B, C, D, E.]

Examples: Boltzmann machines, Markov random fields.

Semantics: every node is conditionally independent of its non-neighbors given its neighbors.

Conditional independence: X ⊥ Y | V ⟺ p(X | Y, V) = p(X | V) when p(Y, V) > 0; also X ⊥ Y | V ⟺ p(X, Y | V) = p(X | V) p(Y | V).

Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V.

Markov boundary: a minimal Markov blanket.
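
To make the conditional-independence definition concrete, here is a small Python sketch (all tables and variable names are invented for illustration) that constructs a joint over three binary variables in which X ⊥ Y | V holds by construction, and then checks p(X | Y, V) = p(X | V) numerically.

```python
import numpy as np

# Hypothetical binary variables X, Y, V, built so that X ⊥ Y | V by construction.
p_v = np.array([0.3, 0.7])                      # p(V)
p_x_given_v = np.array([[0.9, 0.1],             # p(X | V=0)
                        [0.2, 0.8]])            # p(X | V=1)
p_y_given_v = np.array([[0.6, 0.4],             # p(Y | V=0)
                        [0.5, 0.5]])            # p(Y | V=1)

# joint[x, y, v] = p(V=v) p(X=x | V=v) p(Y=y | V=v)
joint = np.einsum('v,vx,vy->xyv', p_v, p_x_given_v, p_y_given_v)

p_yv = joint.sum(axis=0)                        # p(Y, V)
p_x_given_yv = joint / p_yv                     # p(X | Y, V)
p_xv = joint.sum(axis=1)                        # p(X, V)
p_x_given_v_check = p_xv / p_xv.sum(axis=0)     # p(X | V)

# X ⊥ Y | V  ⟺  p(X | Y, V) = p(X | V) wherever p(Y, V) > 0
print(np.allclose(p_x_given_yv, p_x_given_v_check[:, None, :]))   # True
```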

Clique Potentials and Markov Networks

Definition: a clique is a fully connected subgraph (usually maximal). C_i will denote the set of variables in the i-th clique.

[Figure: an undirected graph on nodes A, B, C, D, E.]

1. Identify the cliques of the graph G.
2. For each clique C_i, assign a non-negative function g_i(C_i) which measures compatibility.
3. p(X_1, ..., X_n) = (1/Z) ∏_i g_i(C_i), where Z = Σ_{X_1} ... Σ_{X_n} ∏_i g_i(C_i) is the normalization constant.

The graph G embodies the conditional independencies in p (i.e. G is a Markov field relative to p): if V lies on all paths between X and Y in G, then X ⊥ Y | V.
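
A minimal sketch of steps 2-3 in Python, assuming a hypothetical three-node chain A - B - C with cliques {A, B} and {B, C}; the potential values are made up, and only the formula p = (1/Z) ∏_i g_i(C_i) comes from the slide.

```python
import numpy as np

# Hypothetical potentials for an undirected chain A - B - C (binary variables).
g_ab = np.array([[4.0, 1.0],
                 [1.0, 4.0]])       # g_1(A, B): prefers A == B
g_bc = np.array([[3.0, 1.0],
                 [1.0, 3.0]])       # g_2(B, C): prefers B == C

# Unnormalised product of clique potentials: phi[a, b, c] = g_1(a, b) g_2(b, c)
phi = np.einsum('ab,bc->abc', g_ab, g_bc)

Z = phi.sum()                        # partition function: sum over all configurations
p = phi / Z                          # p(A, B, C) = (1/Z) g_1(A, B) g_2(B, C)
print(p.sum())                       # 1.0: a valid joint distribution

# B lies on every path between A and C, so A ⊥ C | B should hold:
p_b = p.sum(axis=(0, 2))
p_ab = p.sum(axis=2)
p_bc = p.sum(axis=0)
print(np.allclose(p, np.einsum('ab,bc->abc', p_ab, p_bc) / p_b[None, :, None]))  # True
```

The final check confirms the separation property stated above: since B separates A from C in the chain, the joint factors as p(A, B) p(B, C) / p(B).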

Hammersley–Clifford Theorem (1971)

Theorem: a probability distribution p formed by a normalized product of positive functions on the cliques of G is a Markov field relative to G.

Definition: the graph G is a Markov field relative to p if it does not imply any conditional independence relationships that are not true in p. (We are usually interested in the minimal such graph.)

Proof: we need to show that the neighbors of X, ne(X), are a Markov blanket for X. Splitting the cliques into those that contain X and those that do not,

p(X, Y, ...) = (1/Z) ∏_i g_i(C_i) = (1/Z) ∏_{i: X ∈ C_i} g_i(C_i) ∏_{j: X ∉ C_j} g_j(C_j) = (1/Z) f_1(X, ne(X)) f_2(ne(X), Y) ∝ p(X | ne(X)) p(Y | ne(X))

This shows that p(X, Y | ne(X)) = p(X | ne(X)) p(Y | ne(X)), i.e. X ⊥ Y | ne(X).

Problems with Markov Networks

Many useful independencies are unrepresented: two variables are connected merely because some other variable depends on them.

[Figure: Rain and Sprinkler with a common child Ground wet, shown as a directed graph and as the corresponding undirected graph in which Rain and Sprinkler are linked.]

Marginal independence vs. conditional independence: Rain and Sprinkler are marginally independent but become dependent once Ground wet is observed (explaining away), a pattern the undirected graph cannot express.

Bayesian Networks (Directed Graphical Models)

[Figure: a directed acyclic graph on nodes A, B, C, D, E.]

Semantics: X ⊥ Y | V if V d-separates X from Y.

Definition: V d-separates X from Y if along every undirected path between X and Y there is a node W such that either:
1. W has converging arrows along the path (→ W ←) and neither W nor its descendants are in V, or
2. W does not have converging arrows along the path and W ∈ V.

d-separation can be checked efficiently with the Bayes-ball algorithm.

Corollary: the Markov blanket of X is {parents(X) ∪ children(X) ∪ parents-of-children(X)}.
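
The Markov-blanket corollary can be read straight off the graph. The sketch below does so for a DAG stored as a hypothetical edge list; the edges match the five-node example whose factorization appears on the next slide, and the helper function name is my own.

```python
# Markov blanket of a node in a DAG: parents ∪ children ∪ parents-of-children.
edges = [('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D'), ('C', 'E'), ('D', 'E')]

def markov_blanket(node, edges):
    parents   = {u for (u, v) in edges if v == node}
    children  = {v for (u, v) in edges if u == node}
    coparents = {u for (u, v) in edges if v in children and u != node}
    return parents | children | coparents

print(markov_blanket('C', edges))   # {'A', 'B', 'D', 'E'}
```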

Bayesian Networks (Directed Graphical Models)

[Figure: a directed acyclic graph on nodes A, B, C, D, E with edges A→C, B→C, B→D, C→D, C→E, D→E.]

A Bayesian network corresponds to a factorization of the joint probability distribution:

p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)

In general:

p(X_1, ..., X_n) = ∏_{i=1}^n p(X_i | X_pa(i))

where pa(i) are the parents of node i.
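
As an illustration of this factorization, the following sketch assembles the joint p(A, B, C, D, E) from made-up conditional probability tables for binary variables and confirms that it sums to one; only the factorization itself is from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(parent_shape):
    """A random CPT p(child | parents): entries over the child axis sum to 1."""
    t = rng.random(parent_shape + (2,))
    return t / t.sum(axis=-1, keepdims=True)

# Made-up CPTs for binary variables A, B, C, D, E.
pA = random_cpt(())               # p(A)
pB = random_cpt(())               # p(B)
pC = random_cpt((2, 2))           # p(C | A, B), indexed [a, b, c]
pD = random_cpt((2, 2))           # p(D | B, C), indexed [b, c, d]
pE = random_cpt((2, 2))           # p(E | C, D), indexed [c, d, e]

# p(A, B, C, D, E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
joint = np.einsum('a,b,abc,bcd,cde->abcde', pA, pB, pC, pD, pE)
print(joint.shape, joint.sum())   # (2, 2, 2, 2, 2) 1.0
```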

From Bayesian Trees to Markov Trees

[Figure: a tree on nodes 1-7 with root 3; 3 has children 1, 2, and 4, and 4 has children 5, 6, and 7.]

p(1, 2, ..., 7) = p(3) p(1|3) p(2|3) p(4|3) p(5|4) p(6|4) p(7|4)
= [p(1,3) p(2,3) p(3,4) p(4,5) p(4,6) p(4,7)] / [p(3) p(3) p(4) p(4) p(4)]   (product of cliques / product of clique intersections)
= g(1,3) g(2,3) g(3,4) g(4,5) g(4,6) g(4,7) = ∏_i g_i(C_i)
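
The clique/separator identity above can be verified numerically. The sketch below builds a Bayesian tree with this structure from randomly generated tables over binary variables, enumerates the joint, and checks it against the product of pairwise clique marginals divided by the product of separator marginals; everything except the identity being tested is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def cpt(*parent_shape):
    t = rng.random(parent_shape + (2,))
    return t / t.sum(axis=-1, keepdims=True)

# Tree: root 3 with children 1, 2, 4; node 4 with children 5, 6, 7 (binary variables).
p3 = cpt()
p1_3, p2_3, p4_3 = cpt(2), cpt(2), cpt(2)     # p(1|3), p(2|3), p(4|3), indexed [parent, child]
p5_4, p6_4, p7_4 = cpt(2), cpt(2), cpt(2)     # p(5|4), p(6|4), p(7|4)

# joint over (x1, ..., x7): axes a..g correspond to variables 1..7
joint = np.einsum('c,ca,cb,cd,de,df,dg->abcdefg',
                  p3, p1_3, p2_3, p4_3, p5_4, p6_4, p7_4)

def marg(*keep):
    """Marginal over the listed variables (1-indexed), axes kept in ascending order."""
    return joint.sum(axis=tuple(i for i in range(7) if i + 1 not in keep))

# product of cliques / product of clique intersections
num = np.einsum('ac,bc,cd,de,df,dg->abcdefg',
                marg(1, 3), marg(2, 3), marg(3, 4), marg(4, 5), marg(4, 6), marg(4, 7))
den = np.einsum('c,c,d,d,d->cd', marg(3), marg(3), marg(4), marg(4), marg(4))
print(np.allclose(joint, num / den[None, None, :, :, None, None, None]))   # True
```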

Belief Propagation (in Singly Connected Bayesian Networks)

Definition: a singly connected Bayesian network (S.C.B.N.) has an underlying undirected graph which is a tree, i.e. there is only one path between any two nodes.

Goal: for some node X we want to compute p(X | e) given evidence e.

Since we are considering S.C.B.N.s:
- every node X divides the evidence into upstream evidence e⁺_X and downstream evidence e⁻_X;
- every arc X → Y divides the evidence into upstream evidence e⁺_XY and downstream evidence e⁻_XY.

The Three Key Ideas behind Belief Propagation

Idea 1: our belief about the variable X can be found by combining upstream and downstream evidence:

p(X | e) = p(X, e) / p(e) = p(X, e⁺_X, e⁻_X) / p(e⁺_X, e⁻_X) ∝ p(X | e⁺_X) p(e⁻_X | X, e⁺_X) = p(X | e⁺_X) p(e⁻_X | X) = π(X) λ(X)

(the last step uses the fact that X d-separates e⁻_X from e⁺_X).

Idea 2: the upstream and downstream evidence can be computed via a local message-passing algorithm between the nodes in the graph.

Idea 3: don't send back to a node (any part of) the message it sent to you!

Belief Propagation

[Figure: node X with parents U_1, ..., U_n and children Y_1, ..., Y_m.]

Top-down causal support: π_X(U_i) = p(U_i | e⁺_{U_i X})
Bottom-up diagnostic support: λ_{Y_j}(X) = p(e⁻_{X Y_j} | X)

To update the belief about X:

BEL(X) = (1/Z) λ(X) π(X)
λ(X) = ∏_j λ_{Y_j}(X)
π(X) = Σ_{U_1, ..., U_n} p(X | U_1, ..., U_n) ∏_i π_X(U_i)

Belief Propagation, cont.

[Figure: node X with parents U_1, ..., U_n and children Y_1, ..., Y_m.]

Top-down causal support: π_X(U_i) = p(U_i | e⁺_{U_i X})
Bottom-up diagnostic support: λ_{Y_j}(X) = p(e⁻_{X Y_j} | X)

Bottom-up propagation, X sends to U_i:

λ_X(U_i) = (1/Z) Σ_X λ(X) Σ_{U_k: k ≠ i} p(X | U_1, ..., U_n) ∏_{k ≠ i} π_X(U_k)

Top-down propagation, X sends to Y_j:

π_{Y_j}(X) = (1/Z) [∏_{k ≠ j} λ_{Y_k}(X)] Σ_{U_1, ..., U_n} p(X | U_1, ..., U_n) ∏_i π_X(U_i) = (1/Z) BEL(X) / λ_{Y_j}(X)
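
To see the π and λ messages in action in the simplest possible setting, here is a sketch of exact belief propagation on a hypothetical three-node chain A → B → C with evidence on C, using the update equations above specialised to a single parent and a single child, and checked against brute-force enumeration. The CPTs are made up.

```python
import numpy as np

# Chain A -> B -> C, binary variables, with evidence C = 1.
pA = np.array([0.6, 0.4])
pB_A = np.array([[0.7, 0.3],     # p(B | A=0)
                 [0.2, 0.8]])    # p(B | A=1)
pC_B = np.array([[0.9, 0.1],     # p(C | B=0)
                 [0.4, 0.6]])    # p(C | B=1)
evidence_c = 1

# pi messages flow down from the root, lambda messages flow up from the evidence.
pi_A = pA                                    # pi(A) = p(A): no upstream evidence
pi_B = pi_A @ pB_A                           # pi(B) = sum_a p(B | a) pi(a)
lam_C = np.eye(2)[evidence_c]                # lambda(C): indicator of the observed value
lam_B = pC_B @ lam_C                         # lambda_C(B) = sum_c p(c | B) lambda(c)

bel_B = pi_B * lam_B                         # BEL(B) = (1/Z) pi(B) lambda(B)
bel_B /= bel_B.sum()

# Brute-force check: p(B | C=1) by enumeration of the joint.
joint = np.einsum('a,ab,bc->abc', pA, pB_A, pC_B)
check = joint[:, :, evidence_c].sum(axis=0)
check /= check.sum()
print(np.allclose(bel_B, check))             # True
```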

Belief Propagation in Multiply Connected Bayesian Networks

The junction tree algorithm: form an undirected graph from your directed graph such that no additional conditional independence relationships have been created (this step is called moralization). Lump variables in cliques together and form a tree of cliques; this may require a nasty step called triangulation. Do inference in this tree.

Cutset conditioning, or reasoning by assumptions: find a small set of variables which, if they were given (i.e. known), would render the remaining graph singly connected. For each value of these variables run belief propagation on the singly connected network. Average the resulting beliefs with the appropriate weights.

Loopy belief propagation: just use BP even though there are loops. In this case the terms upstream and downstream are not clearly defined. There is no guarantee of convergence, but it often works well in practice.
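
Moralization itself is a purely graph-level operation: marry every pair of parents that share a child, then drop the edge directions. A small sketch, reusing the hypothetical edge-list representation from earlier:

```python
from itertools import combinations

def moralize(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG."""
    undirected = {frozenset(e) for e in dag_edges}
    children = {v for (_, v) in dag_edges}
    for child in children:
        parents = [u for (u, v) in dag_edges if v == child]
        # Marry all pairs of parents of each child.
        for u, w in combinations(parents, 2):
            undirected.add(frozenset((u, w)))
    return undirected

edges = [('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D'), ('C', 'E'), ('D', 'E')]
print(sorted(tuple(sorted(e)) for e in moralize(edges)))
# Adds the marrying edge ('A', 'B'); the other parent pairs are already connected.
```

Triangulation and the construction of the tree of cliques would then operate on this undirected graph.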

Learning with Hidden Variables: The EM Algorithm

[Figure: a Bayesian network with hidden variables X_1, X_2, X_3, observed variable Y, and parameters θ_1, θ_2, θ_3, θ_4.]

Assume a model parameterised by θ with observable variables Y and hidden variables X.

Goal: maximise the log likelihood of the observables:

L(θ) = ln p(Y | θ) = ln Σ_X p(Y, X | θ)

E-step: first infer p(X | Y, θ_old), then
M-step: find θ_new using complete-data learning.

The E-step requires solving the inference problem: finding explanations, X, for the data, Y, given the current model θ (using e.g. belief propagation).
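
As a self-contained illustration of the E-step / M-step alternation, here is a sketch of EM for a two-component 1-D Gaussian mixture: the hidden variable is the component label, the observed variable is the data point, and θ collects the mixing weight, means, and variances. The data and model are invented for illustration; for the graphical models above, the E-step posterior would instead be computed with belief propagation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a two-component 1-D Gaussian mixture (hidden label X, observed Y).
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial parameters theta = (mixing weight, means, variances).
w, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_gauss(y, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

for it in range(100):
    # E-step: responsibilities r[n, k] = p(X_n = k | y_n, theta_old)
    log_r = np.stack([np.log(w) + log_gauss(y, mu[0], var[0]),
                      np.log(1 - w) + log_gauss(y, mu[1], var[1])], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate theta_new by weighted maximum likelihood (complete-data learning).
    nk = r.sum(axis=0)
    w = nk[0] / len(y)
    mu = (r * y[:, None]).sum(axis=0) / nk
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)   # roughly 0.3, (-2, 3), (1, 2.25)
```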

Expressive Power of Bayesian and Markov Networks

[Figure: an undirected graph.] No Bayesian network can represent these and only these independencies: no matter how we direct the arrows there will always be two non-adjacent parents sharing a common child, which creates a dependence in the Bayesian network where the Markov network has an independence.

[Figure: a directed graph with converging arrows, as in the explaining-away example.] No Markov network can represent these and only these independencies.