Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning / Deep Learning
Hyeong In Choi, Seoul National University

Lecture 5b: Markov random field (MRF)
November 13, 2015

Table of contents
1. Objectives of Lecture 5b
2. Markov random field (MRF)
   2.1. Basics of MRF
   2.2. Boltzmann machine

1. Objectives of Lecture 5b

Objective 1: Learn the minimal MRF formalism necessary for understanding the pretraining of deep neural networks with the restricted Boltzmann machine.
Objective 2: Learn how the probability structure is encoded in an MRF, in particular the energy-based formalism of the Boltzmann machine.
Objective 3: Learn some basic formalism of the restricted Boltzmann machine.

2. Markov random field (MRF)
2.1. Basics of MRF

Terminology
G: an undirected graph (not necessarily a tree) in which each node represents a random variable.
Let X_i be the random variable represented by node i, and let x_i be the value of X_i [we frequently identify node i with x_i].
x = (x_1, \dots, x_n) is the list of the values of all random variables.
The joint probability is denoted by P(x) = P(x_1, \dots, x_n).
For each node x_i, let N(x_i) be the neighborhood of x_i, i.e. N(x_i) is the set of nodes connected to x_i.

Definition of MRF
We say P(x) satisfies the Markov property if

    P(X_i = x_i | X_j = x_j \text{ for } j \neq i) = P(X_i = x_i | X_j = x_j \text{ for } x_j \in N(x_i))

The graph G together with a distribution P(x) satisfying the Markov property is called a Markov random field (MRF).

Proposition
Let G be an MRF and let A, B, C be mutually disjoint sets of nodes of G. Assume A and B are separated by C, meaning that every path from a node in A to a node in B passes through some node in C. Then

    P(A, B | C) = P(A | C) P(B | C),

i.e. A and B are conditionally independent given C. The converse is obviously true.

Example (graph figure on the slide, not reproduced in the transcription)

Gibbs distributions

Definition
A clique is a set of nodes every one of which is connected to every other node in the set.
A probability distribution P(x) is called a Gibbs distribution if it is of the form

    P(x) = \prod_{c \in C} \psi_c(x_c),

where C is the set of maximal cliques, \psi_c is a non-negative function, and x_c is the list of variables in the clique c.

Example
Maximal cliques: c_1 = {x_1, x_2, x_3}, c_2 = {x_2, x_3, x_4}, c_3 = {x_3, x_5}

    P(x) = \psi_1(x_1, x_2, x_3) \psi_2(x_2, x_3, x_4) \psi_3(x_3, x_5)

Theorem (Hammersley-Clifford)
Assume P(x) > 0 for all x. Then P(x) is a Gibbs distribution with respect to G if and only if G (with P) is an MRF.
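To make the factorization and the separation proposition concrete, here is a minimal sketch (not from the lecture): it builds the Gibbs distribution of the example above from three arbitrary positive potentials and checks by brute force that x_1 and x_5, which are separated by {x_3}, are conditionally independent given x_3.

```python
# A minimal sketch (not from the lecture): brute-force check, on the example
# above, that separation implies conditional independence.  The potentials
# psi1, psi2, psi3 are arbitrary positive functions chosen for illustration.
import itertools
import numpy as np

def psi1(x1, x2, x3): return np.exp(0.5 * x1 * x2 + 0.3 * x2 * x3)
def psi2(x2, x3, x4): return np.exp(0.7 * x2 * x4 - 0.2 * x3 * x4)
def psi3(x3, x5):     return np.exp(1.1 * x3 * x5)

vals = [-1, 1]
# Unnormalized Gibbs distribution P(x) proportional to psi1 * psi2 * psi3
joint = {x: psi1(x[0], x[1], x[2]) * psi2(x[1], x[2], x[3]) * psi3(x[2], x[4])
         for x in itertools.product(vals, repeat=5)}

def p_x1_x5_given_x3(x1, x5, x3):
    num = sum(v for x, v in joint.items()
              if x[0] == x1 and x[4] == x5 and x[2] == x3)
    den = sum(v for x, v in joint.items() if x[2] == x3)
    return num / den

def p_x1_given_x3(x1, x3):
    return sum(p_x1_x5_given_x3(x1, x5, x3) for x5 in vals)

def p_x5_given_x3(x5, x3):
    return sum(p_x1_x5_given_x3(x1, x5, x3) for x1 in vals)

# x1 and x5 are separated by {x3}: every path between them passes through x3,
# so the proposition predicts P(x1, x5 | x3) = P(x1 | x3) P(x5 | x3).
for x1, x5, x3 in itertools.product(vals, repeat=3):
    lhs = p_x1_x5_given_x3(x1, x5, x3)
    rhs = p_x1_given_x3(x1, x3) * p_x5_given_x3(x5, x3)
    assert abs(lhs - rhs) < 1e-9
print("separation => conditional independence verified on this example")
```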

2.2. Boltzmann machine

G: undirected graph
x_i \in \{-1, 1\} or x_i \in \{0, 1\}

E: Energy

    E(x) = -\sum_{i \sim j} \omega_{ij} x_i x_j - \sum_i b_i x_i

where i \sim j means the sum is over adjacent nodes i and j with i < j.

P: Probability

    P(x) = \frac{1}{Z} \exp(-\lambda E(x)),

where Z is the partition function

    Z = \sum_x \exp(-\lambda E(x))

[We usually set \lambda = 1]
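As an illustration (not from the lecture), the sketch below enumerates all configurations of a three-node Boltzmann machine with made-up weights and biases, computes the partition function Z by brute force, and prints each configuration's energy and probability; lower-energy configurations receive higher probability.

```python
# A minimal sketch (not from the lecture): a tiny Boltzmann machine on three
# nodes with x_i in {-1, +1}.  The edge set, weights w_ij and biases b_i are
# made-up values; Z and P(x) are computed by brute-force enumeration.
import itertools
import numpy as np

edges = {(0, 1): 0.8, (1, 2): -0.5, (0, 2): 0.3}   # w_ij for i ~ j, i < j
b = np.array([0.1, -0.2, 0.0])
lam = 1.0                                          # lambda, usually set to 1

def energy(x):
    # E(x) = - sum_{i~j} w_ij x_i x_j - sum_i b_i x_i
    return -sum(w * x[i] * x[j] for (i, j), w in edges.items()) - float(b @ x)

configs = [np.array(cfg) for cfg in itertools.product([-1, 1], repeat=3)]
Z = sum(np.exp(-lam * energy(x)) for x in configs)     # partition function

for x in configs:
    p = np.exp(-lam * energy(x)) / Z
    print(x, f"E = {energy(x):+.3f}   P = {p:.4f}")    # lower E, higher P
```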

Restricted Boltzmann machine (RBM)

Notation
x = (x_1, \dots, x_d) (visible units)
h = (h_1, \dots, h_n) (hidden units)
(x, h) = (x_1, \dots, x_d, h_1, \dots, h_n)

Energy

    E(x, h) = -\sum_{i,j} w_{ij} h_i x_j - \sum_j b_j x_j - \sum_i c_i h_i

Probability

    P(x, h) = \frac{1}{Z} \exp(-E(x, h)), \qquad Z = \sum_{x,h} \exp(-E(x, h)) \Big[ = \sum_x \sum_h \exp(-E(x, h)) \Big]

Note
The lower the energy, the higher the probability.
If w_{ij} > 0, it is more likely that x_j and h_i have the same sign.
If w_{ij} < 0, it is more likely that x_j and h_i have opposite signs.
If b_j > 0, it is more likely that x_j > 0, and so on.

Probabilities of RBM
Write

    E(x, h) = -h^T W x - b^T x - c^T h,

where W = (w_{ij}), h = [h_1, \dots, h_n]^T, x = [x_1, \dots, x_d]^T. Then

    P(x, h) = \frac{1}{Z} \exp(h^T W x + b^T x + c^T h)

    P(x) = \sum_h P(x, h)

    P(h) = \sum_x P(x, h)
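The following sketch (not part of the lecture) writes the same quantities in code for a tiny binary RBM with random illustrative parameters W, b, c, computing Z, the joint P(x, h), and both marginals by brute-force enumeration.

```python
# A minimal sketch (not from the lecture): the matrix form of the RBM energy
# and its joint/marginal probabilities for a tiny model.  W, b, c are random
# illustrative parameters; everything is computed by brute-force enumeration.
import itertools
import numpy as np

d, n = 3, 2                          # d visible units x_j, n hidden units h_i
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(n, d)), rng.normal(size=d), rng.normal(size=n)

def energy(x, h):
    # E(x, h) = -h^T W x - b^T x - c^T h
    return -(h @ W @ x + b @ x + c @ h)

X = [np.array(v) for v in itertools.product([0, 1], repeat=d)]
H = [np.array(v) for v in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(x, h)) for x in X for h in H)

P   = lambda x, h: np.exp(-energy(x, h)) / Z     # joint P(x, h)
P_x = lambda x: sum(P(x, h) for h in H)          # marginal P(x)
P_h = lambda h: sum(P(x, h) for x in X)          # marginal P(h)

print(sum(P_x(x) for x in X), sum(P_h(h) for h in H))   # both ~ 1.0
```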

P(h | x)
Given x, h_i and h_j (i \neq j) are separated, i.e. conditionally independent. Thus

    P(h | x) = P(h_1, \dots, h_n | x) = \prod_i P(h_i | x)

Remark
This fact can also be proved directly as follows. Let W_i denote the i-th row of W. Then

    h^T W x = \sum_i h_i W_i x.

Thus

    P(h | x) = \frac{\exp(h^T W x + b^T x + c^T h)}{\sum_h \exp(h^T W x + b^T x + c^T h)}
             = \frac{\prod_i \exp(h_i W_i x + c_i h_i)}{\sum_{h_1, \dots, h_n} \prod_i \exp(h_i W_i x + c_i h_i)}

             = \frac{\prod_i \exp(h_i W_i x + c_i h_i)}{\prod_i \sum_{h_i} \exp(h_i W_i x + c_i h_i)}
             = \prod_i \frac{\frac{1}{Z} \exp(h_i W_i x + b^T x + c_i h_i)}{\frac{1}{Z} \sum_{h_i} \exp(h_i W_i x + b^T x + c_i h_i)}
             = \prod_i \frac{P(x, h_i)}{\sum_{h_i} P(x, h_i)}
             = \prod_i P(h_i | x)
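A quick numerical check of this factorization (again a sketch with made-up parameters, not from the lecture): it compares the brute-force conditional P(h | x) with the product of the per-unit conditionals appearing in the derivation above.

```python
# A minimal sketch (not from the lecture): numerically check, for a tiny binary
# RBM with made-up parameters, that P(h | x) equals the product of the per-unit
# conditionals obtained in the derivation above.
import itertools
import numpy as np

d, n = 3, 2
rng = np.random.default_rng(1)
W, b, c = rng.normal(size=(n, d)), rng.normal(size=d), rng.normal(size=n)

X = [np.array(v) for v in itertools.product([0, 1], repeat=d)]
H = [np.array(v) for v in itertools.product([0, 1], repeat=n)]
unnorm = lambda x, h: np.exp(h @ W @ x + b @ x + c @ h)   # = exp(-E(x, h))

def p_h_given_x(h, x):
    # Brute-force conditional: exp(-E(x, h)) / sum over h' of exp(-E(x, h'))
    return unnorm(x, h) / sum(unnorm(x, hp) for hp in H)

def p_hi_given_x(i, hi, x):
    # Per-unit conditional exp(h_i W_i x + c_i h_i) / sum_{h_i} exp(...)
    e = lambda v: np.exp(v * (W[i] @ x) + c[i] * v)
    return e(hi) / (e(0) + e(1))

for x in X:
    for h in H:
        prod = np.prod([p_hi_given_x(i, h[i], x) for i in range(n)])
        assert abs(p_h_given_x(h, x) - prod) < 1e-10
print("P(h | x) = prod_i P(h_i | x) confirmed numerically")
```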

Special case: binary neurons
Assume x_j \in \{0, 1\} and h_i \in \{0, 1\}. Then

    P(h_i | x) = \frac{\exp(h_i W_i x + c_i h_i)}{\sum_{h_i \in \{0,1\}} \exp(h_i W_i x + c_i h_i)}

Thus

    P(h_i = 1 | x) = \frac{\exp(W_i x + c_i)}{1 + \exp(W_i x + c_i)} = \mathrm{sigm}(W_i x + c_i)

By symmetry,

    P(x_j = 1 | h) = \mathrm{sigm}(W_j^T h + b_j),

where W_j is the j-th column of W. Now

    P(x) = \sum_h P(x, h)
         = \frac{e^{b^T x}}{Z} \prod_i \sum_{h_i} \exp(h_i W_i x + c_i h_i)
         = \frac{e^{b^T x}}{Z} \prod_i \big[ 1 + \exp(W_i x + c_i) \big]

         = \frac{e^{b^T x}}{Z} \exp\Big( \sum_i \log\big(1 + \exp(W_i x + c_i)\big) \Big)
         = \frac{1}{Z} \exp\Big( b^T x + \sum_i \log\big(1 + \exp(W_i x + c_i)\big) \Big)
         = \frac{1}{Z} \exp\Big( b^T x + \sum_i \mathrm{softplus}(W_i x + c_i) \Big),

where \mathrm{softplus}(t) = \log(1 + e^t).
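The sketch below (not from the lecture, with arbitrary parameters) verifies both closed forms numerically: the sigmoid expression for P(h_i = 1 | x) and the softplus expression for P(x), each against brute-force enumeration.

```python
# A minimal sketch (not from the lecture): verify the sigmoid conditional and
# the softplus form of P(x) against brute-force enumeration for a tiny binary
# RBM with arbitrary illustrative parameters.
import itertools
import numpy as np

sigm = lambda t: 1.0 / (1.0 + np.exp(-t))
softplus = lambda t: np.log1p(np.exp(t))

d, n = 3, 2
rng = np.random.default_rng(2)
W, b, c = rng.normal(size=(n, d)), rng.normal(size=d), rng.normal(size=n)

X = [np.array(v) for v in itertools.product([0, 1], repeat=d)]
H = [np.array(v) for v in itertools.product([0, 1], repeat=n)]
unnorm = lambda x, h: np.exp(h @ W @ x + b @ x + c @ h)   # = exp(-E(x, h))
Z = sum(unnorm(x, h) for x in X for h in H)

# P(h_i = 1 | x) = sigm(W_i x + c_i)
for x in X:
    for i in range(n):
        brute = (sum(unnorm(x, h) for h in H if h[i] == 1) /
                 sum(unnorm(x, h) for h in H))
        assert abs(brute - sigm(W[i] @ x + c[i])) < 1e-10

# P(x) = (1/Z) exp(b^T x + sum_i softplus(W_i x + c_i))
for x in X:
    brute = sum(unnorm(x, h) for h in H) / Z
    closed = np.exp(b @ x + sum(softplus(W[i] @ x + c[i]) for i in range(n))) / Z
    assert abs(brute - closed) < 1e-10

print("sigmoid conditionals and softplus marginal confirmed numerically")
```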

Example: Ising model

x_i \in \{-1, 1\}

Configuration x = \{x_1, \dots, x_i, \dots, x_n\}. There are 2^n configurations.

Hamiltonian (energy)

    H = -\sum_{i \sim j} h_{ij} x_i x_j - \sum_i b_i x_i

where i \sim j means the sum is over adjacent nodes i and j with i < j.

Probability of configuration x

    P(x) \propto \exp(-\lambda H)

    \lambda = \frac{1}{k_B T}

where k_B is the Boltzmann constant (usually set to 1) and T is the temperature.

If most x_i are aligned in the same direction, the energy (Hamiltonian) tends to be smaller (for positive couplings h_{ij}), and thus the probability is higher. The Ising model is an idealized model of a magnet.

Partition function

    Z = \sum_x \exp(-\lambda H)

Thus

    P(x) = \frac{1}{Z} \exp(-\lambda H)
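As a small illustration (not from the lecture), the sketch below computes Z by brute force for a 6-spin Ising chain with an illustrative coupling and field, and confirms that an aligned configuration is more probable than an alternating one; the 2^n enumeration already hints at why this does not scale.

```python
# A minimal sketch (not from the lecture): brute-force partition function of a
# small 1-D Ising chain.  The coupling J (playing the role of h_ij) and the
# field strength are illustrative; lambda = 1 corresponds to k_B = T = 1.
import itertools
import numpy as np

n = 6                                   # 2^6 = 64 configurations
J = 1.0                                 # coupling h_ij > 0 (ferromagnetic)
field = 0.2                             # uniform external field b_i
lam = 1.0                               # 1 / (k_B T)

def hamiltonian(x):
    # H(x) = - sum_{i~j} h_ij x_i x_j - sum_i b_i x_i   (chain: i ~ i+1)
    return (-J * sum(x[i] * x[i + 1] for i in range(n - 1))
            - field * sum(x))

configs = list(itertools.product([-1, 1], repeat=n))
Z = sum(np.exp(-lam * hamiltonian(x)) for x in configs)

def prob(x):
    return np.exp(-lam * hamiltonian(x)) / Z

aligned = (1,) * n
alternating = tuple(1 if i % 2 == 0 else -1 for i in range(n))
print(f"P(aligned)     = {prob(aligned):.4f}")      # higher probability
print(f"P(alternating) = {prob(alternating):.6f}")  # much lower probability
```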

Due to the large number of configurations (2^n), it is impractical to compute Z exactly.