Machine Learning Summer School


Machine Learning Summer School
Lecture 3: Learning parameters and structure

Zoubin Ghahramani
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin/

Department of Engineering, University of Cambridge, UK
Machine Learning Department, Carnegie Mellon University, USA

August 2009

Learning parameters

(Figure: a four-node DAG with edges x_1 → x_2, x_1 → x_3, x_2 → x_4.)

p(x) = p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2)

Example CPT θ_2 (rows indexed by x_1, columns by x_2):

         x_2=1  x_2=2  x_2=3
x_1=1     0.2    0.3    0.5
x_1=2     0.1    0.6    0.3

Assume each variable x_i is discrete and can take on K_i values. The parameters of this model can be represented as 4 tables: θ_1 has K_1 entries, θ_2 has K_1 × K_2 entries, etc. These are called conditional probability tables (CPTs), with the following semantics:

p(x_1 = k) = θ_{1,k}        p(x_2 = k' | x_1 = k) = θ_{2,k,k'}

If node i has M parents, θ_i can be represented either as an (M+1)-dimensional table, or as a 2-dimensional table with (∏_{j ∈ pa(i)} K_j) × K_i entries, by collapsing all the states of the parents of node i. Note that Σ_{k'} θ_{i,k,k'} = 1.

Assume a data set D = {x^(n)}_{n=1}^N. How do we learn θ from D?

Learning parameters

(Figure: the same DAG, x_1 → x_2, x_1 → x_3, x_2 → x_4.)

Assume a data set D = {x^(n)}_{n=1}^N. How do we learn θ from D?

Likelihood:

p(x|θ) = p(x_1|θ_1) p(x_2|x_1, θ_2) p(x_3|x_1, θ_3) p(x_4|x_2, θ_4)

p(D|θ) = ∏_{n=1}^N p(x^(n)|θ)

Log likelihood:

log p(D|θ) = Σ_{n=1}^N Σ_i log p(x_i^(n) | x_pa(i)^(n), θ_i)

This decomposes into a sum of functions of the θ_i. Each θ_i can be optimized separately:

θ̂_{i,k,k'} = n_{i,k,k'} / Σ_{k''} n_{i,k,k''}

where n_{i,k,k'} is the number of times in D where x_i = k' and x_pa(i) = k, and k indexes a joint configuration of all the parents of i (i.e. it takes on one of ∏_{j ∈ pa(i)} K_j values).

ML solution: simply calculate frequencies!

Example for θ_2 (rows indexed by x_1, columns by x_2):

counts n_2:                        ML estimate θ̂_2:
         x_2=1  x_2=2  x_2=3                x_2=1  x_2=2  x_2=3
x_1=1      2      3      0         x_1=1     0.4    0.6    0
x_1=2      3      1      6         x_1=2     0.3    0.1    0.6
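
As a concrete illustration of the "simply calculate frequencies" ML solution, here is a minimal sketch (my own addition, not from the lecture) that estimates the CPT of one child variable from integer-coded data; the function and variable names are made up for the example.

import numpy as np

def ml_cpt(child, parent, K_child, K_parent):
    # ML estimate of p(child | parent) from paired integer-coded samples.
    counts = np.zeros((K_parent, K_child))
    for p, c in zip(parent, child):
        counts[p, c] += 1
    # Normalize each row (one row per parent configuration).
    return counts / counts.sum(axis=1, keepdims=True)

# Toy data chosen to reproduce the counts on the slide: n_2 = [[2,3,0],[3,1,6]]
x1 = np.array([0]*5 + [1]*10)
x2 = np.array([0,0,1,1,1] + [0,0,0,1,2,2,2,2,2,2])
print(ml_cpt(x2, x1, K_child=3, K_parent=2))
# -> [[0.4 0.6 0. ]
#     [0.3 0.1 0.6]]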

Deriving the Maximum Likelihood Estimate

(Figure: x → y, with parameter θ attached to the CPT of y.)

p(y|x, θ) = ∏_{k,l} θ_{k,l}^{δ(x,k) δ(y,l)}

Dataset D = {(x^(n), y^(n)) : n = 1, ..., N}

L(θ) = log ∏_n p(y^(n)|x^(n), θ)
     = log ∏_n ∏_{k,l} θ_{k,l}^{δ(x^(n),k) δ(y^(n),l)}
     = Σ_{n,k,l} δ(x^(n),k) δ(y^(n),l) log θ_{k,l}
     = Σ_{k,l} ( Σ_n δ(x^(n),k) δ(y^(n),l) ) log θ_{k,l}
     = Σ_{k,l} n_{k,l} log θ_{k,l}

Maximize L(θ) w.r.t. θ subject to Σ_l θ_{k,l} = 1 for all k.
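
The slide stops at the constrained objective; the standard Lagrange-multiplier step that finishes the derivation (my addition, not in the transcription) goes as follows:

\mathcal{L}(\theta, \lambda) = \sum_{k,l} n_{k,l} \log \theta_{k,l} + \sum_k \lambda_k \Big( 1 - \sum_l \theta_{k,l} \Big)

\frac{\partial \mathcal{L}}{\partial \theta_{k,l}} = \frac{n_{k,l}}{\theta_{k,l}} - \lambda_k = 0
\;\Rightarrow\; \theta_{k,l} = \frac{n_{k,l}}{\lambda_k},
\qquad \sum_l \theta_{k,l} = 1 \;\Rightarrow\; \lambda_k = \sum_{l'} n_{k,l'}

\hat{\theta}_{k,l} = \frac{n_{k,l}}{\sum_{l'} n_{k,l'}}

i.e. the observed conditional frequencies, matching the "simply calculate frequencies" rule of the previous slide.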

Maximum Likelihood Learning with Hidden Variables

(Figure: hidden variables X_1, X_2, X_3 and observed variable Y, with parameters θ_1, ..., θ_4 attached to the CPTs.)

Assume a model parameterised by θ with observable variables Y and hidden variables X.

Goal: maximize the parameter log likelihood given the observed data:

L(θ) = log p(Y|θ) = log Σ_X p(Y, X|θ)

Maximum Likelihood Learning with Hidden Variables: The EM Algorithm

(Figure: the same model, with hidden X_1, X_2, X_3 and observed Y.)

Goal: maximise the parameter log likelihood given the observables.

L(θ) = log p(Y|θ) = log Σ_X p(Y, X|θ)

The Expectation Maximization (EM) algorithm (intuition): iterate between applying the following two steps:

The E step: fill in the hidden/missing variables.
The M step: apply complete-data learning to the filled-in data.

Maximum Likelihood Learning with Hidden Variables: The EM Algorithm

Goal: maximise the parameter log likelihood given the observables.

L(θ) = log p(Y|θ) = log Σ_X p(Y, X|θ)

The EM algorithm (derivation): for any distribution q(X) over the hidden variables,

L(θ) = log Σ_X q(X) [ p(Y, X|θ) / q(X) ]  ≥  Σ_X q(X) log [ p(Y, X|θ) / q(X) ]  =  F(q(X), θ)

by Jensen's inequality.

The E step: maximize F(q(X), θ^[t]) w.r.t. q(X), holding θ^[t] fixed:

    q(X) = p(X|Y, θ^[t])

The M step: maximize F(q(X), θ) w.r.t. θ, holding q(X) fixed:

    θ^[t+1] ← argmax_θ Σ_X q(X) log p(Y, X|θ)

The E step requires solving the inference problem: finding the distribution over the hidden variables, p(X|Y, θ^[t]), given the current model parameters. This can be done using belief propagation or the junction tree algorithm.
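
A standard identity (my addition, not spelled out in the transcription) explains why the E-step choice q(X) = p(X|Y, θ^[t]) is optimal: the lower bound differs from the log likelihood by exactly a KL divergence,

F(q, \theta) = \sum_X q(X) \log \frac{p(Y, X \mid \theta)}{q(X)}
             = \log p(Y \mid \theta) - \mathrm{KL}\big( q(X) \,\|\, p(X \mid Y, \theta) \big),

so, for fixed θ, F is maximized (and the bound becomes tight) when q(X) equals the posterior p(X|Y, θ).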

Maximum Likelihood Learning without and with Hidden Variables

ML Learning with Complete Data (No Hidden Variables)

The log likelihood decomposes into a sum of functions of the θ_i. Each θ_i can be optimized separately:

θ̂_ijk = n_ijk / Σ_{k'} n_ijk'

where n_ijk is the number of times in D where x_i = k and x_pa(i) = j.

Maximum likelihood solution: simply calculate frequencies!

ML Learning with Incomplete Data (i.e. with Hidden Variables)

Iterative EM algorithm:

E step: compute the expected counts given the previous settings of the parameters, E[n_ijk | D, θ^[t]].

M step: re-estimate the parameters using these expected counts:

θ^[t+1]_ijk ← E[n_ijk | D, θ^[t]] / Σ_{k'} E[n_ijk' | D, θ^[t]]
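
To make the E-step/M-step loop concrete, here is a small self-contained sketch (my own illustration, not from the lecture) of EM for the simplest hidden-variable network: a single hidden discrete node z with one observed child x, i.e. a mixture of discrete distributions. All names and the toy data are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data (my own example): x in {0,1,2}, generated from a 2-component mixture; z is hidden.
true_pz = np.array([0.6, 0.4])
true_px_z = np.array([[0.7, 0.2, 0.1],    # p(x | z=0)
                      [0.1, 0.2, 0.7]])   # p(x | z=1)
z = rng.choice(2, size=2000, p=true_pz)
x = np.array([rng.choice(3, p=true_px_z[zi]) for zi in z])

# Random initialisation of the parameters.
pz = np.array([0.5, 0.5])
px_z = rng.dirichlet(np.ones(3), size=2)

for it in range(100):
    # E step: responsibilities r[n, z] = p(z | x^(n), theta^[t]).
    r = pz * px_z[:, x].T               # shape (N, 2), unnormalised
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the expected counts.
    pz = r.mean(axis=0)
    for k in range(3):
        px_z[:, k] = r[x == k].sum(axis=0)
    px_z /= px_z.sum(axis=1, keepdims=True)

print(np.round(pz, 2), np.round(px_z, 2))  # close to the true values (up to label swapping)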

Bayesian Learning

Apply the basic rules of probability to learning from data.

Data set: D = {x_1, ..., x_n}        Models: m, m', etc.        Model parameters: θ
Prior probability of models: P(m), P(m'), etc.
Prior probabilities of model parameters: P(θ|m)
Model of data given parameters (likelihood model): P(x|θ, m)

If the data are independently and identically distributed, then:

P(D|θ, m) = ∏_{i=1}^n P(x_i|θ, m)

Posterior probability of model parameters:

P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)

Posterior probability of models:

P(m|D) = P(m) P(D|m) / P(D)

Bayesian parameter learning with no hidden variables

Let n_ijk be the number of times (x_i^(n) = k and x_pa(i)^(n) = j) in D. For each i and j, θ_ij is a probability vector of length K_i.

Since x_i is a discrete variable with probabilities given by θ_{i,j,·}, the likelihood is:

p(D|θ) = ∏_n ∏_i p(x_i^(n) | x_pa(i)^(n), θ) = ∏_i ∏_j ∏_k θ_ijk^{n_ijk}

If we choose a prior on θ of the form:

p(θ) = c ∏_i ∏_j ∏_k θ_ijk^{α_ijk − 1}

where c is a normalization constant and Σ_k θ_ijk = 1 for all i, j, then the posterior distribution has the same form:

p(θ|D) = c' ∏_i ∏_j ∏_k θ_ijk^{α̃_ijk − 1}

where α̃_ijk = α_ijk + n_ijk. This distribution is called the Dirichlet distribution.

Dirichlet Distribution

The Dirichlet distribution is a distribution over the K-dimensional probability simplex. Let θ be a K-dimensional vector such that θ_j ≥ 0 for all j and Σ_{j=1}^K θ_j = 1.

p(θ|α) = Dir(α_1, ..., α_K) ≝ [ Γ(Σ_j α_j) / ∏_j Γ(α_j) ] ∏_{j=1}^K θ_j^{α_j − 1}

where the first term is a normalization constant¹ and E(θ_j) = α_j / Σ_k α_k.

The Dirichlet is conjugate to the multinomial distribution. Let

x | θ ~ Multinomial(·|θ),    that is, p(x = j|θ) = θ_j.

Then the posterior is also Dirichlet:

p(θ|x = j, α) = p(x = j|θ) p(θ|α) / p(x = j|α) = Dir(α̃)

where α̃_j = α_j + 1, and α̃_l = α_l for l ≠ j.

¹ Γ(x) = (x − 1) Γ(x − 1) = ∫_0^∞ t^{x−1} e^{−t} dt. For integer n, Γ(n) = (n − 1)!.
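
As a quick numerical sanity check (my own addition, not from the lecture), one can verify the simplex constraint and the stated mean E(θ_j) = α_j / Σ_k α_k by Monte Carlo, using numpy's Dirichlet sampler; the α values are arbitrary.

import numpy as np

alpha = np.array([3.0, 1.0, 6.0])                       # made-up hyperparameters
samples = np.random.default_rng(0).dirichlet(alpha, size=200000)

print(samples.sum(axis=1)[:3])     # each sample lies on the simplex (sums to 1)
print(samples.mean(axis=0))        # approximately [0.3, 0.1, 0.6]
print(alpha / alpha.sum())         # exact mean alpha_j / sum_k alpha_k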

Dirichlet Distributions

Examples of Dirichlet distributions over θ = (θ_1, θ_2, θ_3), which can be plotted in 2D since θ_3 = 1 − θ_1 − θ_2. (The plots are not reproduced in this transcription.)

Example

Assume α_ijk = 1 for all i, j, k. This corresponds to a uniform prior distribution over the parameters θ. This is not a very strong/dogmatic prior, since any parameter setting is assumed a priori possible.

After observing data D, what are the parameter posterior distributions?

p(θ_ij|D) = Dir(n_ij + 1)

(where n_ij denotes the vector of counts (n_ij1, ..., n_ijK_i)). This distribution predicts, for future data:

p(x_i = k | x_pa(i) = j, D) = (n_ijk + 1) / Σ_{k'} (n_ijk' + 1)

Adding 1 to each of the counts is a form of smoothing called Laplace's Rule.
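
Continuing the counting example from the ML slide, a minimal sketch (my own, with made-up function names) of the Bayesian predictive under this uniform Dirichlet prior is just add-one smoothing of the count table:

import numpy as np

def laplace_predictive(counts):
    # p(x_i = k | x_pa(i) = j, D) under a uniform Dirichlet (alpha_ijk = 1) prior.
    # counts[j, k] = n_ijk, the number of times x_i = k with the parents in state j.
    smoothed = counts + 1.0
    return smoothed / smoothed.sum(axis=1, keepdims=True)

n2 = np.array([[2, 3, 0],
               [3, 1, 6]])            # the counts from the ML slide
print(laplace_predictive(n2))
# Unlike the ML estimate, no event gets probability exactly 0:
# approximately [[0.375 0.5   0.125]
#                [0.308 0.154 0.538]]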

Bayesian parameter learning with hidden variables

Notation: let D be the observed data set, X the hidden variables, and θ the model parameters. Assume discrete variables and Dirichlet priors on θ.

Goal: infer p(θ|D) = Σ_X p(X, θ|D)

Problem: since (a) p(θ|D) = Σ_X p(θ|X, D) p(X|D), (b) for every way of filling in the missing data, p(θ|X, D) is a Dirichlet distribution, and (c) there are exponentially many ways of filling in X, it follows that p(θ|D) is a mixture of Dirichlets with exponentially many terms!

Solutions:
- Find a single best ("Viterbi") completion of X (Stolcke and Omohundro, 1993)
- Markov chain Monte Carlo methods
- Variational Bayesian (VB) methods (Beal and Ghahramani, 2003)

Summary of parameter learning

              Complete (fully observed) data        Incomplete (hidden/missing) data
ML            calculate frequencies                 EM
Bayesian      update Dirichlet distributions        MCMC / Viterbi / VB

For complete data, Bayesian learning is not more costly than ML.
For incomplete data, VB ≈ EM in time complexity.
Other parameter priors are possible, but the Dirichlet is pretty flexible and intuitive.
For non-discrete data, similar ideas apply, but inference and learning are generally harder.

Structure learning

Given a data set of observations of (A, B, C, D, E), can we learn the structure of the graphical model?

(Figure: four candidate DAGs over the nodes A, B, C, D, E, differing in their edge sets.)

Let m denote the graph structure = the set of edges.

Structure learning

(Figure: the same four candidate DAGs over A, B, C, D, E.)

Constraint-Based Learning: Use statistical tests of marginal and conditional independence. Find the set of DAGs whose d-separation relations match the results of the conditional independence tests.

Score-Based Learning: Use a global score such as the BIC score or the Bayesian marginal likelihood. Find the structures that maximize this score.

Score-based structure learning for complete data

Consider a graphical model with structure m, discrete observed data D, and parameters θ. Assume Dirichlet priors.

The Bayesian marginal likelihood score is easy to compute:

score(m) = log p(D|m) = log ∫ p(D|θ, m) p(θ|m) dθ

score(m) = Σ_i Σ_j [ log Γ(Σ_k α_ijk) − Σ_k log Γ(α_ijk) − log Γ(Σ_k α̃_ijk) + Σ_k log Γ(α̃_ijk) ]

where α̃_ijk = α_ijk + n_ijk. Note that the score decomposes over i.

One can incorporate structure prior information p(m) as well:

score(m) = log p(D|m) + log p(m)

Greedy search algorithm: Start with m. Consider modifications m → m' (edge deletions, additions, reversals). Accept m' if score(m') > score(m). Repeat.

Bayesian inference of model structure: Run MCMC on m.
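
Here is a small sketch (my own illustration, not from the lecture) of the decomposed score for a single family (one node i, with rows indexed by the parent configuration j), using scipy's gammaln; the function name and the uniform prior are made up for the example.

import numpy as np
from scipy.special import gammaln

def family_log_marginal_likelihood(counts, alpha):
    # log marginal likelihood contribution of one node:
    # counts[j, k] = n_ijk, alpha[j, k] = alpha_ijk.
    # Implements sum_j [ log G(sum_k a_jk) - sum_k log G(a_jk)
    #                    - log G(sum_k (a_jk + n_jk)) + sum_k log G(a_jk + n_jk) ].
    a, n = np.asarray(alpha, float), np.asarray(counts, float)
    return np.sum(gammaln(a.sum(axis=1)) - gammaln((a + n).sum(axis=1))
                  + np.sum(gammaln(a + n) - gammaln(a), axis=1))

# Counts from the earlier slide, with a uniform alpha_ijk = 1 prior:
n2 = np.array([[2, 3, 0], [3, 1, 6]])
print(family_log_marginal_likelihood(n2, np.ones_like(n2)))

The full score(m) is the sum of such family terms over all nodes i (plus log p(m) if a structure prior is used); the greedy search above only needs to recompute the terms for the families affected by each candidate edge change.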

Bayesian Structural EM for incomplete data

Consider a graphical model with structure m, observed data D, hidden variables X, and parameters θ.

The Bayesian score is generally intractable to compute:

score(m) = p(D|m) = ∫ Σ_X p(X, θ, D|m) dθ

Bayesian Structural EM (Friedman, 1998):

1. compute MAP parameters θ̂ for the current model m using EM
2. find the hidden variable distribution p(X|D, θ̂)
3. for a small set of candidate structures m', compute or approximate
   score(m') = Σ_X p(X|D, θ̂) log p(D, X|m')
4. m ← the m' with the highest score

Directed Graphical Models and Causality

Causal relationships are a fundamental component of cognition and scientific discovery.

Even though the independence relations are identical, there is a causal difference between

smoking → yellow teeth        and        yellow teeth → smoking

Key idea: interventions and the do-calculus:

p(S|Y = y) ≠ p(S|do(Y = y))        p(Y|S = s) = p(Y|do(S = s))

Causal relationships are robust to interventions on the parents.

The key difficulty in learning causal relationships from observational data is the presence of hidden common causes:

(Figure: three alternatives over A and B: A ← H → B with hidden common cause H; A → B; A ← B.)

Learning parameters and structure in undirected graphs

(Figure: two undirected graphs over A, B, C, D, E.)

p(x|θ) = (1/Z(θ)) ∏_j g_j(x_Cj; θ_j)        where Z(θ) = Σ_x ∏_j g_j(x_Cj; θ_j)

Problem: computing Z(θ) is computationally intractable for general (non-tree-structured) undirected models. Therefore, maximum-likelihood learning of parameters is generally intractable, Bayesian scoring of structures is intractable, etc.

Solutions:
- directly approximate Z(θ) and/or its derivatives (cf. Boltzmann machine learning; contrastive divergence; pseudo-likelihood)
- use approximate inference methods (e.g. loopy belief propagation, bounding methods, EP)

See (Murray and Ghahramani, 2004; Murray et al., 2006) for Bayesian learning in undirected models.
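
To see why Z(θ) is the bottleneck, here is a brute-force sketch (my own, for illustration only) that computes the log partition function of a small binary pairwise model by exhaustive enumeration; the 2^n sum is exactly what becomes infeasible on larger graphs. The model and names are assumptions made for the example.

import itertools
import numpy as np

def log_partition(W, b):
    # Brute-force log Z for a binary pairwise model:
    # p(x) proportional to exp( sum_{i<j} W[i,j] x_i x_j + sum_i b[i] x_i ), x_i in {0,1}.
    # Cost is O(2^n): feasible only for tiny n.
    n = len(b)
    log_potentials = [x @ b + x @ np.triu(W, 1) @ x
                      for x in map(np.array, itertools.product([0, 1], repeat=n))]
    return np.logaddexp.reduce(log_potentials)

rng = np.random.default_rng(0)
n = 10
W = rng.normal(size=(n, n)); W = (W + W.T) / 2
b = rng.normal(size=n)
print(log_partition(W, b))   # 2^10 = 1024 terms; already hopeless for n in the hundreds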

Summary

- Parameter learning in directed models: complete and incomplete data; ML and Bayesian methods
- Structure learning in directed models: complete and incomplete data
- Causality
- Parameter and structure learning in undirected models

Readings and References

Beal, M.J. and Ghahramani, Z. (2006) Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis 1(4):793-832. http://learning.eng.cam.ac.uk/zoubin/papers/beagha06.pdf

Friedman, N. (1998) The Bayesian structural EM algorithm. In Uncertainty in Artificial Intelligence (UAI-1998). http://robotics.stanford.edu/~nir/papers/fr2.pdf

Friedman, N. and Koller, D. (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50(1):95-125. http://www.springerlink.com/index/nq13817217667435.pdf

Ghahramani, Z. (2004) Unsupervised learning. In Bousquet, O., von Luxburg, U. and Raetsch, G. (eds), Advanced Lectures in Machine Learning, 72-112. http://learning.eng.cam.ac.uk/zoubin/papers/ul.pdf

Heckerman, D. (1995) A tutorial on learning with Bayesian networks. In Learning in Graphical Models. http://research.microsoft.com/pubs/69588/tr-95-06.pdf

Murray, I.A., Ghahramani, Z. and MacKay, D.J.C. (2006) MCMC for doubly-intractable distributions. In Uncertainty in Artificial Intelligence (UAI-2006). http://learning.eng.cam.ac.uk/zoubin/papers/doubly_intractable.pdf

Murray, I.A. and Ghahramani, Z. (2004) Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Uncertainty in Artificial Intelligence (UAI-2004). http://learning.eng.cam.ac.uk/zoubin/papers/uai04murray.pdf

Stolcke, A. and Omohundro, S. (1993) Hidden Markov model induction by Bayesian model merging. In Advances in Neural Information Processing Systems (NIPS). http://omohundro.files.wordpress.com/2009/03/stolcke_omohundro93_hmm_induction_bayesian_model_merging.pdf