
CPSC 540: Machine Learning Mark Schmidt University of British Columbia Winter 2018

Last Time: Learning and Inference in DAGs We discussed learning in DAG models, log p(X | W) = Σ_{i=1}^n Σ_{j=1}^d log p(x_j^i | x_{pa(j)}^i, w_j), which becomes a supervised learning problem for each feature j. A tabular parameterization is common but requires a small number of parents. Gaussian belief networks use least squares (this defines a multivariate Gaussian). Sigmoid belief networks use logistic regression. For inference in DAGs (decoding, computing marginals, computing conditionals): we can use ancestral sampling to compute Monte Carlo approximations, and we can apply message passing, but the messages may be huge. We only guarantee O(dk^2) cost if each node has at most one parent (a "tree" or "forest").
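
As a concrete illustration of this per-node decomposition, here is a minimal sketch that evaluates log p(X | W) for a sigmoid belief network by summing one logistic log-likelihood per node; the tiny DAG, weights, and data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary DAG on 3 variables with pa(0) = {}, pa(1) = {0}, pa(2) = {0, 1}.
parents = {0: [], 1: [0], 2: [0, 1]}
# w[j] holds one weight per parent of node j plus a bias term.
w = {0: np.array([0.2]), 1: np.array([0.5, -1.0]), 2: np.array([1.0, 0.3, 0.1])}

def log_likelihood(X):
    """log p(X | W) = sum over examples i and nodes j of log p(x_j^i | x_pa(j)^i, w_j)."""
    ll = 0.0
    for j, pa in parents.items():
        Z = np.column_stack([X[:, pa], np.ones(len(X))])  # parent values plus bias
        p1 = sigmoid(Z @ w[j])                            # P(x_j = 1 | parents)
        ll += np.sum(X[:, j] * np.log(p1) + (1 - X[:, j]) * np.log(1 - p1))
    return ll

X = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]])
print(log_likelihood(X))
```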

Conditional Sampling in DAGs What about conditional sampling in DAGs? It can be easy or hard depending on what we condition on. For example, it is easy if we condition on the first variables in the order: just fix these and run ancestral sampling. It is hard to condition on the last variables in the order: conditioning on a descendant makes its ancestors dependent.
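
A minimal sketch of the easy case, assuming a small hypothetical binary DAG with tabular CPDs listed in topological order; conditioning on the first variable just clamps it before ancestral sampling.

```python
import numpy as np

# Hypothetical 3-variable binary DAG in topological order: x1 -> x2 -> x3.
# cpd[j] maps a tuple of parent values to P(x_j = 1 | parents).
parents = {1: (), 2: (1,), 3: (2,)}
cpd = {1: {(): 0.6},
       2: {(0,): 0.2, (1,): 0.9},
       3: {(0,): 0.3, (1,): 0.7}}

def conditional_ancestral_sample(fixed, rng):
    """Sample x1, x2, x3 in order; variables in 'fixed' are clamped.
    This only handles conditioning on the first variables in the order;
    conditioning on later variables requires other inference methods."""
    x = {}
    for j in (1, 2, 3):
        if j in fixed:
            x[j] = fixed[j]
        else:
            p1 = cpd[j][tuple(x[p] for p in parents[j])]
            x[j] = int(rng.random() < p1)
    return x

rng = np.random.default_rng(0)
print([conditional_ancestral_sample({1: 1}, rng) for _ in range(5)])  # condition on x1 = 1
```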

DAG Structure Learning Structure learning is the problem of choosing the graph. The input is data X and the output is a graph G. The easy case is when we're given the ordering of the variables, so the parents of node j must be chosen from {1, 2, ..., j-1}. Given the ordering, structure learning reduces to feature selection: select the features among {x_1, x_2, ..., x_{j-1}} that best predict the "label" x_j. We can use any feature selection method to solve these d problems.
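
A sketch of this reduction, assuming binary data whose columns are already in the given ordering, with L1-regularized logistic regression as the feature selection method for each node (the data, λ, and threshold are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_dag_given_ordering(X, lam=1.0):
    """X is n-by-d binary with columns in the assumed ordering.
    Fit x_j from x_1, ..., x_{j-1} with L1 regularization; nonzero weights become parents."""
    n, d = X.shape
    parents = {0: []}
    for j in range(1, d):
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
        model.fit(X[:, :j], X[:, j])
        parents[j] = [i for i in range(j) if abs(model.coef_[0][i]) > 1e-6]
    return parents

# Hypothetical usage on random binary data.
X = (np.random.default_rng(1).random((500, 5)) < 0.5).astype(int)
print(learn_dag_given_ordering(X, lam=2.0))
```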

Example: Structure Learning in Rain Data Given Ordering Structure learning in the rain data using L1-regularized logistic regression, for different λ values, assuming the chronological ordering.

DAG Structure Learning without an Ordering Without an ordering, a common approach is "search and score": define a score for a particular graph structure (like BIC), and search through the space of possible DAGs (greedily adding/removing/reversing edges). Another common approach is constraint-based methods, based on performing a sequence of conditional independence tests: prune the edge between x_i and x_j if you find a set of variables S making them independent, x_i ⊥ x_j | x_S. This assumes faithfulness (all independences are reflected in the graph); otherwise it is weird (a duplicated feature would be disconnected from everything). Structure learning is NP-hard in general, but finding the optimal tree is poly-time: for symmetric scores it can be done by minimum spanning tree, and for asymmetric scores by minimum spanning arborescence.
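
For the symmetric-score case, here is a Chow-Liu style sketch: score each pair of binary variables by empirical mutual information and take a maximum-weight spanning tree. The data below is hypothetical, and the small Prim implementation is only meant to show the idea.

```python
import numpy as np

def mutual_information(xi, xj):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def optimal_tree(X):
    """Maximum-weight spanning tree on pairwise mutual information (Prim's algorithm)."""
    n, d = X.shape
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            W[i, j] = W[j, i] = mutual_information(X[:, i], X[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j, _ = max(((i, j, W[i, j]) for i in in_tree for j in range(d) if j not in in_tree),
                      key=lambda e: e[2])
        edges.append((i, j))
        in_tree.add(j)
    return edges

X = (np.random.default_rng(2).random((1000, 6)) < 0.5).astype(int)
print(optimal_tree(X))
```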

Structure Learning on USPS Digits Optimal tree on USPS digits. [Figure: the learned tree whose nodes are the 16×16 pixel positions, labeled (1,1) through (16,16).]

20 Newsgroups Data Data containing presence of 100 words from newsgroups posts:

car  drive  files  hockey  mac  league  pc  win
  0      0      1       0    1       0   1    0
  0      0      0       1    0       1   0    1
  1      1      0       0    0       0   0    0
  0      1      1       0    1       0   0    0
  0      0      1       0    0       0   1    1

Structure learning should give the relationships between words.

Structure Learning on News Words Optimal tree on newsgroups data. [Figure: the learned tree over the 100 words, e.g., email, ftp, files, windows, drive, car, space, nasa, hockey, league, baseball, god, bible, israel, government, health, cancer, doctor.]

Outline 1 Structure Learning 2 Undirected Graphical Models

Directed vs. Undirected Models In some applications we have a natural ordering of the x_j: in the rain data, the past affects the future. In some applications we don't have a natural order, e.g., pixels in an image. In these settings we often use undirected graphical models (UGMs), also known as Markov random fields (MRFs) and originally from statistical physics.

Directed vs. Undirected Models Undirected graphical models are based on undirected graphs. They are a classic way to model dependencies in images: they can capture dependencies between neighbours without imposing an ordering.

Ising Models from Statistical Physics The Ising model for binary x_i is defined by p(x_1, x_2, ..., x_d) ∝ exp( Σ_{i=1}^d x_i w_i + Σ_{(i,j)∈E} x_i x_j w_ij ), where E is the set of edges in an undirected graph. This is called a log-linear model, because log p(x) is linear plus a constant. Consider using x_i ∈ {-1, 1}: if w_i > 0 it encourages x_i = 1, and if w_ij > 0 it encourages neighbours i and j to have the same value, e.g., neighbouring pixels in the image receive the same label (an "attractive" model). We're modeling dependencies, but haven't assumed an ordering.
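
To make the definition concrete, here is a minimal sketch that evaluates the unnormalized Ising probability on a hypothetical 2×2 grid and normalizes it by brute-force enumeration (only feasible for tiny models); the weights are illustrative.

```python
import numpy as np
from itertools import product

# Hypothetical 2x2 grid: 4 nodes, edges between grid neighbours.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
w_node = np.zeros(4)              # no bias toward x_i = +1 or x_i = -1
w_edge = {e: 1.0 for e in edges}  # w_ij > 0: "attractive" coupling

def unnormalized_log_prob(x):
    """sum_i x_i w_i + sum_{(i,j) in E} x_i x_j w_ij for x in {-1,+1}^4."""
    return float(x @ w_node) + sum(w_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)

configs = [np.array(c) for c in product([-1, 1], repeat=4)]
logp = np.array([unnormalized_log_prob(x) for x in configs])
p = np.exp(logp - logp.max())
p /= p.sum()
# With attractive coupling, the all-(+1) and all-(-1) configurations get the most mass.
print(configs[int(np.argmax(p))], float(p.max()))
```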

Pairwise undirected graphical models (UGMs) assume p(x) has the form p(x) ∝ ∏_{j=1}^d φ_j(x_j) ∏_{(i,j)∈E} φ_ij(x_i, x_j). The φ_j and φ_ij functions are called potential functions: they can be any non-negative functions. The ordering doesn't matter, which is more natural for things like the pixels of an image. The Ising model is a special case where φ_i(x_i) = exp(x_i w_i) and φ_ij(x_i, x_j) = exp(x_i x_j w_ij). The bonus slides generalize the Ising model to the non-binary case.

Label Propagation as a UGM Consider modeling the probability of a vector of labels ȳ ∈ R^t using p(ȳ_1, ȳ_2, ..., ȳ_t) ∝ exp( -Σ_{i=1}^n Σ_{j=1}^t w_ij (y_i - ȳ_j)^2 - (1/2) Σ_{i=1}^t Σ_{j=1}^t w_ij (ȳ_i - ȳ_j)^2 ). Decoding in this model is equivalent to the label propagation problem. This is a pairwise UGM with φ_j(ȳ_j) = exp( -Σ_{i=1}^n w_ij (y_i - ȳ_j)^2 ) and φ_ij(ȳ_i, ȳ_j) = exp( -(1/2) w_ij (ȳ_i - ȳ_j)^2 ).
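
To sketch why decoding here reduces to label propagation: setting the gradient of the quadratic inside the exponential to zero gives a linear system in ȳ. The code below separates the labeled-to-unlabeled weights W from the unlabeled-to-unlabeled weights Wbar for clarity; all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 4, 3
y = np.array([1.0, 1.0, -1.0, -1.0])  # labels of the n labeled points
W = rng.random((n, t))                 # labeled-to-unlabeled weights
Wbar = rng.random((t, t))              # unlabeled-to-unlabeled weights (symmetrized)
Wbar = (Wbar + Wbar.T) / 2
np.fill_diagonal(Wbar, 0)

def decode(y, W, Wbar):
    """Minimize sum_ij W_ij (y_i - ybar_j)^2 + 0.5 * sum_ij Wbar_ij (ybar_i - ybar_j)^2
    by solving the linear system obtained from setting the gradient to zero."""
    L = np.diag(Wbar.sum(axis=1)) - Wbar   # graph Laplacian of the unlabeled graph
    A = np.diag(W.sum(axis=0)) + L
    return np.linalg.solve(A, W.T @ y)

print(decode(y, W, Wbar))  # decoded (most probable) labels of the t unlabeled points
```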

Conditional Independence in UGMs It's easy to check conditional independence in UGMs: A ⊥ B | C if C blocks all paths from any node in A to any node in B. Example queries for a given graph: A ⊥ C? A ⊥ C | B? A ⊥ C | B, E? A, B ⊥ F | C? A, B ⊥ F | C, E? Each is answered by checking whether the conditioning set blocks every path.
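
A minimal sketch of the separation check, assuming the UGM is given as an adjacency list: remove the conditioning set and test whether A can still reach B (the small chain graph below is hypothetical).

```python
from collections import deque

def separated(adj, A, B, C):
    """Check A ⊥ B | C by graph separation: delete the nodes in C,
    then see whether any node in A can reach any node in B."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False  # found an unblocked path from A to B
        for v in adj[u]:
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Hypothetical chain A - D - B: separated given D, not separated given nothing.
adj = {"A": ["D"], "D": ["A", "B"], "B": ["D"]}
print(separated(adj, {"A"}, {"B"}, {"D"}))  # True
print(separated(adj, {"A"}, {"B"}, set()))  # False
```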

Multivariate Gaussian and Pairwise Graphical Models Independence in the multivariate Gaussian: marginal independence is determined by the covariance, x_i ⊥ x_j if and only if Σ_ij = 0. But how can we determine conditional independence? The multivariate Gaussian is a special case of a pairwise UGM, so we can just use graph separation. In particular, the edges of the UGM are the (i, j) values where Θ_ij ≠ 0, with Θ = Σ^{-1} the precision matrix. We use the term Gaussian graphical model (GGM) in this context, or Gaussian Markov random field (GMRF).

Digression: Gaussian Graphical Models The multivariate Gaussian can be written as p(x) ∝ exp( -(1/2)(x - µ)^T Σ^{-1} (x - µ) ) ∝ exp( -(1/2) x^T Σ^{-1} x + x^T v ), where v = Σ^{-1} µ, and writing it in summation notation we can see that it's a pairwise UGM: p(x) ∝ exp( -(1/2) Σ_{i=1}^d Σ_{j=1}^d x_i x_j Σ^{-1}_ij + Σ_{i=1}^d x_i v_i ) = ∏_{i=1}^d ∏_{j=1}^d exp( -(1/2) x_i x_j Σ^{-1}_ij ) ∏_{i=1}^d exp( x_i v_i ), where the pairwise factors are the φ_ij(x_i, x_j) and the unary factors are the φ_i(x_i).

Independence in GGMs So Gaussians are pairwise UGMs with φ_ij(x_i, x_j) = exp( -(1/2) x_i x_j Θ_ij ), where Θ_ij is element (i, j) of Σ^{-1}. Consider setting Θ_ij = 0: for all (x_i, x_j) we have φ_ij(x_i, x_j) = 1, which is equivalent to not having the edge (i, j). So setting Θ_ij = 0 is equivalent to removing φ_ij(x_i, x_j) from the UGM. Gaussian conditional independence is determined by the sparsity of the precision matrix: a diagonal Θ gives a disconnected graph (all variables are independent), while a full Θ gives a fully-connected graph (no independences).

Independence in GGMs Consider a Gaussian with the following covariance matrix:

Σ =
  0.0494  0.0444  0.0312  0.0034  0.0010
  0.0444  0.1083  0.0761  0.0083  0.0025
  0.0312  0.0761  0.1872  0.0204  0.0062
  0.0034  0.0083  0.0204  0.0528  0.0159
  0.0010  0.0025  0.0062  0.0159  0.2636

All Σ_ij ≠ 0, so all variables are dependent (x_1 and x_2, x_1 and x_5, and so on). This would show up in the graph: you would be able to reach any x_i from any x_j. The inverse, however, is a tri-diagonal matrix:

Σ^{-1} =
  32.0897  13.1740   0        0        0
  13.1740  18.3444   5.2602   0        0
   0        5.2602   7.7173   2.1597   0
   0        0        2.1597  20.1232   1.1670
   0        0        0        1.1670   3.8644

So the conditional independence structure is a Markov chain: p(x_1 | x_2, x_3, x_4, x_5) = p(x_1 | x_2).
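
The same phenomenon can be reproduced numerically with a hypothetical tri-diagonal precision matrix (the values below are illustrative, not the ones above): the covariance is dense, but the zeros of Θ give a chain-structured graph.

```python
import numpy as np

# Hypothetical Gaussian Markov chain on 5 variables: tri-diagonal precision Theta.
Theta = (np.diag(np.full(5, 2.0))
         + np.diag(np.full(4, -0.8), 1)
         + np.diag(np.full(4, -0.8), -1))
Sigma = np.linalg.inv(Theta)

print(np.round(Sigma, 3))  # dense: marginally, every pair of variables is dependent

# Edges of the GGM: nonzero off-diagonal entries of the precision matrix.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5) if abs(Theta[i, j]) > 1e-8]
print(edges)  # [(0, 1), (1, 2), (2, 3), (3, 4)]: a Markov chain
```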

Graphical Lasso Conditional independence in GGMs is described by sparsity in Θ: setting a Θ_ij to 0 removes an edge from the graph. Recall fitting a multivariate Gaussian with L1-regularization, argmin_{Θ ≻ 0} Tr(SΘ) - log|Θ| + λ‖Θ‖_1, which is called the graphical Lasso because it encourages a sparse graph. The graphical Lasso is a convex approach to structure learning for GGMs. Examples: https://normaldeviate.wordpress.com/2012/09/17/high-dimensional-undirected-graphical-models.
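
A minimal sketch of this estimator using scikit-learn's GraphicalLasso on hypothetical data sampled from a Gaussian chain; alpha plays the role of λ above, and the threshold for calling an entry nonzero is illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical data: sample from a Gaussian Markov chain, then try to recover its sparsity.
rng = np.random.default_rng(4)
Theta_true = (np.diag(np.full(5, 2.0))
              + np.diag(np.full(4, -0.8), 1)
              + np.diag(np.full(4, -0.8), -1))
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Theta_true), size=2000)

model = GraphicalLasso(alpha=0.05).fit(X)  # alpha is the L1-regularization strength
Theta_hat = model.precision_

# Estimated graph: nonzero off-diagonal entries of the estimated precision matrix.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5) if abs(Theta_hat[i, j]) > 1e-4]
print(edges)
```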

Higher-Order UGMs In UGMs, we can also define potentials on higher-order interactions. A three-variable generalization of the Ising potentials is φ_ijk(x_i, x_j, x_k) = exp( w_ijk x_i x_j x_k ). If w_ijk > 0 and x_i ∈ {0, 1}, it encourages setting all three variables to 1. If w_ijk > 0 and x_i ∈ {-1, 1}, it encourages an odd number of positives. In the general case, a UGM just assumes that p(x) factorizes over subsets c, p(x_1, x_2, ..., x_d) ∝ ∏_{c∈C} φ_c(x_c), for some collection C of subsets of the variables. In this case the graph has edge (i, j) if i and j appear together in at least one c. Conditional independences are still given by graph separation.

Tractability of UGMs Without using ∝, we write the UGM probability as p(x) = (1/Z) ∏_{c∈C} φ_c(x_c), where Z is the constant that makes the probabilities sum (or integrate) to 1: Z = Σ_{x_1} Σ_{x_2} ··· Σ_{x_d} ∏_{c∈C} φ_c(x_c) in the discrete case, or Z = ∫_{x_1} ∫_{x_2} ··· ∫_{x_d} ∏_{c∈C} φ_c(x_c) dx_d dx_{d-1} ··· dx_1 in the continuous case. Whether you can compute Z depends on the choice of the φ_c. Gaussian case: O(d^3) in general, but O(d) for forests (no loops). Continuous non-Gaussian: usually requires numerical integration. Discrete case: #P-hard in general, but O(dk^2) for forests (no loops).
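
For a tiny discrete model, Z can be computed by the brute-force sum above; here is a sketch with hypothetical potentials on a 3-variable chain (this enumeration is exponential in d, which is exactly why forests and message passing matter).

```python
import numpy as np
from itertools import product

# Hypothetical pairwise UGM: 3 variables with k = 2 states, chain 0 - 1 - 2.
k = 2
phi_node = [np.array([1.0, 2.0]) for _ in range(3)]        # phi_j(x_j)
phi_edge = {(0, 1): np.array([[3.0, 1.0], [1.0, 3.0]]),
            (1, 2): np.array([[3.0, 1.0], [1.0, 3.0]])}    # phi_ij(x_i, x_j)

def unnormalized(x):
    p = 1.0
    for j, xj in enumerate(x):
        p *= phi_node[j][xj]
    for (i, j), phi in phi_edge.items():
        p *= phi[x[i], x[j]]
    return p

# Z sums the unnormalized probability over all k^d configurations.
Z = sum(unnormalized(x) for x in product(range(k), repeat=3))
print(Z, unnormalized((1, 1, 1)) / Z)  # normalized probability of one configuration
```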

Summary Structure learning is the problem of learning the graph structure. It is hard in general, but easy for trees, and L1-regularization gives a fast heuristic. Undirected graphical models factorize the probability into non-negative potentials; Gaussians are a special case, and log-linear models (like the Ising model) are a common choice. They have simple conditional independence properties. Next time: our first visit to the wild world of approximate inference.

General Pairwise UGM For general discrete x_i, a generalization of the Ising model is p(x_1, x_2, ..., x_d) = (1/Z) exp( Σ_{i=1}^d w_{i,x_i} + Σ_{(i,j)∈E} w_{i,j,x_i,x_j} ), which can represent any positive pairwise UGM (meaning p(x) > 0 for all x). Interpretation of the weights for this UGM: if w_{i,1} > w_{i,2} then we prefer x_i = 1 to x_i = 2, and if w_{i,j,1,1} > w_{i,j,2,2} then we prefer (x_i = 1, x_j = 1) to (x_i = 2, x_j = 2). As before, we can use parameter tying: we could use the same w_{i,x_i} for all positions i. The Ising model corresponds to a particular parameter tying of the w_{i,j,x_i,x_j}.

Factor Graphs Factor graphs are a way to visualize UGMs that distinguishes different orders of interaction: circles represent variables and squares represent dependencies, with one square per potential connected to the variables it involves. The factor graph for p(x_1, x_2, x_3) ∝ φ_12(x_1, x_2) φ_13(x_1, x_3) φ_23(x_2, x_3) has three pairwise factor nodes, while the factor graph for p(x_1, x_2, x_3) ∝ φ_123(x_1, x_2, x_3) has a single factor node connected to all three variables.