MATH 829: Introduction to Data Mining and Analysis Graphical Models I

MATH 829: Introduction to Data Mining and Analysis
Graphical Models I

Dominique Guillot
Department of Mathematical Sciences
University of Delaware
May 2, 2016

Independence and conditional independence: motivation

We begin with a classical example (Whittaker, 1990): we study the examination marks of 88 students in five subjects: mechanics, vectors, algebra, analysis, statistics (Mardia, Kent, and Bibby, 1979). Mechanics and vectors were closed-book exams; all the remaining exams were open book.

We can examine the results using a back-to-back stem-and-leaf plot of the algebra and statistics marks. [Stem-and-leaf plot omitted.]

Note: the data appear to be normally distributed.

Example (cont.)

We compute the correlations between the grades of the students:

        mech   vect   alg    anal   stat
mech    1.0
vect    0.55   1.0
alg     0.55   0.61   1.0
anal    0.41   0.49   0.71   1.0
stat    0.39   0.44   0.66   0.61   1.0

We observe that the grades are positively correlated between subjects (a student's performance, good or bad, tends to be consistent across subjects).

We now examine the inverse correlation matrix:

        mech   vect   alg    anal   stat
mech    1.60
vect   -0.56   1.80
alg    -0.51  -0.66   3.04
anal    0.00  -0.15  -1.11   2.18
stat   -0.04  -0.04  -0.86  -0.52   1.92

Example (cont.)

Interpreting the inverse correlation matrix:

- Diagonal entries equal 1/(1 - R^2) and are therefore related to the proportion of variance explained by regressing the variable on the other variables.
- Off-diagonal entries are proportional to the correlation of pairs of variables, given the rest of the variables.

For example, R^2_mech = (1.60 - 1)/1.60 = 37.5%.

For the off-diagonal entries, we first scale the inverse correlation matrix Ω = (ω_ij) by computing ω_ij / √(ω_ii ω_jj):

        mech   vect   alg    anal   stat
mech    1.0
vect   -0.33   1.0
alg    -0.23  -0.28   1.0
anal    0.00  -0.08  -0.43   1.0
stat   -0.02  -0.02  -0.36  -0.25   1.0

The off-diagonal entries of the scaled inverse correlation matrix are the negatives of the conditional correlation coefficients (i.e., the correlation coefficients after conditioning on the rest of the variables).
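As a quick sanity check, here is a minimal NumPy sketch that reproduces these quantities starting from the rounded correlation matrix above; the variable names are mine, and because the inputs are rounded the results only approximately match the slides.

```python
import numpy as np

# Correlation matrix of the five subjects, in the order mech, vect, alg, anal, stat
# (rounded values taken from the table above).
R = np.array([
    [1.00, 0.55, 0.55, 0.41, 0.39],
    [0.55, 1.00, 0.61, 0.49, 0.44],
    [0.55, 0.61, 1.00, 0.71, 0.66],
    [0.41, 0.49, 0.71, 1.00, 0.61],
    [0.39, 0.44, 0.66, 0.61, 1.00],
])

Omega = np.linalg.inv(R)             # inverse correlation matrix (diagonal roughly 1.60, 1.80, ...)

# Diagonal entries are 1/(1 - R_i^2): proportion of variance explained by
# regressing each variable on all the others.
r_squared = 1 - 1 / np.diag(Omega)   # mech entry is roughly 0.375

# Scale Omega so the off-diagonal entries become the negatives of the
# partial correlations (correlations given the remaining variables).
d = np.sqrt(np.diag(Omega))
scaled = Omega / np.outer(d, d)
partial_corr = -scaled + 2 * np.eye(5)   # flip the sign off-diagonal, keep 1's on the diagonal
```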

Example (cont.)

Notation:

- We denote the fact that two random variables X and Y are independent by X ⊥ Y.
- We write X ⊥ Y | {Z_1, ..., Z_n} when X and Y are independent given Z_1, ..., Z_n.
- When the context is clear (i.e., when working with a fixed collection of random variables {X_1, ..., X_n}), we write X_i ⊥ X_j | rest instead of X_i ⊥ X_j | {X_k : 1 ≤ k ≤ n, k ≠ i, j}.

Important: in general, uncorrelated variables are not independent (a small numerical illustration follows below). For the multivariate Gaussian distribution, however, uncorrelated does imply independent.
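To make the warning concrete, here is a tiny illustration of my own (not from the slides): with X standard normal and Y = X², the pair is uncorrelated yet clearly dependent.

```python
import numpy as np

# X standard normal, Y = X^2: Cov(X, Y) = E[X^3] = 0, so X and Y are
# uncorrelated, yet Y is a deterministic function of X (hence dependent).
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])          # close to 0: uncorrelated
print(np.corrcoef(np.abs(x), y)[0, 1])  # far from 0: exposes the dependence
```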

Example (cont.)

We noted before that our data appear to be Gaussian. Therefore it appears that:

1. anal ⊥ mech | rest
2. anal ⊥ vect | rest
3. stat ⊥ mech | rest
4. stat ⊥ vect | rest

We represent these relations using a graph: we put no edge between two variables iff they are conditionally independent (given the rest of the variables). [Figure: the resulting graph has edges mech-vect, mech-alg, vect-alg, alg-anal, alg-stat and anal-stat.] A small sketch of this construction appears below.
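Below is a minimal sketch of how one might build this graph by thresholding the scaled inverse correlation matrix; the threshold value and variable names are my own choices for illustration.

```python
import numpy as np

# Scaled inverse correlation matrix of the exam-marks example (values from the slide above).
subjects = ["mech", "vect", "alg", "anal", "stat"]
scaled = np.array([
    [ 1.00, -0.33, -0.23,  0.00, -0.02],
    [-0.33,  1.00, -0.28, -0.08, -0.02],
    [-0.23, -0.28,  1.00, -0.43, -0.36],
    [ 0.00, -0.08, -0.43,  1.00, -0.25],
    [-0.02, -0.02, -0.36, -0.25,  1.00],
])

# Keep an edge between i and j only when the partial correlation is
# non-negligible (threshold chosen by hand here, just for illustration).
threshold = 0.1
edges = [(subjects[i], subjects[j])
         for i in range(len(subjects)) for j in range(i + 1, len(subjects))
         if abs(scaled[i, j]) > threshold]
print(edges)
# [('mech', 'vect'), ('mech', 'alg'), ('vect', 'alg'),
#  ('alg', 'anal'), ('alg', 'stat'), ('anal', 'stat')]
```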

Independence and factorizations

Graphical models (a.k.a. Markov random fields) are multivariate probability models whose independence structure is characterized by a graph.

Recall: independence of random vectors is characterized by a factorization of their joint density.

Independent variables: for two random vectors X, Y,
X ⊥ Y  ⟺  f_{X,Y}(x, y) = g(x) h(y) for all x, y,
for some functions g, h.

Conditionally independent variables: similarly, for three random vectors X, Y, Z,
X ⊥ Y | Z  ⟺  f_{X,Y,Z}(x, y, z) = g(x, z) h(y, z)
for all x, y and all z for which f_Z(z) > 0.
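As a sanity check of the conditional factorization criterion, here is a discrete toy example of my own (not from the slides): if the joint mass function is proportional to g(x, z)·h(y, z), then X and Y are independent given Z.

```python
import numpy as np

# Build p(x, y, z) proportional to g(x, z) * h(y, z) and verify that
# p(x, y | z) = p(x | z) * p(y | z) for every z.
rng = np.random.default_rng(1)
g = rng.random((3, 4))           # g(x, z): x takes 3 values, z takes 4
h = rng.random((5, 4))           # h(y, z): y takes 5 values

p = np.einsum("xz,yz->xyz", g, h)
p /= p.sum()                     # normalize to a joint distribution p(x, y, z)

p_z = p.sum(axis=(0, 1))         # p(z)
p_xz = p.sum(axis=1)             # p(x, z)
p_yz = p.sum(axis=0)             # p(y, z)

lhs = p / p_z                                          # p(x, y | z)
rhs = np.einsum("xz,yz->xyz", p_xz / p_z, p_yz / p_z)  # p(x | z) * p(y | z)
assert np.allclose(lhs, rhs)     # conditional independence holds
```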

Independence graphs

Let X = (X_1, ..., X_p) be a random vector.

The conditional independence graph of X is the graph G = (V, E) with V = {1, ..., p} and
(i, j) ∉ E  ⟺  X_i ⊥ X_j | rest.

A subset S ⊆ V is said to separate A ⊆ V from B ⊆ V if every path from A to B contains a vertex in S.

Notation: if X = (X_1, ..., X_p) and A ⊆ {1, ..., p}, then X_A := (X_i : i ∈ A).

Theorem (the separation theorem): suppose the density of X is positive and continuous. Let V = A ∪ S ∪ B be a partition of V such that S separates A from B. Then X_A ⊥ X_B | X_S.
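Graph separation itself is purely combinatorial. A minimal sketch (the helper name and the exam-marks adjacency list are my own reconstruction) checks it by removing S and testing whether A can still reach B:

```python
from collections import deque

def separates(adj, A, B, S):
    """Return True iff S separates A from B in the undirected graph adj.

    adj maps each vertex to the set of its neighbours.  S separates A from B
    exactly when A and B lie in different components once S is removed.
    """
    blocked, targets = set(S), set(B) - set(S)
    frontier = deque(v for v in A if v not in blocked)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in targets:
            return False
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                frontier.append(w)
    return True

# Exam-marks graph reconstructed earlier:
adj = {
    "mech": {"vect", "alg"}, "vect": {"mech", "alg"},
    "alg": {"mech", "vect", "anal", "stat"},
    "anal": {"alg", "stat"}, "stat": {"alg", "anal"},
}
print(separates(adj, {"mech", "vect"}, {"anal", "stat"}, {"alg"}))  # True
```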

Independence graphs (cont.)

Example: X = (X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8, X_9). [Figure: a graph on the vertices 1, ..., 9 in which {5, 6} separates {1, 2, 3, 4} from {7, 8, 9}.]

Then (X_1, X_2, X_3, X_4) ⊥ (X_7, X_8, X_9) | (X_5, X_6).

Markov properties

Let X = (X_1, ..., X_p) be a random vector and let G = (V, E) be a graph on V = {1, ..., p}. The vector is said to satisfy:

1. The pairwise Markov property if X_i ⊥ X_j | rest whenever (i, j) ∉ E.
2. The local Markov property if, for every vertex i ∈ V, X_i ⊥ X_{V \ cl(i)} | X_{ne(i)}, where ne(i) = {j ∈ V : (i, j) ∈ E, j ≠ i} and cl(i) = {i} ∪ ne(i).
3. The global Markov property if, for all disjoint subsets A, S, B ⊆ V such that S separates A from B in G, we have X_A ⊥ X_B | X_S.

Clearly, global ⟹ local ⟹ pairwise.

When X has a positive and continuous density, the separation theorem gives pairwise ⟹ global, and so all three properties are equivalent.

Example: the local Markov property

Illustration of the local Markov property: [Figure omitted.]
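In place of the original figure, here is a small illustration of the local Markov property on the exam-marks graph reconstructed earlier (edges assumed as above): each variable is conditionally independent of its non-neighbours given its neighbours.

```python
# Local Markov property on the reconstructed exam-marks graph:
# X_i is independent of the non-neighbours of i given the neighbours ne(i).
adj = {
    "mech": {"vect", "alg"}, "vect": {"mech", "alg"},
    "alg": {"mech", "vect", "anal", "stat"},
    "anal": {"alg", "stat"}, "stat": {"alg", "anal"},
}
for i, ne in adj.items():
    rest = set(adj) - ({i} | ne)      # V \ cl(i)
    print(f"{i}  independent of  {sorted(rest)}  given  {sorted(ne)}")
# e.g. mech is independent of ['anal', 'stat'] given ['alg', 'vect']
```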

The Hammersley-Clifford theorem

An undirected graphical model (a.k.a. Markov random field) is a set of random variables satisfying a Markov property with respect to a graph.

Independence and conditional independence correspond to a factorization of the joint density. It is therefore natural to try to characterize Markov properties via a factorization of the joint density.

The Hammersley-Clifford theorem provides a necessary and sufficient condition for a random vector to have a Markov random field structure.

Theorem (Hammersley-Clifford): let X be a random vector with a positive and continuous density f. Then X satisfies the pairwise Markov property with respect to a graph G if and only if
f(x) = ∏_C ψ_C(x_C),
where the product runs over the set of (maximal) cliques C (complete subgraphs) of G.
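To close, a minimal discrete sketch (a toy example of my own, not from the lecture) of the factorization direction: build a density proportional to a product of clique potentials on the chain 1 - 2 - 3 and check the implied Markov statement X_1 ⊥ X_3 | X_2 numerically.

```python
import numpy as np

# Chain graph 1 - 2 - 3: maximal cliques {1, 2} and {2, 3}.
# Build f(x) proportional to psi_12(x1, x2) * psi_23(x2, x3) and verify
# the pairwise Markov statement X1 independent of X3 given X2.
rng = np.random.default_rng(0)
psi_12 = rng.random((2, 3)) + 0.1     # strictly positive clique potentials
psi_23 = rng.random((3, 2)) + 0.1

f = np.einsum("ab,bc->abc", psi_12, psi_23)
f /= f.sum()                          # joint distribution over (x1, x2, x3)

p_2 = f.sum(axis=(0, 2))              # p(x2)
p_12 = f.sum(axis=2)                  # p(x1, x2)
p_23 = f.sum(axis=0)                  # p(x2, x3)

lhs = f / p_2[None, :, None]                                    # p(x1, x3 | x2)
rhs = np.einsum("ab,bc->abc", p_12 / p_2, p_23 / p_2[:, None])  # p(x1|x2) * p(x3|x2)
assert np.allclose(lhs, rhs)          # X1 and X3 are independent given X2
```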