Machine learning: lecture 20. Tommi S. Jaakkola, MIT CSAIL (tommi@csail.mit.edu)

Topics: representation and graphical models (examples); Bayesian networks: examples and specification, graphs and independence, the associated distribution.

What is a good representation? Properties of good representations: 1. explicit, 2. modular, 3. permits efficient computation, 4. etc.

Representation: explicit. A model can be represented in terms of variables and their dependencies (a graphical model), e.g., a chain s_1 -> s_2 -> s_3 -> s_4, or in terms of state transitions (a transition diagram) with parameters P(s_1), P(s_2 | s_1), P(s_3 | s_2), and so on.
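Written out, the chain encodes the standard first-order Markov factorization:

$$P(s_1, s_2, s_3, s_4) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_2)\, P(s_4 \mid s_3).$$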

Representation: modular. We can easily add or remove components of the model. A Markov model is a chain s_1 -> s_2 -> s_3 -> s_4; a hidden Markov model attaches an observation x_t to each state s_t (x_1, x_2, x_3, x_4).

Representation: efficient computation. The chain structure of an HMM supports efficient inference: posterior marginals via the forward-backward algorithm, and max-probabilities via the Viterbi algorithm.
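As a minimal sketch of the kind of computation the chain structure makes efficient, here is the forward pass of an HMM in Python. The two-state transition matrix, emission matrix, and initial distribution are illustrative assumptions, not numbers from the lecture.

```python
import numpy as np

# Illustrative HMM parameters (assumed): 2 hidden states, 3 symbols.
pi = np.array([0.6, 0.4])                 # P(s_1)
A = np.array([[0.7, 0.3],                 # A[i, j] = P(s_{t+1}=j | s_t=i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # B[i, k] = P(x_t=k | s_t=i)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Forward recursion: alpha[t, i] = P(x_1..x_t, s_t = i)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

obs = [0, 2, 1]                           # an observation sequence
alpha = forward(obs)
print("P(x_1..x_T) =", alpha[-1].sum())   # likelihood of the sequence
```

The cost is linear in the sequence length, which is exactly what the graph structure buys over summing the joint by brute force.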

Graphical models: examples. A factorial hidden Markov model as a Bayesian network (directed graphical model): parallel chains of linguistic features jointly generating the acoustic observations.

Graphical models: examples. Plates and repeated sampling (figure: a plate over M documents, each containing words, with per-document topics and a class). Each document has words sampled from a distribution that depends on the choice of topics; the topics for each document are sampled from a class-conditional distribution.
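One way to write the factorization the plate describes (a sketch; the symbols c for class, t for a document's topics, and w for words are my labels, not the slide's):

$$\prod_{d=1}^{M} P(c_d)\, P(t_d \mid c_d) \prod_{n=1}^{N_d} P(w_{d,n} \mid t_d).$$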

Graphical models: examples. Lattice models (e.g., the Ising model) as a Markov random field: spins s_1, s_2, ... on a grid, with symmetric interactions between neighbors (e.g., alignment of two nearby spins is energetically favorable).
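In the usual parameterization (standard Ising notation, not the slide's), the distribution over spins s_i in {-1, +1} is

$$P(s) = \frac{1}{Z}\exp\Big(\sum_{(i,j)\in E} J_{ij}\, s_i s_j + \sum_i h_i s_i\Big),$$

where J_{ij} > 0 makes aligned neighboring spins more probable, which is the "energetically favorable" interaction above.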

Graphical models: examples. Factor graphs and codes (information theory): noisy observations y_1, ..., y_5 of transmitted bits x_1, ..., x_5, with the bits tied together by parity checks. Circles denote variables while the squares are factors (functions) that constrain the values of the variables.
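In general, a factor graph represents a distribution as a normalized product of its factors,

$$P(x) = \frac{1}{Z}\prod_a f_a(x_{\partial a}),$$

where x_{\partial a} are the variables adjacent to factor a. A parity-check factor, for instance, is 1 when its bits have even parity and 0 otherwise.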

Graphical models (summary; figure: thumbnails of the previous examples, the factorial HMM, the document plate, and the factor-graph code). Graph semantics: graph separation properties encode independence. Association with probability distributions: the independencies pick out a family of distributions. Inference and estimation: graph structure enables efficient computation.

Bayesian networks. Bayesian networks are directed acyclic graphs where the nodes represent variables and directed edges capture dependencies. A mixture model as a Bayesian network: i -> x, with joint P(i) p(x | i). Here i is the "parent of x" and x is the "child of i"; the edge can be read as "i influences x", "i causes x", or "x depends on i". The same two levels apply: graph semantics tie graph separation properties to independence, and the independencies pick out a family of distributions.
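A minimal sketch of ancestral sampling in this two-node network: draw the parent i from P(i), then the child x from p(x | i). The particular mixture (two Gaussian components) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed mixture parameters: P(i) and Gaussian components p(x | i).
p_i = np.array([0.3, 0.7])      # mixing weights P(i)
means = np.array([-2.0, 2.0])   # component means
stds = np.array([1.0, 0.5])     # component standard deviations

def sample():
    """Ancestral sampling: draw parent i, then child x given i."""
    i = rng.choice(2, p=p_i)            # i ~ P(i)
    x = rng.normal(means[i], stds[i])   # x ~ p(x | i)
    return i, x

print([sample() for _ in range(5)])
```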

Example: a simple Bayesian network for coin tosses. Two fair coins x_1 and x_2, with P(x_1) = (0.5, 0.5) and P(x_2) = (0.5, 0.5), and a third variable x_3 = "same?" with conditional table

P(x_3 | x_1, x_2):
        hh    ht    th    tt
  y     1.0   0.0   0.0   1.0
  n     0.0   1.0   1.0   0.0

Two levels of description: 1. the graph structure (dependencies, independencies), 2. the associated probability distribution.

Example cont'd. What can the graph alone tell us? In the graph x_1 -> x_3 <- x_2 (x_3 = "same?"), x_1 and x_2 are marginally independent, but x_1 and x_2 become dependent if we know x_3 (the dependence concerns our beliefs about the outcomes).
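A short brute-force check of both claims, using the coin-toss network from the previous slide (the probabilities are from the slide; the encoding is mine):

```python
import itertools

# Coin-toss network: x1, x2 fair coins, x3 = "same?".
P_x1 = {"h": 0.5, "t": 0.5}
P_x2 = {"h": 0.5, "t": 0.5}
P_x3 = lambda x3, x1, x2: float((x1 == x2) == (x3 == "y"))

# Joint distribution, factored as P(x1) P(x2) P(x3 | x1, x2).
joint = {(x1, x2, x3): P_x1[x1] * P_x2[x2] * P_x3(x3, x1, x2)
         for x1, x2, x3 in itertools.product("ht", "ht", "yn")}

def prob(pred, cond=lambda w: True):
    """P(pred | cond) by summing joint entries."""
    den = sum(p for w, p in joint.items() if cond(w))
    num = sum(p for w, p in joint.items() if cond(w) and pred(w))
    return num / den

# Marginally independent: conditioning on x2 does not move P(x1 = h).
print(prob(lambda w: w[0] == "h"))                          # 0.5
print(prob(lambda w: w[0] == "h", lambda w: w[1] == "h"))   # 0.5
# Dependent given x3: once we know x3 = "y", x2 pins down x1.
print(prob(lambda w: w[0] == "h",
           lambda w: w[2] == "y" and w[1] == "h"))          # 1.0
```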

Traffic example. N = X is nice? L = the traffic light. S = X decides to stop? T = the other car turns left? C = crash? If we only know that X decided to stop, can X's character (variable N) tell us anything about the other car turning (variable T)?

Graph, independence, d-separation. Are N and T independent given S? Definition: variables are D-separated given a conditioning set if the set separates them in the moralized ancestral graph. To answer the question we take the original graph, form the ancestral graph (keep only the variables involved and their ancestors), and moralize it (connect parents that share a child and drop the edge directions); N and T are then independent given S exactly when every path between them in the resulting undirected graph passes through S.
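A sketch of the d-separation test as just described: build the ancestral graph, moralize it, and check graph separation. The traffic graph used at the bottom (S with parents N and L, C with parents S and T) is my reading of the figure, so treat it as an assumption.

```python
from collections import deque

def d_separated(parents, x, y, given):
    """d-separation of x and y given a set, via the moralized
    ancestral graph; `parents` maps each node to its DAG parents."""
    # 1. Ancestral graph: keep x, y, the conditioning set, and ancestors.
    keep, stack = set(), list({x, y} | set(given))
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))
    # 2. Moralize: undirected parent-child edges, plus co-parent edges.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = parents.get(v, [])
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):          # marry co-parents
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. Does `given` block every path from x to y in the moral graph?
    frontier, seen = deque([x]), {x} | set(given)
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w == y:
                return False              # found an unblocked path
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return True

# Assumed traffic graph: S has parents N and L; C has parents S and T.
parents = {"S": ["N", "L"], "C": ["S", "T"]}
print(d_separated(parents, "N", "T", {"S"}))  # True: C is not an ancestor
print(d_separated(parents, "N", "T", {"C"}))  # False: moralizing marries S, T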

Graphs and distributions. A graph is a compact representation of a large collection of independence properties. Theorem: any probability distribution that is consistent with a directed graph G has to factor according to "node given parents":

$$P(x \mid G) = \prod_{i=1}^{d} P(x_i \mid x_{\mathrm{pa}_i}),$$

where x_{pa_i} are the parents of x_i and d is the number of nodes (variables) in the graph.

Model: the explaining-away phenomenon. The network: Earthquake -> Alarm <- Burglary, with Earthquake -> Radio report. Evidence and competing causes: observing that the alarm went off makes both causes, earthquake and burglary, more probable. Additional evidence and explaining away: if a radio report then confirms an earthquake, the earthquake explains away the alarm and our belief in a burglary drops back down.
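A brute-force illustration of explaining away on this network. All the numbers are made-up assumptions; the point is only the direction of the change in P(Burglary | evidence).

```python
import itertools

# Earthquake/burglary network (illustrative probabilities, assumed).
P_E = {True: 0.01, False: 0.99}            # earthquake
P_B = {True: 0.02, False: 0.98}            # burglary
P_A = lambda e, b: {                       # P(alarm = True | e, b)
    (True, True): 0.95, (True, False): 0.4,
    (False, True): 0.9, (False, False): 0.01}[(e, b)]
P_R = lambda e: 0.8 if e else 0.001        # P(radio report = True | e)

def posterior_burglary(**evidence):
    """P(B = True | evidence) by brute-force enumeration."""
    num = den = 0.0
    for e, b, a, r in itertools.product([True, False], repeat=4):
        p = (P_E[e] * P_B[b]
             * (P_A(e, b) if a else 1 - P_A(e, b))
             * (P_R(e) if r else 1 - P_R(e)))
        world = {"E": e, "B": b, "A": a, "R": r}
        if all(world[k] == v for k, v in evidence.items()):
            den += p
            if b:
                num += p
    return num / den

print(posterior_burglary(A=True))          # ~0.57: alarm raises P(burglary)
print(posterior_burglary(A=True, R=True))  # ~0.05: radio report explains it away
```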