Bayesian networks Lecture 18 David Sontag New York University
Outline for today
- Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs)
- Bayesian networks
  - Independence properties
  - Examples
  - Learning and inference
Example application: Tracking
We observe noisy radar measurements of a missile's location: Y_1, Y_2, ...
Where is the missile now? Where will it be in 10 seconds?
Probabilistic approach
Our measurements of the missile location were Y_1, Y_2, ..., Y_n
Let X_t be the true <missile location, velocity> at time t
To keep this simple, suppose that everything is discrete; i.e., grid the space so that X_t takes the values 1, ..., k
Probabilistic approach
First, we specify the conditional distribution Pr(X_t | X_{t-1}): from basic physics, we can bound the distance that the missile can have traveled
Then, we specify Pr(Y_t | X_t = <(10,20), 200 mph toward the northeast>): with probability 1/2, Y_t = X_t (ignoring the velocity); otherwise, Y_t is a uniformly chosen grid location
Hidden Markov models (1960s)
Assume that the joint distribution on X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n factors as follows:
Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^n Pr(x_t | x_{t-1}) Pr(y_t | x_t)
To find out where the missile is now, we do marginal inference: Pr(x_n | y_1, ..., y_n)
To find the most likely trajectory, we do MAP (maximum a posteriori) inference: arg max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
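To make the factorization concrete, here is a minimal sketch (not from the lecture) that evaluates the HMM joint probability for discrete states; the parameterization by an initial distribution pi, transition matrix A, and emission matrix B is an assumed representation, not notation from the slides.

```python
import numpy as np

def hmm_log_joint(x, y, pi, A, B):
    """log Pr(x_1..n, y_1..n) under the HMM factorization.
    pi: (k,)  initial distribution, pi[i]  = Pr(x_1 = i)
    A:  (k,k) transition matrix,    A[i,j] = Pr(x_t = j | x_{t-1} = i)
    B:  (k,m) emission matrix,      B[i,o] = Pr(y_t = o | x_t = i)
    """
    # first factor: Pr(x_1) Pr(y_1 | x_1)
    logp = np.log(pi[x[0]]) + np.log(B[x[0], y[0]])
    # remaining factors: Pr(x_t | x_{t-1}) Pr(y_t | x_t) for t = 2..n
    for t in range(1, len(x)):
        logp += np.log(A[x[t - 1], x[t]]) + np.log(B[x[t], y[t]])
    return logp
```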
Inference
Recall, to find out where the missile is now, we do marginal inference: Pr(x_n | y_1, ..., y_n)
How does one compute this? Applying the rule of conditional probability, we have:
Pr(x_n | y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / Pr(y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / \sum_{\hat{x}_n = 1}^k Pr(\hat{x}_n, y_1, ..., y_n)
Naively, this would seem to require k^{n-1} summations:
Pr(x_n, y_1, ..., y_n) = \sum_{x_1, ..., x_{n-1}} Pr(x_1, ..., x_n, y_1, ..., y_n)
Is there a more efficient algorithm?
Marginal inference in HMMs
Use dynamic programming:
Pr(x_n, y_1, ..., y_n)
  = \sum_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, ..., y_n)                                            [marginalization: Pr(A=a) = \sum_b Pr(B=b, A=a)]
  = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, ..., y_{n-1})   [chain rule: Pr(A=a, B=b) = Pr(A=a) Pr(B=b | A=a)]
  = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1})                      [conditional independence in HMMs]
  = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})    [chain rule: Pr(A=a, B=b) = Pr(A=a) Pr(B=b | A=a)]
  = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)             [conditional independence in HMMs]
For n = 1, initialize Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
Total running time is O(nk^2), linear in n! Easy to do filtering
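The recursion above is the classic forward algorithm. A minimal NumPy sketch, reusing the assumed pi/A/B parameterization from the earlier sketch:

```python
import numpy as np

def forward(y, pi, A, B):
    """alpha[t, i] = Pr(x_t = i, y_1, ..., y_t). Returns the full table."""
    n, k = len(y), len(pi)
    alpha = np.zeros((n, k))
    alpha[0] = pi * B[:, y[0]]                 # base case: Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # sum over x_{t-1} (alpha[t-1] @ A), then multiply in the emission
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

def filtering(y, pi, A, B):
    """Pr(x_n | y_1, ..., y_n): normalize the last row of alpha."""
    alpha = forward(y, pi, A, B)
    return alpha[-1] / alpha[-1].sum()
```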
MAP inference in HMMs
MAP inference in HMMs can also be solved in linear time!
arg max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
  = arg max_x Pr(x_1, ..., x_n, y_1, ..., y_n)
  = arg max_x log Pr(x_1, ..., x_n, y_1, ..., y_n)
  = arg max_x [ log(Pr(x_1) Pr(y_1 | x_1)) + \sum_{i=2}^n log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)) ]
Formulate as a shortest paths problem, with a source node s, a sink node t, and k nodes per variable X_1, X_2, ..., X_{n-1}, X_n:
- Weight for edge (s, x_1) is -log[ Pr(x_1) Pr(y_1 | x_1) ]
- Weight for edge (x_{i-1}, x_i) is -log[ Pr(x_i | x_{i-1}) Pr(y_i | x_i) ]
- Weight for edge (x_n, t) is 0
The shortest path from s to t gives the MAP assignment
Called the Viterbi algorithm
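A minimal Viterbi sketch matching this formulation; rather than literally building the shortest-paths graph, it maximizes summed log-probabilities with backpointers, which is equivalent. Parameter names follow the earlier sketches and are assumptions.

```python
import numpy as np

def viterbi(y, pi, A, B):
    """Return arg max_x Pr(x_1..n | y_1..n) for a discrete HMM."""
    n, k = len(y), len(pi)
    delta = np.zeros((n, k))       # best log-score of any path ending in each state
    back = np.zeros((n, k), int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, y[0]])
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i, move to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, y[t]])
    x = np.zeros(n, int)
    x[-1] = delta[-1].argmax()
    for t in range(n - 1, 0, -1):  # trace the backpointers
        x[t - 1] = back[t, x[t]]
    return x
```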
Applications of HMMs
- Speech recognition: predict phonemes from the sounds forming words (i.e., the actual signals)
- Natural language processing: predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence
- Computational biology: predict intron/exon regions from DNA; predict protein structure from DNA (locally)
- And many, many more!
HMMs as a graphical model
We can represent a hidden Markov model with a graph: a chain X_1 → X_2 → X_3 → X_4 → X_5 → X_6, with an edge X_t → Y_t at each step
Shading denotes observed variables (e.g., what is available at test time)
Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^n Pr(x_t | x_{t-1}) Pr(y_t | x_t)
There is a 1-1 mapping between the graph structure and the factorization of the joint distribution
Naïve Bayes as a graphical model
We can represent a naïve Bayes model with a graph: a label node Y with an edge to each feature X_1, X_2, X_3, ..., X_n
Shading denotes observed variables (e.g., what is available at test time)
Pr(y, x_1, ..., x_n) = Pr(y) \prod_{i=1}^n Pr(x_i | y)
There is a 1-1 mapping between the graph structure and the factorization of the joint distribution
Bayesian networks
A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
- One node i for each random variable X_i
- One conditional probability distribution (CPD) per node, p(x_i | x_{Pa(i)}), specifying the variable's probability conditioned on its parents' values
Corresponds 1-1 with a particular factorization of the joint distribution:
p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})
Powerful framework for designing algorithms to perform probability computations
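As a sketch of this 1-1 correspondence, evaluating the joint probability of a full assignment just multiplies one local factor per node; the dictionary-based representation below is an assumption for illustration, not a standard API.

```python
import numpy as np

def bn_log_joint(assignment, parents, cpds):
    """assignment: {node: value}
    parents:    {node: tuple of parent nodes}
    cpds:       {node: function(value, parent_values) -> probability}
    """
    logp = 0.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        logp += np.log(cpds[node](value, parent_values))   # one factor per node
    return logp
```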
The 2011 Turing Award (to Judea Pearl) was for Bayesian networks
Example
Consider the following Bayesian network: Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter, with CPTs:
p(D):  d0 = 0.6,  d1 = 0.4
p(I):  i0 = 0.7,  i1 = 0.3
p(G | I, D):
           g1     g2     g3
  i0,d0    0.3    0.4    0.3
  i0,d1    0.05   0.25   0.7
  i1,d0    0.9    0.08   0.02
  i1,d1    0.5    0.3    0.2
p(L | G):
           l0     l1
  g1       0.1    0.9
  g2       0.4    0.6
  g3       0.99   0.01
p(S | I):
           s0     s1
  i0       0.95   0.05
  i1       0.2    0.8
What is its joint distribution?
p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})
p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
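For example, reading values off the CPTs above, the probability that a smart student takes an easy class, gets an A, does well on the SAT, and receives a strong letter is
p(d0, i1, g1, s1, l1) = p(d0) p(i1) p(g1 | i1, d0) p(s1 | i1) p(l1 | g1) = 0.6 · 0.3 · 0.9 · 0.8 · 0.9 ≈ 0.117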
Example
Consider the same Bayesian network (Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter; CPTs as on the previous slide)
What is this model assuming?
SAT ⊥̸ Grade (marginally, SAT and Grade are dependent)
SAT ⊥ Grade | Intelligence (given Intelligence, they are conditionally independent)
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
Example
Consider the same Bayesian network (CPTs as before)
Compared to a simple log-linear model to predict intelligence:
- Captures non-linearity between grade, course difficulty, and intelligence
- Modular: training data can come from different sources!
- Built-in feature selection: letter of recommendation is irrelevant given grade
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
Bayesian networks enable use of domain knowledge
Will my car start this morning?
p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})
Heckerman et al., Decision-Theoretic Troubleshooting, 1995
Bayesian networks enable use of domain knowledge
What is the differential diagnosis?
p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})
Beinlich et al., The ALARM Monitoring System, 1989
Bayesian networks are generative models
Can sample from the joint distribution, top-down
Suppose Y can be spam or not spam, and X_i is a binary indicator of whether word i is present in the e-mail (the naïve Bayes model: label Y with edges to features X_1, X_2, ..., X_n)
Let's try generating a few emails!
Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
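A minimal sketch of this top-down (ancestral) sampling for the naïve Bayes spam model; the vocabulary and all probability values below are made-up assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["viagra", "free", "meeting", "deadline"]      # hypothetical vocabulary
p_spam = 0.4                                           # assumed Pr(Y = spam)
p_word = {                                             # assumed Pr(X_i = 1 | Y)
    "spam":     np.array([0.6, 0.7, 0.05, 0.02]),
    "not spam": np.array([0.01, 0.1, 0.5, 0.4]),
}

def sample_email():
    # top-down: sample the label first, then each word indicator given the label
    y = "spam" if rng.random() < p_spam else "not spam"
    x = (rng.random(len(vocab)) < p_word[y]).astype(int)
    return y, x

for _ in range(3):
    print(sample_email())
```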
Inference in Bayesian networks
Computing marginal probabilities in tree-structured Bayesian networks is easy
The algorithm, called belief propagation, generalizes what we showed for hidden Markov models to arbitrary trees (e.g., the HMM chain X_1, ..., X_6 with observations Y_1, ..., Y_6, or the naïve Bayes model with label Y and features X_1, ..., X_n)
(Figure: a Bayesian network that is not a tree)
Wait, this isn't a tree! What can we do?
Inference in Bayesian networks
In some cases (such as this) we can transform the graph into what is called a junction tree, and then run belief propagation
Approximate inference
There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these:
- Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals (see the sketch below)
- Variational inference algorithms (deterministic) find a simpler distribution which is close to the original, then compute marginals using the simpler distribution
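As one concrete illustration of the MCMC idea (an assumption-laden sketch, not an algorithm from the lecture), here is a Gibbs sampler for the hidden states of the earlier HMM: each variable is resampled from its conditional given its Markov blanket, and marginals Pr(x_t | y) are estimated from the visited assignments.

```python
import numpy as np

def gibbs_marginals(y, pi, A, B, n_sweeps=2000, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    n, k = len(y), len(pi)
    x = rng.integers(k, size=n)              # arbitrary initial assignment
    counts = np.zeros((n, k))
    for sweep in range(n_sweeps):
        for t in range(n):
            # Pr(x_t | rest) ∝ Pr(x_t | x_{t-1}) Pr(y_t | x_t) Pr(x_{t+1} | x_t)
            p = (pi if t == 0 else A[x[t - 1]]) * B[:, y[t]]
            if t + 1 < n:
                p = p * A[:, x[t + 1]]       # factor from the child x_{t+1}
            x[t] = rng.choice(k, p=p / p.sum())
        if sweep >= burn_in:
            counts[np.arange(n), x] += 1     # tally the visited assignment
    return counts / counts.sum(axis=1, keepdims=True)   # estimated Pr(x_t | y)
```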
Maximum likelihood estimation in Bayesian networks
Suppose that we know the Bayesian network structure G
Let θ_{x_i | x_{pa(i)}} be the parameter giving the value of the CPD p(x_i | x_{pa(i)})
Maximum likelihood estimation corresponds to solving:
max_θ (1/M) \sum_{m=1}^M log p(x^{(m)}; θ)
subject to the non-negativity and normalization constraints. This is equal to:
max_θ (1/M) \sum_{m=1}^M log p(x^{(m)}; θ) = max_θ (1/M) \sum_{m=1}^M \sum_{i=1}^N log p(x_i^{(m)} | x_{pa(i)}^{(m)}; θ)
  = \sum_{i=1}^N max_θ (1/M) \sum_{m=1}^M log p(x_i^{(m)} | x_{pa(i)}^{(m)}; θ)
The optimization problem decomposes into an independent optimization problem for each CPD! Has a simple closed-form solution.
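For discrete (tabular) CPDs, that closed-form solution is just normalized counts per (node, parent configuration). A minimal sketch; the data layout (one dict per sample) and names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def mle_cpds(data, parents):
    """data: list of full assignments {node: value}; parents: {node: tuple of nodes}.
    Returns {node: {parent_values: {value: probability}}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for sample in data:
        for node, value in sample.items():
            pa = tuple(sample[p] for p in parents[node])
            counts[node][pa][value] += 1     # count per (node, parent config)
    cpds = {}
    for node, by_pa in counts.items():
        # normalize each parent configuration independently
        cpds[node] = {pa: {v: c / sum(ctr.values()) for v, c in ctr.items()}
                      for pa, ctr in by_pa.items()}
    return cpds
```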