A.I. in health informatics lecture 8 structured learning. kevin small & byron wallace

Noah Cummings
5 years ago

1 A.I. in health informatics lecture 8 structured learning kevin small & byron wallace

2 today models for structured learning: HMMs and CRFs. structured learning is particularly useful in biomedical applications: parsing (clinical) text, genetic data, etc. we'll cover this in more detail, but need the basics first

3 unstructured learning assumptions: we're given a set of i.i.d. instances $\{x_1, x_2, \ldots, x_N\}$ and their (univariate) labels $\{y_1, y_2, \ldots, y_N\}$; no order or sequence to the data

4 unstructured learning: graphically [figure: graphical model over labels $y_1, y_2, y_3$ and instances $x_1, x_2, x_3$]

5 structured learning assumptions: there is some correlation between a label $y_i$ and the preceding labels; in other words, $y_i$ is structured, i.e., $y_{i+1}$ is affected by the previous labels $y_i$, $y_{i-1}$

6 structured learning consider the task of part-of-speech (POS) tagging sentences - nouns tend to follow verbs image from:

7 structured learning: graphically [figure: graphical model over labels $y_1, y_2, y_3$ and instances $x_1, x_2, x_3$]

8 probability & structured learning $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$ intractable!

9 (first-order) HMM [figure: graphical model over states $y_1, y_2, y_3$ and observations $x_1, x_2, x_3$] * usually assume states (ys) are latent, but not always (see weather example, upcoming)

10 (first-order) MM $p(x_1, x_2, \ldots, x_n) = p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})$ tractable!

11 markov model [figure: state transition diagram for three states k = 1, 2, 3 with transition probabilities $A_{11}$ through $A_{33}$] from Bishop PRML

12 markov model: unfolded [figure: states k = 1, 2, 3 unfolded over steps n-2, n-1, n, n+1, with transitions $A_{11}$ and $A_{33}$ labelled] from Bishop PRML

13 markov model $a_{ij}$: probability of transitioning from state i to j - (we have to go somewhere!) - $\sum_j a_{ij} = 1$, $a_{ij} \ge 0$

14 let's talk about the weather. the world has three states: 1 rainy, 2 cloudy, 3 sunny (in the weather case the markov model is not hidden). our transition matrix is: note that this system likes to stay where it is!

15 the weather probability of observing the weather {sunny, sunny, rainy}? $= 1.0 \cdot P(S \mid S) \cdot P(R \mid S) = .8 \cdot .1 = .08$

16 the weather it's sunny today. what's the probability that it remains so for k days? $= (.8)^k$
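the two weather calculations above can be sketched in a few lines. this is a minimal illustration, not the lecturers' code: only P(S|S) = .8 and P(R|S) = .1 come from the slides; every other entry of `A` below is an assumed, illustrative value chosen so that each row sums to 1. the names `A`, `sequence_probability` and `rows_are_stochastic` are hypothetical.

```python
# First-order Markov chain over observed weather states.
# ASSUMPTION: only the "sunny" row entries .8 (stay sunny) and .1 (sunny -> rainy)
# appear in the slides; all other numbers are illustrative placeholders.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}


def rows_are_stochastic(A, tol=1e-9):
    """Check sum_j a_ij = 1 and a_ij >= 0 for every state i."""
    return all(
        abs(sum(row.values()) - 1.0) < tol and min(row.values()) >= 0.0
        for row in A.values()
    )


def sequence_probability(seq, A, p_first=1.0):
    """p(x_1..x_n) = p(x_1) * prod_{i>=2} p(x_i | x_{i-1}).

    p_first is p(x_1); it is 1.0 when the first state is given.
    """
    prob = p_first
    for prev, cur in zip(seq, seq[1:]):
        prob *= A[prev][cur]
    return prob


if __name__ == "__main__":
    print(rows_are_stochastic(A))
    print(sequence_probability(["sunny", "sunny", "rainy"], A))
    k = 3  # sunny today, and still sunny after each of the next k transitions
    print(sequence_probability(["sunny"] * (k + 1), A))
```

the last call multiplies k copies of P(S|S), which is the $(.8)^k$ expression on the slide.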

17 hidden markov models in the weather example, we only cared about transitions: the (weather) states were the observations. often we need to model data in which an observation is generated conditional on some latent, underlying state, e.g., biased coin example; urn-genie example. enter the HMM

18 the urn example the actual urn the genie draws from is unobserved; we only know the sequence of draws! a genie (?!?) is in the room choosing urns to draw balls from; each urn contains different proportions of the various colored balls

19 hidden markov models: sufficient parameters N states (latent), M symbols (observed): symbols are observed conditioned on the current state. A: transition probabilities (from urn to urn); B: symbol emission probabilities (color proportions in each urn); π: initial state distribution (initial urn likelihood). λ = (A, B, π) specifies our model
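to make λ = (A, B, π) concrete, here is a small sketch of the urn-genie process as a generator of (hidden urn, observed color) pairs. all numbers are made up for illustration (two urns, two colors), and `sample_hmm` is a hypothetical helper name, not something from the lecture.

```python
import random

# ASSUMPTION: illustrative parameters for N = 2 urns and M = 2 ball colors.
A = [[0.7, 0.3],   # A[i][j]: probability the genie moves from urn i to urn j
     [0.4, 0.6]]
B = [[0.9, 0.1],   # B[i][k]: probability of drawing color k from urn i
     [0.2, 0.8]]
pi = [0.6, 0.4]    # pi[i]: probability the genie starts at urn i


def sample_hmm(A, B, pi, T, rng):
    """Draw T (state, symbol) pairs; only the symbols would be observed."""
    states, symbols = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(state)
        # emit a symbol conditioned on the current (latent) state
        symbols.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        # then transition to the next state
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, symbols


if __name__ == "__main__":
    urns, colors = sample_hmm(A, B, pi, T=10, rng=random.Random(0))
    print("hidden urns:    ", urns)
    print("observed colors:", colors)
```

passing an explicit `random.Random` instance keeps the draw reproducible without touching global random state.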

20 hidden markov models: the three problems 1 given a set of observations $o = \{o_1, o_2, \ldots, o_T\}$ and a model λ, compute $P(o \mid \lambda)$ 2 given o, λ, calculate most likely latent states (usually thought of as labels) $q = \{q_1, q_2, \ldots, q_T\}$ 3 given o, calculate λ

21 hidden markov models: problem 1 1 given a set of observations $o = \{o_1, o_2, \ldots, o_T\}$ and a model λ, compute $P(o \mid \lambda)$: $P(o \mid \lambda) = \sum_{\text{all } q} P(o \mid q, \lambda) P(q \mid \lambda)$ cool. so we're done? not quite. this will require $O(N^T)$ calculations

22 dynamic programming to the rescue! [figure; axis label: time T]

23 dynamic programming to the rescue!

24 about that runtime N states to (N-1) other states, T times. can save a bit using the backward direction, too (see the paper); this will be referred to with a β and is analogous to α [figure; axis label: time T]
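the dynamic-programming idea for problem 1 can be sketched as the forward recursion below, next to the brute-force sum over all state sequences it replaces. this is a sketch under assumptions: the α notation and the recursion follow the standard forward algorithm rather than anything shown explicitly on these slides, the parameters are illustrative, and the function names are hypothetical.

```python
from itertools import product

# ASSUMPTION: illustrative two-state, two-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities
pi = [0.6, 0.4]                # initial state distribution


def forward_likelihood(obs, A, B, pi):
    """P(o | lambda) via the forward variable alpha; cost grows as N^2 * T."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
            for j in range(N)
        ]
    return sum(alpha)


def brute_force_likelihood(obs, A, B, pi):
    """Same quantity by summing over all N^T state sequences q."""
    N, T = len(pi), len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


if __name__ == "__main__":
    obs = [0, 1, 1, 0]
    print(forward_likelihood(obs, A, B, pi))
    print(brute_force_likelihood(obs, A, B, pi))
```

both functions should agree up to floating-point error; only the first stays feasible as T grows.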

25 hidden markov models: the three problems

26 hidden markov models: problem 2 2 given a set of observations $o = \{o_1, o_2, \ldots, o_T\}$ and a model λ, find most likely latent states (urns, say) $q^* = \{q_1, q_2, \ldots, q_T\}$. a unique solution need not exist! what are we even optimizing here? individual most likely states $q_i$? joint probability of q?

27 hidden markov models: problem 2 problem: optimizing for the most likely state at any given time can lead to impossible sequences under λ, so let's consider the joint instead

28 solving problem 2; the joint solution want to solve for q*: $q^* = \arg\max_q P(q, o \mid \lambda)$ obviously enumerating all possible q sequences is infeasible; dynamic programming to the rescue, again!

29 the viterbi algorithm define: (most likely sequence up to time t-1). the inductive step: that's it! we just need to keep track of the states we visit! i.e., from the best path so far, we find the most likely transition/emission probability
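a minimal sketch of the viterbi recursion with backpointers follows. the slide's own definition and inductive-step equations did not survive extraction, so the δ ("best score ending in each state") bookkeeping here is the textbook formulation and should be read as an assumption; parameters are illustrative and `viterbi` is a hypothetical function name.

```python
# ASSUMPTION: illustrative two-state, two-symbol model.
A = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities
pi = [0.6, 0.4]                # initial state distribution


def viterbi(obs, A, B, pi):
    """Return (q_star, P(q_star, o | lambda)) for the best joint state path."""
    N = len(pi)
    # delta[j]: probability of the best path so far that ends in state j
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # for each target state j, remember the best predecessor state
        ptr = [max(range(N), key=lambda i: prev[i] * A[i][j]) for j in range(N)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(N)]
        backpointers.append(ptr)
    # trace back from the best final state
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


if __name__ == "__main__":
    path, prob = viterbi([0, 1, 1, 0], A, B, pi)
    print(path, prob)
```

with these illustrative numbers the best path for observations `[0, 1, 1, 0]` is `[0, 1, 1, 0]` with joint probability 0.02239488. when several paths tie, `max` simply keeps the first one it sees, which matches the slide's remark that a unique solution need not exist.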

30 hidden markov models: the three problems

31 EM for λ define the probability of being in state i at time t and state j at time t+1; terms labelled in the equation: forward to state i; from i to j; emit observation in state j; backward to state j

32 EM for λ probability of being in state i is: (we have to transition somewhere)

33 EM for λ all of these estimates can be computed from the observed data! (by counting!)

34 EM for λ at a given time t we have λ = (A, B, π). E-step: calculate the likelihood of our observations o using the current estimates. M-step: re-estimate the parameters (left-hand sides of equations on previous slide) using current estimates
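one E-step/M-step pass can be sketched as below. the re-estimation equations themselves are missing from the extracted slides, so this follows the standard Baum-Welch updates for a single observation sequence; the symbols ξ and γ, the function names, and the parameter values are all assumptions made for illustration. no scaling is applied, so it is only suitable for short sequences.

```python
# ASSUMPTION: illustrative two-state, two-symbol starting estimates.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]


def forward_backward(obs, A, B, pi):
    """Return alpha, beta and P(o | lambda) (unscaled)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return alpha, beta, sum(alpha[T - 1])


def baum_welch_step(obs, A, B, pi):
    """One EM iteration; returns (new_A, new_B, new_pi, likelihood under old lambda)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta, like = forward_backward(obs, A, B, pi)
    # xi[t][i][j]: P(state i at t, state j at t+1 | o, lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # gamma[t][i]: P(state i at t | o, lambda)
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(N)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi, like


if __name__ == "__main__":
    obs = [0, 1, 1, 0, 0, 1, 0, 0]
    A1, B1, pi1, before = baum_welch_step(obs, A, B, pi)
    _, _, after = forward_backward(obs, A1, B1, pi1)
    print(before, after)  # the likelihood should not decrease
```

the expected counts in `new_A` and `new_B` are the "counting" the earlier slide refers to, with soft (probabilistic) counts in place of hard ones.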

35 shortcomings of HMMs HMMs model the joint probability of the observations (x) and labels (y) - i.e., it's a generative model. but what we really care about is the conditional probability of a label sequence y given x. enter conditional markov models

36 conditional models [figure: HMM, MEMM, and CRF graphical models over $Y_{i-1}, Y_i, Y_{i+1}$ and $X_{i-1}, X_i, X_{i+1}$] conditional models don't waste time modeling the observed data

37 MEMMs (2000, McCallum et al) MEMM: Maximum Entropy Markov Model. output: probability vector of transitioning from one state to all other states, given an input observation x - conditional in that we estimate p(state) at time t given an observation and the preceding state $s_{t-1}$ [figure: HMM and MEMM dependency diagrams over $s_{t-1}$, $s_t$, $o_t$]

38 MEMMs v. HMM prob. of emitting o and being in state s at time t: HMM $\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s') P(s \mid s') P(o_{t+1} \mid s)$ MEMM $\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s') P_{s'}(s \mid o_{t+1})$
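the two recursions differ only in the factor applied to $\alpha_t(s')$, which a short sketch makes visible. the tables below are invented for illustration; in particular `P[s_prev][o][s]` stands in for the per-state conditional $P_{s'}(s \mid o)$, which an actual MEMM would obtain from a trained maximum entropy classifier. function names are hypothetical.

```python
# ASSUMPTION: illustrative tables for two states and two observation symbols.
A = [[0.7, 0.3], [0.4, 0.6]]          # HMM: P(s | s')
B = [[0.9, 0.1], [0.2, 0.8]]          # HMM: P(o | s)
P = [                                  # MEMM: P[s_prev][o][s] = P_{s'}(s | o)
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9]],
]


def hmm_step(alpha, A, B, o):
    """alpha_{t+1}(s) = sum_{s'} alpha_t(s') P(s | s') P(o_{t+1} | s)."""
    N = len(alpha)
    return [sum(alpha[sp] * A[sp][s] for sp in range(N)) * B[s][o] for s in range(N)]


def memm_step(alpha, P, o):
    """alpha_{t+1}(s) = sum_{s'} alpha_t(s') P_{s'}(s | o_{t+1})."""
    N = len(alpha)
    return [sum(alpha[sp] * P[sp][o][s] for sp in range(N)) for s in range(N)]


if __name__ == "__main__":
    alpha = [0.5, 0.5]
    print(hmm_step(alpha, A, B, o=1))
    print(memm_step(alpha, P, o=1))
```

because each $P_{s'}(\cdot \mid o)$ is normalized over next states, the MEMM step conserves the total mass of α regardless of the observation, whereas the HMM step shrinks it by the emission probabilities. that per-state normalization is the mechanism behind the label bias problem.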

39 CRFs (2001, Lafferty et al) CRF: Conditional Random Field. single model for the joint probability of the sequence of labels given the observations; mitigates the label bias problem, in which states with low-entropy (almost certain) transition probability vectors effectively ignore the observation [figure: HMM, MEMM, and CRF graphical models over $Y_{i-1}, Y_i, Y_{i+1}$ and $X_{i-1}, X_i, X_{i+1}$]
