Artificial Intelligence Markov Chains

Claude Reed
6 years ago

Stephan Dreiseitl
FH Hagenberg, Software Engineering & Interactive Media

Stephan Dreiseitl (Hagenberg/SE/IM) Lecture 12: Markov Chains Artificial Intelligence SS / 36

Overview
- Uncertain reasoning in time
- Using Markov chains for simulations
- Hidden Markov models
- State estimation
- Most probable path estimation
- Application: speech recognition

Uncertain reasoning in time
- We want to model systems that change over time in some non-deterministic manner.
- Use stochastic processes: collections of random variables $X_0, X_1, \ldots$ that take on values in some state space $S$.
- One collection of random variables for each quantity of interest.
- The current random variable value (state) depends on previous states.
- Easier to model if states are observable; otherwise use hidden Markov models.

Example: Wumpus world
- The Wumpus world is static; only the agent moves.
- Assume that the agent does not reason logically but moves randomly.
- The current agent position $X_t$ depends on the previous positions $X_{t-1}, \ldots, X_1$.
- I.e., the current position is given by $P(X_t \mid X_{t-1}, \ldots, X_1)$.
- How to calculate this?

Simplifying assumptions
- Markov assumption: the current state depends on only a finite number of previous states.
- First-order Markov process (chain): $P(X_t \mid X_{t-1}, \ldots, X_0) = P(X_t \mid X_{t-1})$
- Second-order Markov process (chain): $P(X_t \mid X_{t-1}, \ldots, X_0) = P(X_t \mid X_{t-1}, X_{t-2})$
- Stationarity assumption: the transition probabilities $P(X_t \mid \mathrm{parents}(X_t))$ do not depend on $t$.

Markov chains as Bayesian networks
- First-order Markov chain: $\cdots \to X_{t-2} \to X_{t-1} \to X_t \to X_{t+1} \to \cdots$, with $\cdots = P(X_{t-1} \mid X_{t-2}) = P(X_t \mid X_{t-1}) = P(X_{t+1} \mid X_t) = \cdots$
- Second-order Markov chain: same nodes $X_{t-2}, X_{t-1}, X_t, X_{t+1}$, but each node has its two predecessors as parents, with $\cdots = P(X_{t+1} \mid X_t, X_{t-1}) = \cdots$

Using Markov chains for simulations
- Express the transition probabilities $P(X_t = s_i \mid X_{t-1} = s_j)$ as a matrix $A_{ij}$ (columns always sum to 1).
- Equilibrium distribution of a Markov chain: the distribution the chain converges to, i.e. $\lim_{t \to \infty} P(X_t)$.
- Problem: it is sometimes hard to generate random values for complicated distributions (e.g., Bayesian networks).
- Solution: construct a Markov chain with the desired distribution as its equilibrium distribution; the random values are then samples of the Markov chain.

Simple simulation example
- States $S = \{s_1, s_2\}$, state transition matrix
$$A = \begin{pmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{pmatrix}$$
[Figure: the chain unrolled over the time steps $X_{t-2}, X_{t-1}, X_t, X_{t+1}$, each with the two states $s_1$ and $s_2$]
- With matrix algebra and conditional probabilities, one can show that
$$P(X_{t+1}) = A^t P(X_1) \quad \text{and} \quad \lim_{t \to \infty} A^t = \begin{pmatrix} 1/3 & 1/3 \\ 2/3 & 2/3 \end{pmatrix}$$
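The convergence claim for the two-state example can be checked numerically. The following is a minimal sketch in plain Python, not part of the original slides: the function names `step`, `propagate`, and `simulate` are hypothetical, and the matrix is stored so that `A[i][j]` is the probability of moving to state i from state j (columns sum to 1, as above).

```python
import random

# Column-stochastic transition matrix of the two-state example:
# A[i][j] = P(X_t = s_(i+1) | X_(t-1) = s_(j+1)).
A = [[0.5, 0.25],
     [0.5, 0.75]]


def step(A, p):
    """One exact update of the state distribution: p_next = A p."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]


def propagate(A, p, steps):
    """Apply the update `steps` times, i.e. compute A^steps p."""
    for _ in range(steps):
        p = step(A, p)
    return p


def simulate(A, start, steps, rng):
    """Sample one trajectory and return the relative frequency of each state."""
    n = len(A)
    counts = [0] * n
    state = start
    for _ in range(steps):
        # Column `state` of A is the distribution of the next state.
        weights = [A[i][state] for i in range(n)]
        state = rng.choices(range(n), weights=weights)[0]
        counts[state] += 1
    return [c / steps for c in counts]


if __name__ == "__main__":
    # Exact propagation from both start states approaches (1/3, 2/3).
    print(propagate(A, [1.0, 0.0], 30))
    print(propagate(A, [0.0, 1.0], 30))
    # Relative frequencies of a sampled trajectory approach the same values.
    rng = random.Random(0)
    print(simulate(A, 0, 100_000, rng))
    print(simulate(A, 1, 100_000, rng))
```

Both the exact propagation and the sampled frequencies should settle near (1/3, 2/3) regardless of the start state; the sampled values fluctuate with the seed.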

Simple simulation example (cont.)
- Obtain the equilibrium distribution $P(X_t) = (1/3, 2/3)$ for an arbitrary initial distribution $P(X_1)$.
- Verify numerically:
[Figure: relative frequencies of states $s_1$ and $s_2$ with start state $s_1$ (left) and $s_2$ (right)]

Calculating state probabilities
- Consider the stochastic wumpus world example: states are positions on the $4 \times 4$ board.
- Distinguish between the r.v. $X_t$, the state constants $s_j$, and the variables $S_t$ denoting the state at time $t$.
- For brevity, also write $S_t$ to denote $X_t = S_t$.
- Marginalize to obtain the probability $P(X_t = g)$ of reaching the gold at time step $t$ as
$$P(X_t = g) = \sum_{(S_1, \ldots, S_{t-1})} P(X_t = g, (S_1, \ldots, S_{t-1}))$$
where the sum runs over all state sequences $(S_1, \ldots, S_{t-1})$ that lead to $g$.

Calculating state probabilities (cont.)
- The state transitions of the agent are a first-order Markov process. With a known first state (i.e., $P(X_1 = s_i) = 1$) we obtain
$$P(X_t = g, (S_1, \ldots, S_{t-1})) = P(X_t = g \mid S_{t-1})\, P(S_{t-1} \mid S_{t-2}) \cdots P(S_2 \mid S_1)$$
- Therefore, the calculation
$$P(X_t = g) = \sum_{(S_1, \ldots, S_{t-1})} P(X_t = g \mid S_{t-1}) \cdots P(S_2 \mid S_1)$$
over all state sequences $(S_1, \ldots, S_{t-1})$ that lead to $g$ grows exponentially with $t$ (bad).

Improving calculations
- To simplify notation, write $p_t(i) = P(X_t = s_i)$.
- Idea to reduce the complexity to polynomial (good):
$$p_1(i) = \begin{cases} 1 & \text{if } s_i \text{ is the start state} \\ 0 & \text{otherwise} \end{cases}$$
$$p_{t+1}(i) = P(X_{t+1} = s_i) = \sum_{j=1}^{n} P(X_{t+1} = s_i, X_t = s_j) = \sum_{j=1}^{n} P(X_{t+1} = s_i \mid X_t = s_j)\, P(X_t = s_j) = \sum_{j=1}^{n} A_{ij}\, p_t(j)$$
- This trick (dynamic programming) is used often with hidden Markov models.
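To make the difference between the two calculations concrete, here is a small sketch that computes $P(X_t = s_i)$ both ways: by summing over every state sequence and by the recursion $p_{t+1}(i) = \sum_j A_{ij}\, p_t(j)$. It reuses the two-state matrix from the simulation example rather than the wumpus board, and the function names are hypothetical.

```python
from itertools import product

# Column-stochastic matrix from the simple simulation example:
# A[i][j] = P(X_t = state i | X_(t-1) = state j).
A = [[0.5, 0.25],
     [0.5, 0.75]]


def prob_brute_force(A, start, target, t):
    """P(X_t = target) with X_1 = start, summing over all paths (exponential in t)."""
    n = len(A)
    if t == 1:
        return 1.0 if start == target else 0.0
    total = 0.0
    # Enumerate every choice of the intermediate states S_2, ..., S_(t-1).
    for middle in product(range(n), repeat=t - 2):
        path = (start, *middle, target)
        p = 1.0
        for prev, nxt in zip(path, path[1:]):
            p *= A[nxt][prev]
        total += p
    return total


def prob_dynamic(A, start, target, t):
    """Same probability via p_(t+1)(i) = sum_j A[i][j] * p_t(j) (polynomial in t)."""
    n = len(A)
    p = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(t - 1):
        p = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p[target]


if __name__ == "__main__":
    for t in range(1, 9):
        print(t, prob_brute_force(A, 0, 0, t), prob_dynamic(A, 0, 0, t))
```

The brute-force version visits $n^{t-2}$ paths, while the recursion needs about $n^2$ multiplications per time step; both return the same value up to rounding.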

Hidden Markov models (HMMs)
- Often, the states of the world are not observable (hidden).
- However, some evidence $E_t$ that depends (stochastically) on the state is available, i.e., we know $P(E_t \mid X_t)$.
[Figure: Bayesian network with the chain $X_{t-2} \to X_{t-1} \to X_t \to X_{t+1}$ and an arc from each $X_k$ to its evidence $E_k$]
- Assume $P(X_t \mid X_{t-1})$ and $P(E_t \mid X_t)$ do not depend on $t$:
$$P(X_1, \ldots, X_t, E_1, \ldots, E_t) = P(X_1) \prod_{k=2}^{t} P(X_k \mid X_{k-1}) \prod_{k=1}^{t} P(E_k \mid X_k)$$

HMM formal specification
To completely specify an HMM, we need
- the number $N$ of possible hidden states for each $X_t$
- the number $M$ of possible observations for each $E_t$
- the initial state probabilities $\pi_1, \ldots, \pi_N$: $\pi_i = P(X_1 = s_i)$
- the state transition probabilities $A_{ij} = P(X_t = s_i \mid X_{t-1} = s_j)$
- the observation probabilities $B_j(o_i) = P(E_t = o_i \mid X_t = s_j)$

HMM notational conventions
- Distinguish between the possible states $s_1, \ldots, s_N$ for each r.v. $X_t$ and the concrete state $S_t$ that the system is in at time $t$ (one of the $s_1, \ldots, s_N$).
- For brevity, write $S_t$ instead of $X_t = S_t$.
- Same for the evidence $E_t$: at time $t$, one of $M$ possible outputs $o_1, \ldots, o_M$ can be observed. Use $O_t$ to denote the concrete observation at time $t$ (one of $o_1, \ldots, o_M$).
- For brevity, write $O_t$ instead of $E_t = O_t$.

HMM interesting problems
- State estimation: what is the probability of state $s_i$, given a list of observations? I.e., what is $P(X_t = s_i \mid (O_1, \ldots, O_t))$?
- Most probable path: given observations $O_1, \ldots, O_t$, what is the most probable sequence of states $S_1, \ldots, S_t$?
- Learning HMMs: given observations $O_1, \ldots, O_t$, what is the HMM most likely to produce these observations?
- HMM applications: speech recognition, bioinformatics, consumer decision modelling, economics and finance

Simple HMM example
- An employee wants to infer the food quality in the cafeteria from a co-worker's expression after lunch.
- Three food qualities (hidden states): good (g), mediocre (m), bad (b)
- Three co-worker's expressions (observations): happy (h), indifferent (i), angry (a)
- One day's food quality influences the next (leftovers).
[Figure: Bayesian network with the chain $X_{t-2} \to X_{t-1} \to X_t \to X_{t+1}$ and an arc from each $X_k$ to its evidence $E_k$]

Simple HMM example (cont.)
Start: $P(X_1 = g) = 0.3$, $P(X_1 = m) = 0.5$, $P(X_1 = b) = 0.2$

State transitions (each column sums to 1):
$P(g \mid g) = 0.1$   $P(g \mid m) = 0.3$   $P(g \mid b) = 0$
$P(m \mid g) = 0.7$   $P(m \mid m) = 0.6$   $P(m \mid b) = 0.8$
$P(b \mid g) = 0.2$   $P(b \mid m) = 0.1$   $P(b \mid b) = 0.2$

Observation probabilities (each column sums to 1):
$P(h \mid g) = 0.8$   $P(h \mid m) = 0.3$   $P(h \mid b) = 0.1$
$P(i \mid g) = 0.2$   $P(i \mid m) = 0.5$   $P(i \mid b) = 0.2$
$P(a \mid g) = 0$   $P(a \mid m) = 0.2$   $P(a \mid b) = 0.7$
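One possible way to hold the cafeteria model in code is a set of nested dictionaries, sketched below. The names `PI`, `A`, `B`, `is_normalized`, and `sample` are hypothetical choices for this illustration; the indexing follows the conventions above, with `A[i][j]` for $P(X_t = i \mid X_{t-1} = j)$ and `B[j][o]` for $P(E_t = o \mid X_t = j)$.

```python
import random

STATES = ("g", "m", "b")        # good, mediocre, bad
OBSERVATIONS = ("h", "i", "a")  # happy, indifferent, angry

# Initial state probabilities pi_i = P(X_1 = s_i).
PI = {"g": 0.3, "m": 0.5, "b": 0.2}

# A[i][j] = P(X_t = i | X_(t-1) = j); every column j sums to 1.
A = {
    "g": {"g": 0.1, "m": 0.3, "b": 0.0},
    "m": {"g": 0.7, "m": 0.6, "b": 0.8},
    "b": {"g": 0.2, "m": 0.1, "b": 0.2},
}

# B[j][o] = P(E_t = o | X_t = j); every row B[j] sums to 1.
B = {
    "g": {"h": 0.8, "i": 0.2, "a": 0.0},
    "m": {"h": 0.3, "i": 0.5, "a": 0.2},
    "b": {"h": 0.1, "i": 0.2, "a": 0.7},
}


def is_normalized(tol=1e-9):
    """Check that pi, each transition column, and each observation row sum to 1."""
    ok = abs(sum(PI.values()) - 1.0) < tol
    for j in STATES:
        ok = ok and abs(sum(A[i][j] for i in STATES) - 1.0) < tol
        ok = ok and abs(sum(B[j][o] for o in OBSERVATIONS) - 1.0) < tol
    return ok


def sample(days, rng):
    """Draw a hidden food-quality sequence and the matching expression sequence."""
    states, observations = [], []
    for t in range(days):
        if t == 0:
            weights = [PI[s] for s in STATES]
        else:
            weights = [A[s][states[-1]] for s in STATES]
        state = rng.choices(STATES, weights=weights)[0]
        obs = rng.choices(OBSERVATIONS, weights=[B[state][o] for o in OBSERVATIONS])[0]
        states.append(state)
        observations.append(obs)
    return states, observations


if __name__ == "__main__":
    print(is_normalized())
    print(sample(5, random.Random(0)))
```

Sampling follows the generative reading of the model: first draw the food quality from the start or transition distribution, then draw the co-worker's expression given that quality.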

Simple HMM example (cont.)
- Assume the first three days are like this:
  food quality: m, g, m
  co-worker's expression: i, h, h
- The employee sees only the co-worker's expression sequence (i,h,h).
- What can be inferred about the food quality?
- Tackle some easier questions first.

Probability of observation sequence
- What is the probability of the observation sequence (i,h,h)?
- A not very good way (but easier to see):
$$P(i,h,h) = \sum_{(S_1,S_2,S_3)} P((i,h,h), (S_1, S_2, S_3)) = \sum_{(S_1,S_2,S_3)} P((i,h,h) \mid (S_1, S_2, S_3))\, P(S_1, S_2, S_3)$$
where both sums run over all 3-element state sequences $(S_1, S_2, S_3)$.
- How to compute $P(S_1, S_2, S_3)$?
- How to compute $P((i,h,h) \mid (S_1, S_2, S_3))$?

Probability of observation sequence (cont.)
- How to compute $P(S_1, S_2, S_3)$? With conditional independence,
$$P(S_1, S_2, S_3) = P(S_1)\, P(S_2 \mid S_1)\, P(S_3 \mid S_2)$$
E.g., with $(S_1, S_2, S_3) = (m,b,b)$, we get $P = 0.5 \cdot 0.1 \cdot 0.2 = 0.01$.
- How to compute $P((i,h,h) \mid (S_1, S_2, S_3))$?
$$P((i,h,h) \mid (S_1, S_2, S_3)) = P(i \mid S_1)\, P(h \mid S_2)\, P(h \mid S_3)$$
E.g., with $(S_1, S_2, S_3) = (m,b,b)$, we get $P = 0.5 \cdot 0.1 \cdot 0.1 = 0.005$.

Probability of observation sequence (cont.)
- Problem: there are 27 possibilities for $(S_1, S_2, S_3)$, so calculating
$$P(i,h,h) = \sum_{(S_1,S_2,S_3)} P((i,h,h) \mid (S_1, S_2, S_3))\, P(S_1, S_2, S_3)$$
over all 3-element state sequences requires 54 calculations (exponential growth in the length of the sequence).
- Better: use the same trick as before (dynamic programming).
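The exhaustive sum over all 27 state sequences can be written out directly. This is a sketch of the slow method only, using the cafeteria probabilities from the example; the helper names `path_prior`, `likelihood`, and `observation_probability` are hypothetical.

```python
from itertools import product

STATES = ("g", "m", "b")
PI = {"g": 0.3, "m": 0.5, "b": 0.2}
# A[i][j] = P(X_t = i | X_(t-1) = j)
A = {
    "g": {"g": 0.1, "m": 0.3, "b": 0.0},
    "m": {"g": 0.7, "m": 0.6, "b": 0.8},
    "b": {"g": 0.2, "m": 0.1, "b": 0.2},
}
# B[j][o] = P(E_t = o | X_t = j)
B = {
    "g": {"h": 0.8, "i": 0.2, "a": 0.0},
    "m": {"h": 0.3, "i": 0.5, "a": 0.2},
    "b": {"h": 0.1, "i": 0.2, "a": 0.7},
}


def path_prior(path):
    """P(S_1, ..., S_n) = P(S_1) * prod_k P(S_k | S_(k-1))."""
    p = PI[path[0]]
    for prev, nxt in zip(path, path[1:]):
        p *= A[nxt][prev]
    return p


def likelihood(obs, path):
    """P(O_1, ..., O_n | S_1, ..., S_n) = prod_k P(O_k | S_k)."""
    p = 1.0
    for o, s in zip(obs, path):
        p *= B[s][o]
    return p


def observation_probability(obs):
    """Sum likelihood * prior over every state sequence of the same length."""
    return sum(
        likelihood(obs, path) * path_prior(path)
        for path in product(STATES, repeat=len(obs))
    )


if __name__ == "__main__":
    print(path_prior(("m", "b", "b")))                   # about 0.01
    print(likelihood(("i", "h", "h"), ("m", "b", "b")))  # about 0.005
    print(observation_probability(("i", "h", "h")))
```

The number of enumerated sequences is $3^n$ for $n$ observations, which is why this approach stops being practical for longer sequences.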

Dynamic programming for state estimation
- For an observation sequence $(O_1, \ldots, O_n)$ and $t \le n$, define
$$\alpha_t(i) = P(X_t = s_i, (O_1, \ldots, O_t))$$
as the probability of seeing $(O_1, \ldots, O_t)$ and ending in state $s_i$.
- The recursive definition gives a polynomial-time calculation:
$$\alpha_t(i) = \begin{cases} B_i(O_1)\, \pi_i & \text{if } t = 1 \\ B_i(O_t) \sum_{k=1}^{N} A_{ik}\, \alpha_{t-1}(k) & \text{if } t > 1 \end{cases}$$

Dynamic programming for state estimation
- Now it is easy to calculate the probabilities of interest. Because
$$\alpha_t(i) = P(X_t = s_i, (O_1, \ldots, O_t)),$$
marginalizing yields
$$P(O_1, \ldots, O_t) = \sum_{i=1}^{N} \alpha_t(i)$$
and the definition of conditional probability gives
$$P(X_t = s_i \mid (O_1, \ldots, O_t)) = \frac{\alpha_t(i)}{\sum_{j=1}^{N} \alpha_t(j)}$$

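State estimation in cafeteria example
Calculate the state probabilities for the observations (i,h,h):
$\alpha_1(g) = 0.2 \cdot 0.3 = 0.06$
$\alpha_1(m) = 0.5 \cdot 0.5 = 0.25$
$\alpha_1(b) = 0.2 \cdot 0.2 = 0.04$
$\alpha_2(g) = 0.8 \cdot (0.1 \cdot 0.06 + 0.3 \cdot 0.25 + 0 \cdot 0.04) = 0.0648$
$\alpha_2(m) = 0.3 \cdot (0.7 \cdot 0.06 + 0.6 \cdot 0.25 + 0.8 \cdot 0.04) = 0.0672$
$\alpha_2(b) = 0.1 \cdot (0.2 \cdot 0.06 + 0.1 \cdot 0.25 + 0.2 \cdot 0.04) = 0.0045$
$\alpha_3(g) = 0.8 \cdot (0.1 \cdot 0.0648 + 0.3 \cdot 0.0672 + 0 \cdot 0.0045) = 0.021312$
$\alpha_3(m) = 0.3 \cdot (0.7 \cdot 0.0648 + 0.6 \cdot 0.0672 + 0.8 \cdot 0.0045) = 0.026784$
$\alpha_3(b) = 0.1 \cdot (0.2 \cdot 0.0648 + 0.1 \cdot 0.0672 + 0.2 \cdot 0.0045) = 0.002058$

State estimation in cafeteria example
- The most likely state at time 1 is m with
$$P(X_1 = m \mid i) = \frac{0.25}{0.06 + 0.25 + 0.04} \approx 0.714$$
- The most likely state at time 2 is m with
$$P(X_2 = m \mid (i,h)) = \frac{0.0672}{0.0648 + 0.0672 + 0.0045} \approx 0.492$$
- The most likely state at time 3 is m with
$$P(X_3 = m \mid (i,h,h)) = \frac{0.026784}{0.021312 + 0.026784 + 0.002058} \approx 0.534$$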
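The $\alpha$ recursion for the cafeteria example can be sketched in a few lines of Python. The function names `forward` and `filtered` are hypothetical; the tables are the ones from the example, with `A[i][k]` for $P(X_t = i \mid X_{t-1} = k)$ and `B[i][o]` for $P(E_t = o \mid X_t = i)$.

```python
STATES = ("g", "m", "b")
PI = {"g": 0.3, "m": 0.5, "b": 0.2}
# A[i][k] = P(X_t = i | X_(t-1) = k)
A = {
    "g": {"g": 0.1, "m": 0.3, "b": 0.0},
    "m": {"g": 0.7, "m": 0.6, "b": 0.8},
    "b": {"g": 0.2, "m": 0.1, "b": 0.2},
}
# B[i][o] = P(E_t = o | X_t = i)
B = {
    "g": {"h": 0.8, "i": 0.2, "a": 0.0},
    "m": {"h": 0.3, "i": 0.5, "a": 0.2},
    "b": {"h": 0.1, "i": 0.2, "a": 0.7},
}


def forward(obs):
    """Return [alpha_1, ..., alpha_n], each a dict mapping state -> alpha_t(state)."""
    # Base case: alpha_1(i) = B_i(O_1) * pi_i.
    alphas = [{i: B[i][obs[0]] * PI[i] for i in STATES}]
    for o in obs[1:]:
        prev = alphas[-1]
        # Recursive case: alpha_t(i) = B_i(O_t) * sum_k A_ik * alpha_(t-1)(k).
        alphas.append(
            {i: B[i][o] * sum(A[i][k] * prev[k] for k in STATES) for i in STATES}
        )
    return alphas


def filtered(alpha):
    """Normalize alpha_t to obtain P(X_t = s_i | O_1, ..., O_t)."""
    total = sum(alpha.values())
    return {s: a / total for s, a in alpha.items()}


if __name__ == "__main__":
    for t, alpha in enumerate(forward(("i", "h", "h")), start=1):
        posterior = filtered(alpha)
        best = max(posterior, key=posterior.get)
        print(t, alpha, best, round(posterior[best], 4))
```

Summing the last `alpha` dictionary gives $P(i,h,h)$, the same quantity the exhaustive sum over 27 state sequences produces, but with work that grows only linearly in the sequence length.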

Inferring most probable path
- For a given observation sequence $(O_1, \ldots, O_t)$, find the state sequence $(S_1, \ldots, S_t)$ with
$$P((S_1, \ldots, S_t) \mid (O_1, \ldots, O_t)) \to \max$$
- Call this best sequence $(S_1^*, \ldots, S_t^*)$.
- Slow idea to calculate:
$$P((S_1, \ldots, S_t) \mid (O_1, \ldots, O_t)) = \frac{P((O_1, \ldots, O_t) \mid (S_1, \ldots, S_t))\, P(S_1, \ldots, S_t)}{P(O_1, \ldots, O_t)}$$
- $(S_1^*, \ldots, S_t^*)$ is not the sequence of most likely states!

Dynamic programming for most prob. path
- Use dynamic programming: for each state $s_i$ and time $t$, calculate the most probable path that ends in $s_i$ at $t$: $\mathrm{mpp}_t(i)$.
- Can do this recursively (as before).
- Key insight: $\mathrm{mpp}_t(i)$ can be calculated from
  - all $\mathrm{mpp}_{t-1}(j)$ that are one state shorter
  - the transition probabilities $P(X_t = s_i \mid X_{t-1} = s_j)$
  - the probability $B_i(O_t)$ of observing $O_t$ in state $s_i$

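Viterbi algorithm
- Fleshing out these ideas gives the Viterbi algorithm:
$$\delta_t(i) = \max_{S_1, \ldots, S_{t-1}} P((S_1, \ldots, S_{t-1}), X_t = s_i, (O_1, \ldots, O_t))$$
- $\mathrm{mpp}_t(i)$ is the path that achieves probability $\delta_t(i)$.
- Recursive formula:
$$\delta_t(i) = \begin{cases} B_i(O_1)\, \pi_i & \text{if } t = 1 \\ \max_j \{ B_i(O_t)\, A_{ij}\, \delta_{t-1}(j) \} & \text{if } t > 1 \end{cases}$$
- Then, $(S_1^*, \ldots, S_t^*)$ is $\mathrm{mpp}_t(i)$ with final state $S_t^* = s_i$ such that $i = \arg\max_i \delta_t(i)$.

Viterbi algorithm on cafeteria example
Calculate the most probable path for the observations (i,h,h):
$\delta_1(g) = 0.2 \cdot 0.3 = 0.06$
$\delta_1(m) = 0.5 \cdot 0.5 = 0.25$
$\delta_1(b) = 0.2 \cdot 0.2 = 0.04$
$\delta_2(g) = \max\{0.8 \cdot 0.1 \cdot 0.06,\ 0.8 \cdot 0.3 \cdot 0.25,\ 0.8 \cdot 0 \cdot 0.04\} = 0.06$
$\delta_2(m) = \max\{0.3 \cdot 0.7 \cdot 0.06,\ 0.3 \cdot 0.6 \cdot 0.25,\ 0.3 \cdot 0.8 \cdot 0.04\} = 0.045$
$\delta_2(b) = \max\{0.1 \cdot 0.2 \cdot 0.06,\ 0.1 \cdot 0.1 \cdot 0.25,\ 0.1 \cdot 0.2 \cdot 0.04\} = 0.0025$
$\delta_3(g) = \max\{0.8 \cdot 0.1 \cdot 0.06,\ 0.8 \cdot 0.3 \cdot 0.045,\ 0.8 \cdot 0 \cdot 0.0025\} = 0.0108$
$\delta_3(m) = \max\{0.3 \cdot 0.7 \cdot 0.06,\ 0.3 \cdot 0.6 \cdot 0.045,\ 0.3 \cdot 0.8 \cdot 0.0025\} = 0.0126$
$\delta_3(b) = \max\{0.1 \cdot 0.2 \cdot 0.06,\ 0.1 \cdot 0.1 \cdot 0.045,\ 0.1 \cdot 0.2 \cdot 0.0025\} = 0.0012$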
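The $\delta$ recursion, together with back-pointers that record which predecessor achieved each maximum, can be sketched as follows. This is one common way to organize the computation, not necessarily how the lecture implements it; the function name `viterbi` and the back-pointer layout are assumptions made for this illustration.

```python
STATES = ("g", "m", "b")
PI = {"g": 0.3, "m": 0.5, "b": 0.2}
# A[i][j] = P(X_t = i | X_(t-1) = j)
A = {
    "g": {"g": 0.1, "m": 0.3, "b": 0.0},
    "m": {"g": 0.7, "m": 0.6, "b": 0.8},
    "b": {"g": 0.2, "m": 0.1, "b": 0.2},
}
# B[i][o] = P(E_t = o | X_t = i)
B = {
    "g": {"h": 0.8, "i": 0.2, "a": 0.0},
    "m": {"h": 0.3, "i": 0.5, "a": 0.2},
    "b": {"h": 0.1, "i": 0.2, "a": 0.7},
}


def viterbi(obs):
    """Return (most probable state path, its joint probability with obs)."""
    # Base case: delta_1(i) = B_i(O_1) * pi_i.
    delta = {i: B[i][obs[0]] * PI[i] for i in STATES}
    backpointers = []  # backpointers[t][i] = best predecessor of state i
    for o in obs[1:]:
        new_delta, pointer = {}, {}
        for i in STATES:
            # Predecessor j that maximizes A_ij * delta_(t-1)(j).
            best_j = max(STATES, key=lambda j: A[i][j] * delta[j])
            pointer[i] = best_j
            new_delta[i] = B[i][o] * A[i][best_j] * delta[best_j]
        delta = new_delta
        backpointers.append(pointer)
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


if __name__ == "__main__":
    print(viterbi(("i", "h", "h")))
```

For the observations (i,h,h) the sketch should return the path (m,g,m), which differs from the sequence (m,m,m) of individually most likely states obtained by state estimation.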

Viterbi algorithm on cafeteria example
- The highest $\delta_3$ value is $\delta_3(m) = 0.0126$, so $S_3^* = m$.
- Work backwards from $t = 3$ to $t = 2$: $S_3^*$ is achieved by a transition $g \to m$, so $S_2^* = g$.
- One more step from $t = 2$ to $t = 1$: $S_2^*$ is achieved by a transition $m \to g$, so $S_1^* = m$.
- The most probable path is therefore (m,g,m).
- Not the same as the sequence of most probable states (m,m,m)!

Sample application: speech recognition
- We have signals and want to find the words that generate the signals, i.e. maximize $P(\text{words} \mid \text{signals})$.
- With Bayes' rule, we get
$$P(\text{words} \mid \text{signals}) = \alpha\, \underbrace{P(\text{signals} \mid \text{words})}_{\text{acoustic model}}\, \underbrace{P(\text{words})}_{\text{language model}}$$
- The acoustic model comprises a pronunciation model and a phone model.
- A phone is an atomic speech sound; a phoneme is a set of phones that are indistinguishable in a language.

Phone models
- Sound is discretized, split into frames (typically 30 ms long), and represented by features.
[Figure: three panels showing the analog acoustic signal, the sampled and quantized digital signal, and the frames with features]
- The phone model is $P(\text{feature} \mid \text{phone})$.

Pronunciation models
- Each word is represented as a distribution over phone sequences, implemented as a transition model.
[Figure: transition model for "tomato": [t] goes to [ow] with probability 0.2 and to [ah] with probability 0.8; both go to [m] with probability 1.0; [m] goes to [ey] with probability 0.5 and otherwise to [aa]; both go to [t] with probability 1.0; [t] goes to [ow] with probability 1.0]
- $P([\text{towmeytow}] \mid \text{tomato}) = P([\text{towmaatow}] \mid \text{tomato}) = 0.1$
- $P([\text{tahmeytow}] \mid \text{tomato}) = P([\text{tahmaatow}] \mid \text{tomato}) = 0.4$
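The four pronunciation probabilities follow from multiplying the transition probabilities along each path through the model. The sketch below encodes the "tomato" transition model as a small graph and enumerates its paths. The node labels (`t1`, `ow1`, and so on) are hypothetical and only keep the two [t] and two [ow] positions apart; the 0.5 on the [m] to [aa] transition is inferred from the stated path probabilities rather than read off the figure.

```python
# Transition model for "tomato": node -> list of (successor, probability).
MODEL = {
    "start": [("t1", 1.0)],
    "t1": [("ow1", 0.2), ("ah", 0.8)],
    "ow1": [("m", 1.0)],
    "ah": [("m", 1.0)],
    "m": [("ey", 0.5), ("aa", 0.5)],  # 0.5 for [aa] is inferred, see lead-in
    "ey": [("t2", 1.0)],
    "aa": [("t2", 1.0)],
    "t2": [("ow2", 1.0)],
    "ow2": [],  # final phone, no successors
}

# Phone emitted at each node (hypothetical node labels -> phone symbols).
PHONE = {
    "t1": "t", "ow1": "ow", "ah": "ah", "m": "m",
    "ey": "ey", "aa": "aa", "t2": "t", "ow2": "ow",
}


def pronunciations(node="start", prob=1.0, phones=()):
    """Yield (phone sequence, probability) for every complete path from `node`."""
    successors = MODEL[node]
    if not successors:
        yield " ".join(phones), prob
        return
    for nxt, p in successors:
        yield from pronunciations(nxt, prob * p, phones + (PHONE[nxt],))


if __name__ == "__main__":
    for sequence, probability in pronunciations():
        print(f"[{sequence}] {probability:.1f}")
```

Because every node's outgoing probabilities sum to 1, the probabilities of all complete pronunciations also sum to 1.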

Language models
- The prior probability $P(w_1, \ldots, w_n)$ of word sequences is modeled with the Markov assumption (bigram model):
$$P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
- Obtain the conditional probabilities by analyzing large texts.
- Can be improved by a model of the language grammar.

Summary
- Temporal/sequential reasoning is achieved by random processes.
- Markov chains: the current state depends only on the previous state.
- Markov chains are widely used in simulations.
- Hidden Markov models are used when states are not observable.
- State estimation and most probable path by dynamic programming: time and space complexity only linear in the sequence length.
- Sample HMM application: speech recognition
