Course Overview

Lecture 11: Dynamic Bayesian Networks and Hidden Markov Models; Decision Trees
Marco Chiarandini
Department of Mathematics & Computer Science
University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Introduction: Artificial Intelligence, Intelligent Agents
Search: Uninformed Search, Heuristic Search
Adversarial Search: Minimax search, Alpha-beta pruning
Knowledge Representation and Reasoning: Propositional logic, First-order logic, Inference
Uncertain Knowledge and Reasoning: Probability and the Bayesian approach, Bayesian Networks, Hidden Markov Models, Kalman Filters
Learning: Decision Trees, Maximum Likelihood, EM Algorithm, Bayesian Networks, Neural Networks, Support Vector Machines

Performance of approximation algorithms: summary

Absolute approximation: |P(X|e) − P̂(X|e)| ≤ ɛ
Relative approximation: |P(X|e) − P̂(X|e)| / P(X|e) ≤ ɛ
Relative ⇒ absolute, since 0 ≤ P ≤ 1 (but P may be O(2^−n))
Randomized algorithms may fail with probability at most δ
Polytime approximation: poly(n, ɛ^−1, log δ^−1)
Theorem (Dagum and Luby, 1993): both absolute and relative approximation, for either deterministic or randomized algorithms, are NP-hard for any ɛ, δ < 0.5
(Absolute approximation is polytime with no evidence, via Chernoff bounds)

Exact inference by variable elimination:
polytime on polytrees, NP-hard on general graphs
space = time, very sensitive to topology

Approximate inference by Likelihood Weighting (LW) and Markov Chain Monte Carlo (MCMC):
PriorSampling and RejectionSampling unusable as evidence grows
LW does poorly when there is lots of (late-in-the-order) evidence
LW and MCMC are generally insensitive to topology
Convergence can be very slow with probabilities close to 1 or 0
Can handle arbitrary combinations of discrete and continuous variables
Wumpus World

(Figure: the 4×4 wumpus grid; the agent has visited [1,1], [1,2], and [2,1], and the query square is [1,3].)

P_ij = true iff [i,j] contains a pit
B_ij = true iff [i,j] is breezy
Include only B_1,1, B_1,2, B_2,1 in the probability model

Specifying the probability model

The full joint distribution is P(P_1,1, ..., P_4,4, B_1,1, B_1,2, B_2,1)

Apply the product rule: P(B_1,1, B_1,2, B_2,1 | P_1,1, ..., P_4,4) P(P_1,1, ..., P_4,4)
(Do it this way to get P(Effect | Cause).)

First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:

P(P_1,1, ..., P_4,4) = ∏_{i,j = 1,1}^{4,4} P(P_i,j) = 0.2^n × 0.8^(16−n)  for n pits

Observations and query

We know the following facts:
b = ¬b_1,1 ∧ b_1,2 ∧ b_2,1
known = ¬p_1,1 ∧ ¬p_1,2 ∧ ¬p_2,1

Query is P(P_1,3 | known, b)
Define Unknown = the P_ij's other than P_1,3 and Known

For inference by enumeration, we have
P(P_1,3 | known, b) = α Σ_unknown P(P_1,3, unknown, known, b)
Grows exponentially with the number of squares!
Using conditional independence

Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares

(Figure: the grid partitioned into KNOWN, FRINGE, OTHER, and the QUERY square [1,3].)

Define Unknown = Fringe ∪ Other
P(b | P_1,3, Known, Unknown) = P(b | P_1,3, Known, Fringe)
Manipulate the query into a form where we can use this!

Using conditional independence contd.

P(P_1,3 | known, b) = α Σ_unknown P(P_1,3, unknown, known, b)
 = α Σ_unknown P(b | P_1,3, known, unknown) P(P_1,3, known, unknown)
 = α Σ_fringe Σ_other P(b | known, P_1,3, fringe, other) P(P_1,3, known, fringe, other)
 = α Σ_fringe Σ_other P(b | known, P_1,3, fringe) P(P_1,3, known, fringe, other)
 = α Σ_fringe P(b | known, P_1,3, fringe) Σ_other P(P_1,3, known, fringe, other)
 = α Σ_fringe P(b | known, P_1,3, fringe) Σ_other P(P_1,3) P(known) P(fringe) P(other)
 = α P(known) P(P_1,3) Σ_fringe P(b | known, P_1,3, fringe) P(fringe) Σ_other P(other)
 = α′ P(P_1,3) Σ_fringe P(b | known, P_1,3, fringe) P(fringe)

Using conditional independence contd.

(Figure: the three fringe models consistent with p_1,3, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16, and the two consistent with ¬p_1,3, with probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16.)

P(P_1,3 | known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
P(P_2,2 | known, b) ≈ ⟨0.86, 0.14⟩
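The fringe computation above can be sketched in a few lines of Python. This is a minimal sketch, not the slides' code: the function and variable names (`consistent`, `query_p13`, `P_PIT`) are ours, and the fringe is hard-coded to {P_2,2, P_3,1} for this particular query.

```python
# Sketch of P(P_1,3 | known, b) via the fringe formula from the slides.
# A fringe assignment is consistent iff every observed breeze ([1,2] and
# [2,1]) has an adjacent pit among {P_1,3, P_2,2, P_3,1}; the known
# squares [1,1], [1,2], [2,1] are pit-free.
from itertools import product

P_PIT = 0.2  # prior probability of a pit per square

def consistent(p13, p22, p31):
    breeze_12 = p13 or p22          # possible pit neighbours of [1,2]
    breeze_21 = p22 or p31          # possible pit neighbours of [2,1]
    return breeze_12 and breeze_21

def query_p13():
    scores = {}
    for p13 in (True, False):
        prior = P_PIT if p13 else 1 - P_PIT
        s = 0.0
        for p22, p31 in product((True, False), repeat=2):
            if consistent(p13, p22, p31):
                s += (P_PIT if p22 else 1 - P_PIT) * (P_PIT if p31 else 1 - P_PIT)
        scores[p13] = prior * s     # α' P(P_1,3) Σ_fringe P(b|...) P(fringe)
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

dist = query_p13()
print(dist)   # P(P_1,3 = true | known, b) ≈ 0.31
```

The sum over consistent fringes reproduces the slide's ⟨0.2 × 0.36, 0.8 × 0.20⟩ before normalization.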
Outline

Time and uncertainty
Inference: filtering, prediction, smoothing
Hidden Markov models
Kalman filters (a brief mention)
Dynamic Bayesian networks (an even briefer mention)

Time and uncertainty

The world changes; we need to track and predict it
Diabetes management vs vehicle diagnosis
Basic idea: copy state and evidence variables for each time step
X_t = set of unobservable state variables at time t
  e.g., BloodSugar_t, StomachContents_t, etc.
E_t = set of observable evidence variables at time t
  e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
This assumes discrete time; step size depends on the problem
Notation: X_a:b = X_a, X_a+1, ..., X_b−1, X_b

Markov processes (Markov chains)

Construct a Bayes net from these variables: without further assumptions we get
an unbounded number of parents and an unbounded number of conditional probability tables
Markov assumption: X_t depends on a bounded subset of X_0:t−1
First-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−1)
Second-order Markov process: P(X_t | X_0:t−1) = P(X_t | X_t−2, X_t−1)
(Figures: chain structures for first-order and second-order processes over X_t−2, X_t−1, X_t, X_t+1, X_t+2.)
Sensor Markov assumption: P(E_t | X_0:t, E_0:t−1) = P(E_t | X_t)
Stationary process: transition model P(X_t | X_t−1) and sensor model P(E_t | X_t) fixed for all t

Example

(Figure: the umbrella DBN, Rain_t−1 → Rain_t → Rain_t+1 with Umbrella_t observed at each step.)
Transition model P(R_t | R_t−1): 0.7 if R_t−1 = t, 0.3 if R_t−1 = f
Sensor model P(U_t | R_t): 0.9 if R_t = t, 0.2 if R_t = f

First-order Markov assumption not exactly true in the real world!
Possible fixes:
1. Increase the order of the Markov process
2. Augment the state, e.g., add Temp_t, Pressure_t
Example: robot motion. Augment position and velocity with Battery_t
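As a minimal sketch (our own dictionary representation, not the slides' notation), the umbrella transition model and one step of prediction look like this:

```python
# One-step prediction under the first-order Markov model of the slides:
# P(R_t) = Σ_r P(R_t | R_{t-1} = r) P(R_{t-1} = r).
T = {True: {True: 0.7, False: 0.3},   # P(R_t | R_{t-1} = true)
     False: {True: 0.3, False: 0.7}}  # P(R_t | R_{t-1} = false)

def predict(p_prev):
    """Sum out the previous state to get the predicted distribution."""
    return {r1: sum(T[r0][r1] * p_prev[r0] for r0 in p_prev)
            for r1 in (True, False)}

p = predict({True: 1.0, False: 0.0})  # it rained yesterday
print(p[True])                        # P(rain today) = 0.7
```

Iterating `predict` without evidence converges to the stationary distribution ⟨0.5, 0.5⟩, which is why prediction alone becomes uninformative.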
Inference tasks

1. Filtering: P(X_t | e_1:t)
   belief state, input to the decision process of a rational agent
2. Prediction: P(X_t+k | e_1:t) for k > 0
   evaluation of possible action sequences; like filtering without the evidence
3. Smoothing: P(X_k | e_1:t) for 0 ≤ k < t
   better estimate of past states, essential for learning
4. Most likely explanation: arg max_{x_1:t} P(x_1:t | e_1:t)
   speech recognition, decoding with a noisy channel

Filtering

Aim: devise a recursive state estimation algorithm:
P(X_t+1 | e_1:t+1) = f(e_t+1, P(X_t | e_1:t))

P(X_t+1 | e_1:t+1) = P(X_t+1 | e_1:t, e_t+1)
 = α P(e_t+1 | X_t+1, e_1:t) P(X_t+1 | e_1:t)
 = α P(e_t+1 | X_t+1) P(X_t+1 | e_1:t)

I.e., prediction + estimation. Prediction by summing out X_t:

P(X_t+1 | e_1:t+1) = α P(e_t+1 | X_t+1) Σ_{x_t} P(X_t+1 | x_t, e_1:t) P(x_t | e_1:t)
 = α P(e_t+1 | X_t+1) Σ_{x_t} P(X_t+1 | x_t) P(x_t | e_1:t)

f_1:t+1 = Forward(f_1:t, e_t+1) where f_1:t = P(X_t | e_1:t)
Time and space constant (independent of t) by keeping track of f

Filtering example

(Figure: the umbrella DBN with messages: P(R_1 | u_1) = ⟨0.818, 0.182⟩; prediction P(R_2 | u_1) = ⟨0.627, 0.373⟩; P(R_2 | u_1:2) = ⟨0.883, 0.117⟩.)

Smoothing

Divide evidence e_1:t into e_1:k, e_k+1:t:
P(X_k | e_1:t) = P(X_k | e_1:k, e_k+1:t)
 = α P(X_k | e_1:k) P(e_k+1:t | X_k, e_1:k)
 = α P(X_k | e_1:k) P(e_k+1:t | X_k)
 = α f_1:k × b_k+1:t

Backward message computed by a backwards recursion:
P(e_k+1:t | X_k) = Σ_{x_k+1} P(e_k+1:t | X_k, x_k+1) P(x_k+1 | X_k)
 = Σ_{x_k+1} P(e_k+1:t | x_k+1) P(x_k+1 | X_k)
 = Σ_{x_k+1} P(e_k+1 | x_k+1) P(e_k+2:t | x_k+1) P(x_k+1 | X_k)
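The forward recursion above can be sketched directly in Python for the umbrella world; it reproduces the numbers in the filtering example. This is a minimal sketch with names of our choosing (`T`, `O`, `forward`), not the slides' pseudocode.

```python
# Forward (filtering) recursion for the umbrella world:
# f_{1:t+1} = α P(e_{t+1}|X_{t+1}) Σ_xt P(X_{t+1}|x_t) f_{1:t}(x_t)
T = {True: {True: 0.7, False: 0.3},   # P(R_t | R_{t-1})
     False: {True: 0.3, False: 0.7}}
O = {True: 0.9, False: 0.2}           # P(umbrella = true | R_t)

def forward(f, umbrella):
    # prediction: sum out the previous state
    pred = {r1: sum(T[r0][r1] * f[r0] for r0 in f) for r1 in (True, False)}
    # estimation: weight by the sensor model, then normalize
    lik = {r: (O[r] if umbrella else 1 - O[r]) * pred[r] for r in pred}
    z = sum(lik.values())
    return {r: p / z for r, p in lik.items()}

f = {True: 0.5, False: 0.5}   # prior P(R_0)
f = forward(f, True)          # after u_1: P(R_1|u_1) ≈ <0.818, 0.182>
day1 = f[True]
f = forward(f, True)          # after u_2: P(R_2|u_1:2) ≈ <0.883, 0.117>
day2 = f[True]
```

Note that only the current message `f` is kept, which is exactly why time and space are constant in t.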
Smoothing example

(Figure: forward messages ⟨0.818, 0.182⟩ and ⟨0.883, 0.117⟩; backward messages ⟨0.690, 0.410⟩ and ⟨1.000, 1.000⟩; smoothed estimates ⟨0.883, 0.117⟩.)

If we want to smooth the whole sequence:
Forward–backward algorithm: cache forward messages along the way
Time linear in t (polytree inference), space O(t |f|)

Most likely explanation

Most likely sequence ≠ sequence of most likely states (joint distribution)!
Most likely path to each x_t+1 = most likely path to some x_t plus one more step

max_{x_1...x_t} P(x_1, ..., x_t, X_t+1 | e_1:t+1)
 = P(e_t+1 | X_t+1) max_{x_t} ( P(X_t+1 | x_t) max_{x_1...x_t−1} P(x_1, ..., x_t−1, x_t | e_1:t) )

Identical to filtering, except f_1:t is replaced by
m_1:t = max_{x_1...x_t−1} P(x_1, ..., x_t−1, X_t | e_1:t),
i.e., m_1:t(i) gives the probability of the most likely path to state i.
Update has the sum replaced by max, giving the Viterbi algorithm:
m_1:t+1 = P(e_t+1 | X_t+1) max_{x_t} ( P(X_t+1 | x_t) m_1:t )

Viterbi example

(Figure: state-space trellis over Rain_1, ..., Rain_5 for umbrella observations true, true, false, true, true, showing the most likely path to each state, with messages
m_1:1 = ⟨.8182, .1818⟩, m_1:2 = ⟨.5155, .0491⟩, m_1:3 = ⟨.0361, .1237⟩, m_1:4 = ⟨.0334, .0173⟩, m_1:5 = ⟨.0210, .0024⟩.)

Hidden Markov models

X_t is a single, discrete variable (usually E_t is too)
Domain of X_t is {1, ..., S}
Transition matrix T_ij = P(X_t = j | X_t−1 = i), e.g., (0.7 0.3; 0.3 0.7)
Sensor matrix O_t for each time step, diagonal elements P(e_t | X_t = i),
e.g., with U_1 = true, O_1 = (0.9 0; 0 0.2)
Forward and backward messages as column vectors:
f_1:t+1 = α O_t+1 T⊤ f_1:t
b_k+1:t = T O_k+1 b_k+2:t
Forward–backward algorithm needs time O(S² t) and space O(St)
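The Viterbi update can be sketched as follows for the umbrella world; it reproduces the message values of the Viterbi example. This is our own sketch (names `viterbi`, `back`, `m` are ours), assuming the observation sequence true, true, false, true, true and normalizing only the first message, as in the figure.

```python
# Viterbi: m_{1:t+1} = P(e_{t+1}|X_{t+1}) max_xt ( P(X_{t+1}|x_t) m_{1:t} ),
# with back-pointers to recover the most likely state sequence.
T = {True: {True: 0.7, False: 0.3},   # P(R_t | R_{t-1})
     False: {True: 0.3, False: 0.7}}
O = {True: 0.9, False: 0.2}           # P(umbrella = true | R_t)

def viterbi(observations):
    prior = {True: 0.5, False: 0.5}
    e1 = observations[0]
    m = {r: (O[r] if e1 else 1 - O[r]) * prior[r] for r in prior}
    z = sum(m.values())
    m = {r: v / z for r, v in m.items()}      # m_1:1, normalized
    back = []                                  # back-pointers per step
    for e in observations[1:]:
        best = {r1: max((T[r0][r1] * m[r0], r0) for r0 in m)
                for r1 in (True, False)}
        back.append({r1: best[r1][1] for r1 in best})
        m = {r1: (O[r1] if e else 1 - O[r1]) * best[r1][0] for r1 in best}
    state = max(m, key=m.get)                  # best final state
    path = [state]
    for ptrs in reversed(back):                # follow back-pointers
        state = ptrs[state]
        path.append(state)
    return m, path[::-1]

m, path = viterbi([True, True, False, True, True])
# path = rain, rain, no-rain, rain, rain
```

The final message matches the figure's m_1:5 ≈ ⟨.0210, .0024⟩, and the recovered path is rain, rain, no-rain, rain, rain.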
Kalman filters

Modelling systems described by a set of continuous variables,
e.g., tracking a bird flying: X_t = X, Y, Z, Ẋ, Ẏ, Ż.
Airplanes, robots, ecosystems, economies, chemical plants, planets, ...
Gaussian prior, linear Gaussian transition model and sensor model

Updating Gaussian distributions

Prediction step: if P(X_t | e_1:t) is Gaussian, then the prediction
P(X_t+1 | e_1:t) = ∫_{x_t} P(X_t+1 | x_t) P(x_t | e_1:t) dx_t
is Gaussian. If P(X_t+1 | e_1:t) is Gaussian, then the updated distribution
P(X_t+1 | e_1:t+1) = α P(e_t+1 | X_t+1) P(X_t+1 | e_1:t)
is Gaussian.
Hence P(X_t | e_1:t) is multivariate Gaussian N(µ_t, Σ_t) for all t
General (nonlinear, non-Gaussian) process: the description of the posterior grows unboundedly as t → ∞

2-D tracking example

(Figures: 2-D filtering and 2-D smoothing, plotting the observed trajectory against the filtered and smoothed trajectories in the X–Y plane.)
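A minimal 1-D sketch of the Gaussian prediction-and-update cycle, assuming a random-walk transition with noise variance `s_x` and a noisy sensor with variance `s_z`. The closed-form expressions below are standard Kalman algebra; the slides state only that the result stays Gaussian, so the concrete formulas here are supplied by us.

```python
# One cycle of 1-D Kalman filtering on the mean/variance pair (mu, s):
# predict (variance grows by transition noise), then update on observation z.
def kalman_1d(mu, s, z, s_x, s_z):
    s_pred = s + s_x                 # prediction: P(X_{t+1}|e_1:t)
    k = s_pred / (s_pred + s_z)      # Kalman gain: weight on the observation
    mu_new = mu + k * (z - mu)       # precision-weighted average
    s_new = (1 - k) * s_pred         # posterior variance shrinks
    return mu_new, s_new

mu, s = 0.0, 1.0                     # Gaussian prior N(0, 1)
mu, s = kalman_1d(mu, s, z=2.5, s_x=0.5, s_z=1.0)
print(mu, s)                         # 1.5 0.6
```

Note the variance update is independent of the observations, so Σ_t can be computed offline.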
Where it breaks

Cannot be applied if the transition model is nonlinear
Extended Kalman Filter models the transition as locally linear around x_t = µ_t
Fails if the system is locally unsmooth

Dynamic Bayesian networks

X_t, E_t contain arbitrarily many variables in a replicated Bayes net
(Figure: the umbrella DBN extended with Battery_t, position X_t, and a Meter sensor; CPTs P(R_0) = 0.7, P(R_1 | R_0): 0.7 / 0.3, P(U_1 | R_1): 0.9 / 0.2.)

DBNs vs. HMMs

Every HMM is a single-variable DBN; every discrete DBN is an HMM
Sparse dependencies ⇒ exponentially fewer parameters;
e.g., with 20 state variables, three parents each:
DBN has 20 × 2³ = 160 parameters, HMM has 2²⁰ × 2²⁰ ≈ 10¹²

DBNs vs. Kalman filters

Every Kalman filter model is a DBN, but few DBNs are KFs;
the real world requires non-Gaussian posteriors
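The parameter-count comparison on the slide is a one-line calculation; the variable names below are ours, mirroring the slide's example of 20 boolean state variables with three parents each.

```python
# DBN: one CPT row per parent assignment per variable.
# HMM: a full joint transition matrix over all 2^n states.
n, parents = 20, 3
dbn_params = n * 2 ** parents    # 20 * 8 = 160
hmm_params = 2 ** n * 2 ** n     # 2^40 ≈ 1.1e12
print(dbn_params, hmm_params)
```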
Summary

Temporal models use state and sensor variables replicated over time
Markov assumptions and stationarity assumption, so we need
a transition model P(X_t | X_t−1)
a sensor model P(E_t | X_t)
Tasks are filtering, prediction, smoothing, most likely sequence;
all done recursively with constant cost per time step
Hidden Markov models have a single discrete state variable; used for speech recognition
Kalman filters allow n state variables, linear Gaussian, O(n³) update
Dynamic Bayes nets subsume HMMs, Kalman filters; exact update intractable

Outline

Speech as probabilistic inference
Speech sounds
Word pronunciation
Word sequences

Speech as probabilistic inference

Speech signals are noisy, variable, ambiguous
What is the most likely word sequence, given the speech signal?
I.e., choose Words to maximize P(Words | signal)
Use Bayes' rule: P(Words | signal) = α P(signal | Words) P(Words)
I.e., decomposes into acoustic model + language model
Words are the hidden state sequence, signal is the observation sequence
Phones

All human speech is composed from 40-50 phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)
Form an intermediate level of hidden states between words and signal:
acoustic model = pronunciation model + phone model
ARPAbet designed for American English:

[iy] beat     [b] bet     [p] pet
[ih] bit      [ch] Chet   [r] rat
[ey] bait     [d] debt    [s] set
[ao] bought   [hh] hat    [th] thick
[ow] boat     [hv] high   [dh] that
[er] Bert     [l] let     [w] wet
[ix] roses    [ng] sing   [en] button
...

E.g., "ceiling" is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

Word pronunciation models

Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model
(Figure: the "tomato" model: from [t], go to [ow] with 0.2 or [ah] with 0.8; then [m]; from [m], go to [ey] or [aa] with 0.5 each; then [t]; then [ow].)

P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.1
P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.4

Structure is created manually, transition probabilities learned from data

Isolated words

Phone models + word models fix the likelihood P(e_1:t | word) for each isolated word
P(word | e_1:t) = α P(e_1:t | word) P(word)
Prior probability P(word) obtained simply by counting word frequencies
P(e_1:t | word) can be computed recursively: define
ℓ_1:t = P(X_t, e_1:t)
and use the recursive update
ℓ_1:t+1 = Forward(ℓ_1:t, e_t+1)
and then P(e_1:t | word) = Σ_{x_t} ℓ_1:t(x_t)
Isolated-word dictation systems with training reach 95-99% accuracy

Continuous speech

Not just a sequence of isolated-word recognition problems!
Adjacent words are highly correlated
Sequence of most likely words ≠ most likely sequence of words
Segmentation: there are few gaps in speech
Cross-word coarticulation, e.g., "next thing"
Continuous speech systems manage 60-80% accuracy on a good day
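The pronunciation probabilities above follow by multiplying along a path through the HMM. A minimal sketch, with state names of our own choosing (the two [t]/[ow] occurrences get distinct states so the model stays a proper graph):

```python
# Transition structure of the "tomato" pronunciation HMM from the slides.
# The only choice points are [ow]/[ah] after the first [t] (0.2 / 0.8)
# and [ey]/[aa] after [m] (0.5 / 0.5); everything else is deterministic.
TRANS = {
    't0':  {'ow1': 0.2, 'ah1': 0.8},
    'ow1': {'m': 1.0}, 'ah1': {'m': 1.0},
    'm':   {'ey': 0.5, 'aa': 0.5},
    'ey':  {'t1': 1.0}, 'aa': {'t1': 1.0},
    't1':  {'ow2': 1.0},
}

def path_prob(path):
    """Probability of one phone-state path: product of transitions."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= TRANS[a][b]
    return p

p_towmeytow = path_prob(['t0', 'ow1', 'm', 'ey', 't1', 'ow2'])  # 0.2 * 0.5
p_tahmaatow = path_prob(['t0', 'ah1', 'm', 'aa', 't1', 'ow2'])  # 0.8 * 0.5
```

This reproduces the slide's values: 0.2 × 0.5 = 0.1 for [towmeytow] and 0.8 × 0.5 = 0.4 for [tahmaatow].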
Language model

Prior probability of a word sequence is given by the chain rule:
P(w_1 ... w_n) = ∏_{i=1}^n P(w_i | w_1 ... w_i−1)
Bigram model:
P(w_i | w_1 ... w_i−1) ≈ P(w_i | w_i−1)
Train by counting all word pairs in a large text corpus
More sophisticated models (trigrams, grammars, etc.) help a little bit

Combined HMM

States of the combined language+word+phone model are labelled by
the word we're in + the phone in that word + the phone state in that phone
Viterbi algorithm finds the most likely phone state sequence
Does segmentation by considering all possible word sequences and boundaries
Doesn't always give the most likely word sequence, because
each word sequence is the sum over many state sequences
Jelinek invented A* in 1969: a way to find the most likely word sequence
where "step cost" is −log P(w_i | w_i−1)

Outline

1. Learning agents
2. Inductive learning
3. Decision tree learning
4. Measuring learning performance
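Training a bigram model by counting word pairs can be sketched in a few lines; the toy corpus, the `<s>` start token, and the function name `train_bigram` are our own illustration, not from the slides.

```python
# Bigram training: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}, ·)
from collections import Counter, defaultdict

def train_bigram(sentences):
    pair_counts = defaultdict(Counter)
    for sent in sentences:
        words = ['<s>'] + sent.split()   # '<s>' marks the sentence start
        for w_prev, w in zip(words, words[1:]):
            pair_counts[w_prev][w] += 1
    # normalize each row of counts into a conditional distribution
    return {prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for prev, cnt in pair_counts.items()}

model = train_bigram(['the cat sat', 'the cat ran', 'a dog ran'])
# model['the']['cat'] == 1.0; model['cat']['sat'] == 0.5
```

A real system would smooth these estimates (unseen pairs get probability zero here), which is one reason trigrams and grammars only "help a little bit" without careful estimation.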
Learning agents

Back to Turing's article:
- child mind program
- education

Learning is essential for unknown environments,
i.e., when the designer lacks omniscience
Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down
Learning modifies the agent's decision mechanisms to improve performance

(Figure: the learning agent architecture. A Critic compares sensor input against a performance standard (reward & punishment) and sends feedback to the learning element; the learning element sets learning goals, passes changes in knowledge to the performance element, and receives suggested experiments from the problem generator; the performance element maps percepts from sensors to actions via effectors in the environment.)

Learning element

Design of the learning element is dictated by
- what type of performance element is used
- which functional component is to be learned
- how that functional component is represented
- what kind of feedback is available

Example scenarios:

Performance element | Component          | Representation           | Feedback
Alpha-beta search   | Eval. fn.          | Weighted linear function | Win/loss
Logical agent       | Transition model   | Successor-state axioms   | Outcome
Utility-based agent | Transition model   | Dynamic Bayes net        | Outcome
Simple reflex agent | Percept-action fn. | Neural net               | Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards