Dynamic Bayesian Networks and Hidden Markov Models Decision Trees


1 Lecture 11 Dynamic Bayesian Networks and Hidden Markov Models Decision Trees Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Slides by Stuart Russell and Peter Norvig

2 Course Overview
Introduction: Artificial Intelligence, Intelligent Agents
Search: Uninformed Search, Heuristic Search, Adversarial Search (Minimax search, Alpha-beta pruning)
Knowledge representation and Reasoning: Propositional logic, First order logic, Inference
Uncertain knowledge and Reasoning: Probability and Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
Machine Learning: Decision Trees, Maximum Likelihood, EM Algorithm, Bayesian Networks, Neural Networks, Support vector machines

3 Performance of approximation algorithms
Absolute approximation: |P(X | e) - P̂(X | e)| ≤ ε
Relative approximation: |P(X | e) - P̂(X | e)| / P(X | e) ≤ ε
Relative ⇒ absolute since 0 ≤ P ≤ 1 (but not vice versa: P(X | e) may be O(2^{-n}))
Randomized algorithms may fail with probability at most δ
Polytime approximation: poly(n, 1/ε, log(1/δ))
Theorem (Dagum and Luby, 1993): both absolute and relative approximation for either deterministic or randomized algorithms are NP-hard for any ε, δ < 0.5
(Absolute approximation is polytime with no evidence, by Chernoff bounds)

9 Summary
Exact inference by variable elimination:
polytime on polytrees, NP-hard on general graphs
space = time, very sensitive to topology
Approximate inference by Likelihood Weighting (LW) and Markov Chain Monte Carlo (MCMC):
PriorSampling and RejectionSampling unusable as evidence grows
LW does poorly when there is lots of (late-in-the-order) evidence
LW, MCMC generally insensitive to topology
Convergence can be very slow with probabilities close to 1 or 0
Can handle arbitrary combinations of discrete and continuous variables

10 Outline

11 Wumpus World
[Figure: 4 x 4 Wumpus grid; breezes observed in [1,2] and [2,1]; squares [1,1], [1,2], [2,1] are known to be OK]
P_{i,j} = true iff [i,j] contains a pit
B_{i,j} = true iff [i,j] is breezy
Include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model

12 Specifying the probability model
The full joint distribution is P(P_{1,1}, ..., P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})
Apply the product rule: P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, ..., P_{4,4}) P(P_{1,1}, ..., P_{4,4})
(Do it this way to get P(Effect | Cause).)
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P_{1,1}, ..., P_{4,4}) = Π_{i,j = 1,1}^{4,4} P(P_{i,j}) = 0.2^n × 0.8^{16-n} for n pits

14 Observations and query
We know the following facts:
b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}
Query is P(P_{1,3} | known, b)
Define Unknown = the P_{i,j}'s other than P_{1,3} and Known
For inference by enumeration, we have
P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
Grows exponentially with the number of squares!

16 Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares
[Figure: the grid partitioned into KNOWN squares, the FRINGE (unknown squares bordering the visited region), the QUERY square, and OTHER squares]
Define Unknown = Fringe ∪ Other
P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Fringe)
Manipulate the query into a form where we can use this!

18 Using conditional independence contd.
P(P_{1,3} | known, b)
 = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
 = α Σ_{unknown} P(b | P_{1,3}, known, unknown) P(P_{1,3}, known, unknown)
 = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe, other) P(P_{1,3}, known, fringe, other)
 = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe) P(P_{1,3}, known, fringe, other)
 = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}, known, fringe, other)
 = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}) P(known) P(fringe) P(other)
 = α P(known) P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe) Σ_{other} P(other)
 = α′ P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe)

26 Using conditional independence contd.
[Figure: the five consistent fringe models, three with a pit in [1,3] (probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16) and two without (0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16)]
P(P_{1,3} | known, b) = α ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
P(P_{2,2} | known, b) ≈ ⟨0.86, 0.14⟩
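
The arithmetic on this slide can be checked by direct enumeration of the fringe models. A minimal Python sketch, assuming the layout of this example (breezes observed in [1,2] and [2,1], fringe squares [2,2] and [3,1], pit prior 0.2); the square names and helper functions are illustrative only, not part of the slides:

```python
from itertools import product

PIT_PRIOR = 0.2
# Squares adjacent to each observed breeze; [1,1], [1,2], [2,1] are known pit-free.
BREEZE_NEIGHBOURS = {"b12": ["2,2", "1,3"], "b21": ["2,2", "3,1"]}
FRINGE = ["2,2", "3,1"]   # unknown squares bordering the visited region
QUERY = "1,3"

def consistent(pits):
    """Every observed breeze must have at least one adjacent pit."""
    return all(any(sq in pits for sq in adj) for adj in BREEZE_NEIGHBOURS.values())

def fringe_weight(query_has_pit):
    """Sum P(fringe assignment) over assignments consistent with the breezes."""
    total = 0.0
    for assignment in product([True, False], repeat=len(FRINGE)):
        pits = {sq for sq, has_pit in zip(FRINGE, assignment) if has_pit}
        if query_has_pit:
            pits.add(QUERY)
        if consistent(pits):
            weight = 1.0
            for has_pit in assignment:
                weight *= PIT_PRIOR if has_pit else 1 - PIT_PRIOR
            total += weight
    return total

unnormalized = [PIT_PRIOR * fringe_weight(True), (1 - PIT_PRIOR) * fringe_weight(False)]
alpha = 1 / sum(unnormalized)
print([round(alpha * v, 2) for v in unnormalized])   # [0.31, 0.69]
```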

27 Outline

28 Outline Time and uncertainty Inference: filtering, prediction, smoothing Hidden Markov models Kalman filters (a brief mention) Dynamic Bayesian networks (an even briefer mention) 13

29 Time and uncertainty
The world changes; we need to track and predict it
Diabetes management vs vehicle diagnosis
Basic idea: copy state and evidence variables for each time step
X_t = set of unobservable state variables at time t, e.g., BloodSugar_t, StomachContents_t, etc.
E_t = set of observable evidence variables at time t, e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
This assumes discrete time; step size depends on the problem
Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b-1}, X_b

34 Markov processes (Markov chains)
Construct a Bayes net from these variables: as stated, it has an unbounded number of parents and an unbounded number of conditional probability tables
Markov assumption: X_t depends on a bounded subset of X_{0:t-1}
First-order Markov process: P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})
Second-order Markov process: P(X_t | X_{0:t-1}) = P(X_t | X_{t-2}, X_{t-1})
[Figure: first-order chain X_{t-2} → X_{t-1} → X_t → X_{t+1} → X_{t+2}, and second-order chain with additional arcs skipping one step]
Sensor Markov assumption: P(E_t | X_{0:t}, E_{0:t-1}) = P(E_t | X_t)
Stationary process: transition model P(X_t | X_{t-1}) and sensor model P(E_t | X_t) fixed for all t

39 Example
[Figure: umbrella world DBN: Rain_{t-1} → Rain_t → Rain_{t+1}, with Umbrella_t depending on Rain_t; CPTs: P(R_t = true | R_{t-1}) is 0.7 if R_{t-1} is true and 0.3 otherwise, P(U_t = true | R_t) is 0.9 if R_t is true and 0.2 otherwise]
First-order Markov assumption not exactly true in the real world!
Possible fixes:
1. Increase order of Markov process
2. Augment state, e.g., add Temp_t, Pressure_t
Example: robot motion. Augment position and velocity with Battery_t
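
To make the stationary first-order model concrete, here is a minimal generative sketch of the umbrella world in Python; the 0.7/0.3 transition and 0.9/0.2 sensor probabilities are the standard values assumed for this example:

```python
import random

P_RAIN_GIVEN_RAIN = {True: 0.7, False: 0.3}      # transition model P(R_t = true | R_{t-1})
P_UMBRELLA_GIVEN_RAIN = {True: 0.9, False: 0.2}  # sensor model P(U_t = true | R_t)

def sample_sequence(t_max, p_rain0=0.5):
    """Sample (rain, umbrella) pairs from the stationary first-order model."""
    rain = random.random() < p_rain0
    sequence = []
    for _ in range(t_max):
        rain = random.random() < P_RAIN_GIVEN_RAIN[rain]           # Markov transition
        umbrella = random.random() < P_UMBRELLA_GIVEN_RAIN[rain]   # sensor reading
        sequence.append((rain, umbrella))
    return sequence

print(sample_sequence(5))
```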

41 Inference tasks
1. Filtering: P(X_t | e_{1:t}), the belief state, input to the decision process of a rational agent
2. Prediction: P(X_{t+k} | e_{1:t}) for k > 0, evaluation of possible action sequences; like filtering without the evidence
3. Smoothing: P(X_k | e_{1:t}) for 0 ≤ k < t, a better estimate of past states, essential for learning
4. Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t}), speech recognition, decoding with a noisy channel

42 Filtering
Aim: devise a recursive state estimation algorithm:
P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))

P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
 = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
 = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})
i.e., prediction + estimation.
Prediction by summing out X_t:
P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
 = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
f_{1:t+1} = FORWARD(f_{1:t}, e_{t+1}) where f_{1:t} = P(X_t | e_{1:t})
Time and space requirements are constant (independent of t) by keeping track of f

45 Filtering example
[Figure: the umbrella DBN with its CPTs and bar charts of the filtered probabilities P(Rain_t) for t = 0, 1, 2, with the umbrella observed on days 1 and 2]
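
A minimal sketch of the FORWARD update for this two-state example, assuming the standard umbrella-world parameters (0.7/0.3 transition, 0.9/0.2 sensor) and a uniform prior; with umbrellas observed on days 1 and 2 it gives the familiar filtered values of roughly 0.818 and 0.883 for P(rain):

```python
P_TRANS = {True: {True: 0.7, False: 0.3},    # P(R_t = inner key | R_{t-1} = outer key)
           False: {True: 0.3, False: 0.7}}
P_SENSOR = {True: 0.9, False: 0.2}           # P(U_t = true | R_t)

def forward(f, umbrella):
    """One filtering step: predict by summing out x_t, then weight by the evidence."""
    new_f = {}
    for x1 in (True, False):
        prediction = sum(P_TRANS[x0][x1] * f[x0] for x0 in (True, False))
        likelihood = P_SENSOR[x1] if umbrella else 1 - P_SENSOR[x1]
        new_f[x1] = likelihood * prediction
    norm = sum(new_f.values())
    return {x: p / norm for x, p in new_f.items()}

f = {True: 0.5, False: 0.5}        # prior P(R_0)
for u in [True, True]:             # umbrella observed on days 1 and 2
    f = forward(f, u)
    print(round(f[True], 3))       # 0.818, then 0.883
```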

47 Smoothing
[Figure: chain X_0, X_1, ..., X_k, ..., X_t with evidence E_1, ..., E_k, ..., E_t; smoothing estimates X_k from all the evidence]
Divide evidence e_{1:t} into e_{1:k}, e_{k+1:t}:
P(X_k | e_{1:t}) = P(X_k | e_{1:k}, e_{k+1:t})
 = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k, e_{1:k})
 = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k)
 = α f_{1:k} b_{k+1:t}
Backward message computed by a backwards recursion:
P(e_{k+1:t} | X_k) = Σ_{x_{k+1}} P(e_{k+1:t} | X_k, x_{k+1}) P(x_{k+1} | X_k)
 = Σ_{x_{k+1}} P(e_{k+1:t} | x_{k+1}) P(x_{k+1} | X_k)
 = Σ_{x_{k+1}} P(e_{k+1} | x_{k+1}) P(e_{k+2:t} | x_{k+1}) P(x_{k+1} | X_k)

50 Smoothing example
[Figure: forward, backward, and smoothed probabilities for the umbrella world over the first two days]
If we want to smooth the whole sequence:
Forward-backward algorithm: cache forward messages along the way
Time linear in t (polytree inference), space O(t |f|)
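
A sketch of the backward recursion and the smoothed estimate α f_{1:k} b_{k+1:t} for the same two-state model, again assuming the standard umbrella-world parameters; it smooths day 1 given the two umbrella observations:

```python
P_TRANS = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_SENSOR = {True: 0.9, False: 0.2}

def backward(b, umbrella):
    """One step of the backward recursion for P(e_{k+1:t} | X_k)."""
    return {x0: sum((P_SENSOR[x1] if umbrella else 1 - P_SENSOR[x1])
                    * b[x1] * P_TRANS[x0][x1] for x1 in (True, False))
            for x0 in (True, False)}

# Forward message after day 1, from the filtering sketch: P(R_1 | u_1)
f_1 = {True: 0.818, False: 0.182}
# Backward message b_{2:2} = P(u_2 | R_1), starting from the all-ones b_{3:2}
b = backward({True: 1.0, False: 1.0}, umbrella=True)
unnormalized = {x: f_1[x] * b[x] for x in (True, False)}
norm = sum(unnormalized.values())
print({x: round(p / norm, 3) for x, p in unnormalized.items()})  # smoothed P(R_1) ≈ 0.883
```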

51 Most likely explanation
Most likely sequence ≠ sequence of most likely states (it is a property of the joint distribution)!
Most likely path to each x_{t+1} = most likely path to some x_t plus one more step
max_{x_1 ... x_t} P(x_1, ..., x_t, X_{t+1} | e_{1:t+1})
 = P(e_{t+1} | X_{t+1}) max_{x_t} ( P(X_{t+1} | x_t) max_{x_1 ... x_{t-1}} P(x_1, ..., x_{t-1}, x_t | e_{1:t}) )
Identical to filtering, except f_{1:t} is replaced by
m_{1:t} = max_{x_1 ... x_{t-1}} P(x_1, ..., x_{t-1}, X_t | e_{1:t}),
i.e., m_{1:t}(i) gives the probability of the most likely path to state i.
Update has the sum replaced by max, giving the Viterbi algorithm:
m_{1:t+1} = P(e_{t+1} | X_{t+1}) max_{x_t} ( P(X_{t+1} | x_t) m_{1:t} )

55 Viterbi example
[Figure: Viterbi trellis for the umbrella world with observations umbrella = true, true, false, true, true; bold arrows mark the most likely path into each state, with messages m_{1:1}, ..., m_{1:5}]
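
A sketch of the Viterbi recursion for the umbrella world, again assuming the standard parameters and a uniform prior; it recovers the most likely rain sequence for the observation sequence in the figure (true, true, false, true, true):

```python
P_TRANS = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_SENSOR = {True: 0.9, False: 0.2}

def viterbi(observations, prior=None):
    """Return the most likely state sequence given the umbrella observations."""
    prior = prior or {True: 0.5, False: 0.5}
    # m_{1:1}: likelihood of e_1 times P(X_1), obtained from the prior over X_0.
    u = observations[0]
    p_x1 = {x1: sum(P_TRANS[x0][x1] * prior[x0] for x0 in (True, False))
            for x1 in (True, False)}
    m = {x: (P_SENSOR[x] if u else 1 - P_SENSOR[x]) * p_x1[x] for x in (True, False)}
    back_pointers = []
    for u in observations[1:]:
        new_m, pointers = {}, {}
        for x1 in (True, False):
            best_x0 = max((True, False), key=lambda x0: P_TRANS[x0][x1] * m[x0])
            likelihood = P_SENSOR[x1] if u else 1 - P_SENSOR[x1]
            new_m[x1] = likelihood * P_TRANS[best_x0][x1] * m[best_x0]
            pointers[x1] = best_x0
        back_pointers.append(pointers)
        m = new_m
    # Follow the back-pointers from the best final state.
    path = [max(m, key=m.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi([True, True, False, True, True]))  # [True, True, False, True, True]
```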

56 Hidden Markov models
X_t is a single, discrete variable (usually E_t is too); domain of X_t is {1, ..., S}
Transition matrix T_{ij} = P(X_t = j | X_{t-1} = i), e.g., for the umbrella world T = ((0.7, 0.3), (0.3, 0.7))
Sensor matrix O_t for each time step, diagonal elements P(e_t | X_t = i), e.g., with U_1 = true, O_1 = diag(0.9, 0.2)
Forward and backward messages as column vectors:
f_{1:t+1} = α O_{t+1} T^⊤ f_{1:t}
b_{k+1:t} = T O_{k+1} b_{k+2:t}
Forward-backward algorithm needs time O(S^2 t) and space O(S t)
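
The same forward and backward updates in matrix form, as a small numpy sketch; T and the diagonal O matrices use the umbrella-world numbers assumed above:

```python
import numpy as np

T = np.array([[0.7, 0.3],      # T[i, j] = P(X_t = j | X_{t-1} = i); state 0 = rain
              [0.3, 0.7]])
O_true = np.diag([0.9, 0.2])   # diagonal sensor matrix for U_t = true
O_false = np.diag([0.1, 0.8])  # and for U_t = false

def forward_step(f, O):
    """f_{1:t+1} = alpha * O_{t+1} T^T f_{1:t}"""
    f = O @ T.T @ f
    return f / f.sum()

def backward_step(b, O):
    """b_{k+1:t} = T O_{k+1} b_{k+2:t}"""
    return T @ O @ b

f = np.array([0.5, 0.5])       # prior over the state
f = forward_step(f, O_true)    # day 1: umbrella observed
f = forward_step(f, O_true)    # day 2: umbrella observed
print(np.round(f, 3))          # approximately [0.883, 0.117]
```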

57 Kalman filters
Modelling systems described by a set of continuous variables, e.g., tracking a bird flying: X_t = X, Y, Z, Ẋ, Ẏ, Ż
Airplanes, robots, ecosystems, economies, chemical plants, planets, ...
[Figure: Kalman filter as a DBN: hidden state X_t (position and velocity) evolves to X_{t+1}, with noisy observations Z_t]
Gaussian prior, linear Gaussian transition model and sensor model

58 Updating Gaussian distributions
Prediction step: if P(X_t | e_{1:t}) is Gaussian, then the prediction
P(X_{t+1} | e_{1:t}) = ∫ P(X_{t+1} | x_t) P(x_t | e_{1:t}) dx_t
is Gaussian. If P(X_{t+1} | e_{1:t}) is Gaussian, then the updated distribution
P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})
is Gaussian.
Hence P(X_t | e_{1:t}) is multivariate Gaussian N(µ_t, Σ_t) for all t
General (nonlinear, non-Gaussian) process: the description of the posterior grows unboundedly as t → ∞
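
For the simplest one-dimensional case (a Gaussian random walk with transition noise variance σ_x^2 and sensor noise variance σ_z^2, a special case not spelled out on the slide), the prediction and update steps collapse to closed-form expressions for the new mean and variance; a hedged sketch:

```python
def kalman_1d(mu, sigma2, z, sigma_x2, sigma_z2):
    """One predict-and-update step for a 1-D Gaussian random walk.

    mu, sigma2 : current belief N(mu, sigma2) over the position
    z          : new noisy observation of the position
    sigma_x2   : transition (process) noise variance
    sigma_z2   : sensor noise variance
    """
    predicted_var = sigma2 + sigma_x2                                   # prediction step
    new_mu = (predicted_var * z + sigma_z2 * mu) / (predicted_var + sigma_z2)
    new_sigma2 = predicted_var * sigma_z2 / (predicted_var + sigma_z2)  # update step
    return new_mu, new_sigma2

belief = (0.0, 1.0)                        # prior N(0, 1)
for z in [2.5, 2.7, 3.1]:                  # noisy position readings
    belief = kalman_1d(*belief, z, sigma_x2=0.5, sigma_z2=1.0)
    print(tuple(round(v, 3) for v in belief))
```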

60 2-D tracking example: filtering
[Figure: 2-D tracking in the X-Y plane: true trajectory, noisy observations, and filtered estimate]

61 2-D tracking example: smoothing
[Figure: 2-D tracking in the X-Y plane: true trajectory, noisy observations, and smoothed estimate]

62 Where it breaks
Cannot be applied if the transition model is nonlinear
Extended Kalman Filter models the transition as locally linear around x_t = µ_t
Fails if the system is locally unsmooth

63 Dynamic Bayesian networks
X_t, E_t contain arbitrarily many variables in a replicated Bayes net
[Figure: two example DBNs: the umbrella network with prior P(R_0) and CPTs P(R_1 | R_0) and P(U_1 | R_1), and a robot network with state X_t, observation Z_t, Battery_t, and BMeter_t]

64 DBNs vs. HMMs
Every HMM is a single-variable DBN; every discrete DBN is an HMM
[Figure: a DBN over variables X_t, Y_t, Z_t collapsed into a single HMM megavariable]
Sparse dependencies give exponentially fewer parameters;
e.g., 20 Boolean state variables with three parents each:
the DBN has 20 × 2^3 = 160 parameters, the HMM has 2^20 × 2^20 ≈ 10^12

65 DBNs vs Kalman filters
Every Kalman filter model is a DBN, but few DBNs are KFs; the real world requires non-Gaussian posteriors

66 Summary
Temporal models use state and sensor variables replicated over time
Markov assumptions and stationarity assumption, so we need
a transition model P(X_t | X_{t-1}) and a sensor model P(E_t | X_t)
Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step
Hidden Markov models have a single discrete state variable; used for speech recognition
Kalman filters allow n state variables, linear Gaussian, O(n^3) update
Dynamic Bayes nets subsume HMMs, Kalman filters; exact update intractable

67 Outline

68 Outline Speech as probabilistic inference Speech sounds Word pronunciation Word sequences 35

69 Speech as probabilistic inference
Speech signals are noisy, variable, ambiguous
What is the most likely word sequence, given the speech signal?
I.e., choose Words to maximize P(Words | signal)
Use Bayes' rule: P(Words | signal) = α P(signal | Words) P(Words)
I.e., decomposes into acoustic model + language model
Words are the hidden state sequence, signal is the observation sequence

73 Phones
All human speech is composed from phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)
They form an intermediate level of hidden states between words and signal:
acoustic model = pronunciation model + phone model
ARPAbet designed for American English, e.g.:
[iy] beat, [ih] bit, [ey] bet, [ao] bought, [ow] boat, [er] Bert, [ix] roses, [en] button,
[b] bet, [ch] Chet, [d] debt, [hh] hat, [hv] high, [l] let, [ng] sing, [p] pet, [r] rat, [s] set, [th] thick, [dh] that, [w] wet, ...
E.g., ceiling is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

74 Word pronunciation models
Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model
[Figure: pronunciation HMM for "tomato": [t] branches to [ow] with probability 0.2 and [ah] with probability 0.8, then [m] (1.0), then [ey] or [aa] with probability 0.5 each, then [t] (1.0) and [ow] (1.0)]
P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.1
P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.4
Structure is created manually, transition probabilities learned from data
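
The path probabilities above are just products of branch probabilities along the pronunciation HMM; a tiny sketch with the branch structure read off the figure:

```python
# Branch probabilities of the "tomato" pronunciation model, read off the figure.
FIRST_VOWEL = {"ow": 0.2, "ah": 0.8}
SECOND_VOWEL = {"ey": 0.5, "aa": 0.5}

def pronunciation_prob(phones):
    """P(phone sequence | "tomato") for sequences of the form [t ? m ? t ow]."""
    return FIRST_VOWEL[phones[1]] * SECOND_VOWEL[phones[3]]

print(pronunciation_prob(["t", "ow", "m", "ey", "t", "ow"]))  # 0.1
print(pronunciation_prob(["t", "ah", "m", "aa", "t", "ow"]))  # 0.4
```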

75 Isolated words
Phone models + word models fix the likelihood P(e_{1:t} | word) for an isolated word
P(word | e_{1:t}) = α P(e_{1:t} | word) P(word)
Prior probability P(word) obtained simply by counting word frequencies
P(e_{1:t} | word) can be computed recursively: define
ℓ_{1:t} = P(X_t, e_{1:t})
and use the recursive update
ℓ_{1:t+1} = FORWARD(ℓ_{1:t}, e_{t+1})
and then P(e_{1:t} | word) = Σ_{x_t} ℓ_{1:t}(x_t)
Isolated-word dictation systems with training reach 95-99% accuracy

76 Continuous speech
Not just a sequence of isolated-word recognition problems!
Adjacent words highly correlated
Sequence of most likely words ≠ most likely sequence of words
Segmentation: there are few gaps in speech
Cross-word coarticulation, e.g., "next thing"
Continuous speech systems manage 60-80% accuracy on a good day

77 Language model
Prior probability of a word sequence is given by the chain rule:
P(w_1 ... w_n) = Π_{i=1}^{n} P(w_i | w_1 ... w_{i-1})
Bigram model:
P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-1})
Train by counting all word pairs in a large text corpus
More sophisticated models (trigrams, grammars, etc.) help a little bit
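
A minimal bigram sketch: count adjacent word pairs in a corpus, estimate P(w_i | w_{i-1}) by relative frequency, and score a sequence with the chain rule (ignoring the start probability P(w_1)); the toy corpus is illustrative only:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

prev_counts = Counter(corpus[:-1])              # counts of w_{i-1}
pair_counts = Counter(zip(corpus, corpus[1:]))  # counts of (w_{i-1}, w_i)

def p_bigram(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return pair_counts[(prev, w)] / prev_counts[prev] if prev_counts[prev] else 0.0

def sequence_prob(words):
    """Chain-rule probability of a word sequence under the bigram approximation."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(round(sequence_prob(["the", "cat", "sat"]), 3))  # P(cat|the) * P(sat|cat) = 2/3 * 1/2 ≈ 0.333
```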

78 Combined HMM
States of the combined language+word+phone model are labelled by the word we're in + the phone in that word + the phone state in that phone
Viterbi algorithm finds the most likely phone state sequence
Does segmentation by considering all possible word sequences and boundaries
Doesn't always give the most likely word sequence because each word sequence is the sum over many state sequences
Jelinek invented A* in 1969 as a way to find the most likely word sequence, where the step cost is -log P(w_i | w_{i-1})

79 Outline

80 Outline Learning agents Inductive learning Decision tree learning Measuring learning performance

81 Learning
Back to Turing's article: child mind program, education, reward & punishment
Learning is essential for unknown environments, i.e., when the designer lacks omniscience
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
Learning modifies the agent's decision mechanisms to improve performance

82 Learning agents
[Figure: general learning agent architecture: a critic compares sensor feedback against a performance standard, the learning element uses this feedback and knowledge from the performance element to make changes to it and sets learning goals for a problem generator that proposes experiments; the performance element maps percepts to actions on the environment via the effectors]

83 Learning element
Design of the learning element is dictated by
what type of performance element is used
which functional component is to be learned
how that functional component is represented
what kind of feedback is available
Example scenarios:
Performance element | Component | Representation | Feedback
Alpha-beta search | Evaluation function | Weighted linear function | Win/loss
Logical agent | Transition model | Successor-state axioms | Outcome
Utility-based agent | Transition model | Dynamic Bayes net | Outcome
Simple reflex agent | Percept-action function | Neural net | Correct action
Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards
