Course Overview. Summary. Performance of approximation algorithms


Course Overview

Lecture 11: Dynamic Bayesian Networks and Hidden Markov Models; Decision Trees
Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Course outline:
- Introduction: Artificial Intelligence, Intelligent Agents
- Search: Uninformed Search, Heuristic Search
- Adversarial Search: Minimax search, Alpha-beta pruning
- Knowledge representation and Reasoning: Propositional logic, First-order logic, Inference
- Uncertain knowledge and Reasoning: Probability and Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
- Learning: Decision Trees, Maximum Likelihood, EM Algorithm, Bayesian Networks, Neural Networks, Support vector machines

Performance of approximation algorithms

Absolute approximation: $|P(X \mid e) - \hat{P}(X \mid e)| \le \epsilon$
Relative approximation: $\frac{|P(X \mid e) - \hat{P}(X \mid e)|}{P(X \mid e)} \le \epsilon$
Relative $\Rightarrow$ absolute, since $0 \le P \le 1$ (and $P(X \mid e)$ may be $O(2^{-n})$)
Randomized algorithms may fail with probability at most $\delta$
Polytime approximation: $\mathrm{poly}(n, \epsilon^{-1}, \log \delta^{-1})$
Theorem (Dagum and Luby, 1993): both absolute and relative approximation, for either deterministic or randomized algorithms, are NP-hard for any $\epsilon, \delta < 0.5$
(Absolute approximation is polytime with no evidence, by Chernoff bounds)

Summary

- Exact inference by variable elimination: polytime on polytrees, NP-hard on general graphs; space cost = time cost, very sensitive to topology
- Approximate inference by Likelihood Weighting (LW) and Markov Chain Monte Carlo (MCMC):
  - PriorSampling and RejectionSampling become unusable as evidence grows
  - LW does poorly when there is lots of (late-in-the-order) evidence
  - LW and MCMC are generally insensitive to topology
  - convergence can be very slow with probabilities close to 1 or 0
  - they can handle arbitrary combinations of discrete and continuous variables
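To make the likelihood-weighting idea concrete, here is a minimal Python sketch on a toy sprinkler network. The network, its CPT values, and all function names are illustrative assumptions, not from the slides: the evidence (WetGrass = true) is clamped rather than sampled, and each sample is weighted by the evidence likelihood, so no samples are rejected.

```python
# A minimal likelihood-weighting sketch on a toy sprinkler network
# (hypothetical example, not from the slides).
import random

def p_cloudy():     return 0.5
def p_sprinkler(c): return 0.1 if c else 0.5
def p_rain(c):      return 0.8 if c else 0.2
def p_wet(s, r):    return {(True, True): 0.99, (True, False): 0.90,
                            (False, True): 0.90, (False, False): 0.01}[(s, r)]

def weighted_sample():
    """Sample non-evidence variables top-down; weight by evidence likelihood."""
    w = 1.0
    c = random.random() < p_cloudy()
    s = random.random() < p_sprinkler(c)
    r = random.random() < p_rain(c)
    w *= p_wet(s, r)            # evidence WetGrass = true is fixed, not sampled
    return r, w

def likelihood_weighting(n=100_000):
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample()
        num += w * r
        den += w
    return num / den            # estimate of P(Rain | WetGrass = true)

print(likelihood_weighting())   # ~0.70 for this toy network
```

Because the evidence here sits at the bottom of the network, every sample is fully drawn before it is weighted, which is exactly the situation where the slides warn LW degrades.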

Wumpus World

[Figure: the 4x4 wumpus grid, squares [1,1] through [4,4].]

$P_{i,j}$ = true iff $[i,j]$ contains a pit
$B_{i,j}$ = true iff $[i,j]$ is breezy
Include only $B_{1,1}$, $B_{1,2}$, $B_{2,1}$ in the probability model

Specifying the probability model

The full joint distribution is $P(P_{1,1}, \ldots, P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})$.

Apply the product rule: $P(B_{1,1}, B_{1,2}, B_{2,1} \mid P_{1,1}, \ldots, P_{4,4})\, P(P_{1,1}, \ldots, P_{4,4})$

(Do it this way to get $P(\mathit{Effect} \mid \mathit{Cause})$.)

First term: 1 if pits are adjacent to breezes, 0 otherwise.
Second term: pits are placed randomly, with probability 0.2 per square:

$P(P_{1,1}, \ldots, P_{4,4}) = \prod_{i,j = 1,1}^{4,4} P(P_{i,j}) = 0.2^n \times 0.8^{16-n}$ for $n$ pits

Observations and query

We know the following facts:
$b = \neg b_{1,1} \wedge b_{1,2} \wedge b_{2,1}$
$\mathit{known} = \neg p_{1,1} \wedge \neg p_{1,2} \wedge \neg p_{2,1}$

The query is $P(P_{1,3} \mid \mathit{known}, b)$.
Define $\mathit{Unknown}$ = the $P_{i,j}$'s other than $P_{1,3}$ and $\mathit{Known}$.

For inference by enumeration, we have

$P(P_{1,3} \mid \mathit{known}, b) = \alpha \sum_{\mathit{unknown}} P(P_{1,3}, \mathit{unknown}, \mathit{known}, b)$

This grows exponentially with the number of squares!

Using conditional independence

Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares.

[Figure: the grid partitioned into KNOWN, FRINGE, QUERY, and OTHER squares.]

Define $\mathit{Unknown} = \mathit{Fringe} \cup \mathit{Other}$
$P(b \mid P_{1,3}, \mathit{Known}, \mathit{Unknown}) = P(b \mid P_{1,3}, \mathit{Known}, \mathit{Fringe})$
Manipulate the query into a form where we can use this!

Using conditional independence contd.

$P(P_{1,3} \mid \mathit{known}, b) = \alpha \sum_{\mathit{unknown}} P(P_{1,3}, \mathit{unknown}, \mathit{known}, b)$
$= \alpha \sum_{\mathit{unknown}} P(b \mid P_{1,3}, \mathit{known}, \mathit{unknown})\, P(P_{1,3}, \mathit{known}, \mathit{unknown})$
$= \alpha \sum_{\mathit{fringe}} \sum_{\mathit{other}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe}, \mathit{other})\, P(P_{1,3}, \mathit{known}, \mathit{fringe}, \mathit{other})$
$= \alpha \sum_{\mathit{fringe}} \sum_{\mathit{other}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe})\, P(P_{1,3}, \mathit{known}, \mathit{fringe}, \mathit{other})$
$= \alpha \sum_{\mathit{fringe}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe}) \sum_{\mathit{other}} P(P_{1,3}, \mathit{known}, \mathit{fringe}, \mathit{other})$
$= \alpha \sum_{\mathit{fringe}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe}) \sum_{\mathit{other}} P(P_{1,3})\, P(\mathit{known})\, P(\mathit{fringe})\, P(\mathit{other})$
$= \alpha\, P(\mathit{known})\, P(P_{1,3}) \sum_{\mathit{fringe}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe})\, P(\mathit{fringe}) \sum_{\mathit{other}} P(\mathit{other})$
$= \alpha'\, P(P_{1,3}) \sum_{\mathit{fringe}} P(b \mid \mathit{known}, P_{1,3}, \mathit{fringe})\, P(\mathit{fringe})$

Using conditional independence contd.

[Figure: the three fringe models consistent with $P_{1,3}$ = true, with probabilities $0.2 \times 0.2 = 0.04$, $0.2 \times 0.8 = 0.16$, $0.8 \times 0.2 = 0.16$, and the two models consistent with $P_{1,3}$ = false, with probabilities $0.2 \times 0.2 = 0.04$ and $0.2 \times 0.8 = 0.16$.]

$P(P_{1,3} \mid \mathit{known}, b) = \alpha \langle 0.2(0.04 + 0.16 + 0.16),\; 0.8(0.04 + 0.16) \rangle \approx \langle 0.31, 0.69 \rangle$
$P(P_{2,2} \mid \mathit{known}, b) \approx \langle 0.86, 0.14 \rangle$
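As a concrete check of the last line, a small Python enumeration over the fringe squares [2,2] and [3,1] reproduces the posterior. The helper names and layout are mine; it assumes the slides' evidence (breeze in [1,2] and [2,1], none in [1,1]) and the 0.2 pit prior.

```python
# A small sketch reproducing the fringe computation above.
from itertools import product

P_PIT = 0.2

def consistent(p13, p22, p31):
    """Is the breeze evidence consistent with this pit configuration?
    Breeze at [1,2] needs a pit in [1,3] or [2,2];
    breeze at [2,1] needs a pit in [2,2] or [3,1]."""
    return (p13 or p22) and (p22 or p31)

posterior = {}
for p13 in (True, False):
    total = 0.0
    for p22, p31 in product((True, False), repeat=2):
        if consistent(p13, p22, p31):
            prior = (P_PIT if p22 else 1 - P_PIT) * (P_PIT if p31 else 1 - P_PIT)
            total += prior
    posterior[p13] = (P_PIT if p13 else 1 - P_PIT) * total

z = sum(posterior.values())
print({k: round(v / z, 2) for k, v in posterior.items()})
# {True: 0.31, False: 0.69}, matching the slide
```

Only the fringe is enumerated; the Other squares sum out to 1, which is what makes the computation independent of the grid size.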

Outline

- Time and uncertainty
- Inference: filtering, prediction, smoothing
- Hidden Markov models
- Kalman filters (a brief mention)
- Dynamic Bayesian networks (an even briefer mention)

Time and uncertainty

The world changes; we need to track and predict it.
Diabetes management vs vehicle diagnosis.
Basic idea: copy state and evidence variables for each time step.
$X_t$ = set of unobservable state variables at time $t$, e.g., $\mathit{BloodSugar}_t$, $\mathit{StomachContents}_t$, etc.
$E_t$ = set of observable evidence variables at time $t$, e.g., $\mathit{MeasuredBloodSugar}_t$, $\mathit{PulseRate}_t$, $\mathit{FoodEaten}_t$.
This assumes discrete time; the step size depends on the problem.
Notation: $X_{a:b} = X_a, X_{a+1}, \ldots, X_{b-1}, X_b$

Markov processes (Markov chains)

Constructing a Bayes net directly from these variables would require an unbounded number of conditional probability tables and an unbounded number of parents.
Markov assumption: $X_t$ depends on a bounded subset of $X_{0:t-1}$.
First-order Markov process: $P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-1})$
Second-order Markov process: $P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-2}, X_{t-1})$

[Figure: first-order and second-order chain structures over $X_{t-2}, X_{t-1}, X_t, X_{t+1}, X_{t+2}$.]

Sensor Markov assumption: $P(E_t \mid X_{0:t}, E_{0:t-1}) = P(E_t \mid X_t)$
Stationary process: transition model $P(X_t \mid X_{t-1})$ and sensor model $P(E_t \mid X_t)$ fixed for all $t$.

Example

The umbrella world: $\mathit{Rain}_{t-1} \to \mathit{Rain}_t \to \mathit{Rain}_{t+1}$, with $\mathit{Umbrella}_t$ observed at each step.
Transition model: $P(R_t = \mathit{true} \mid R_{t-1} = \mathit{true}) = 0.7$, $P(R_t = \mathit{true} \mid R_{t-1} = \mathit{false}) = 0.3$
Sensor model: $P(U_t = \mathit{true} \mid R_t = \mathit{true}) = 0.9$, $P(U_t = \mathit{true} \mid R_t = \mathit{false}) = 0.2$ (see the sampling sketch below)

The first-order Markov assumption is not exactly true in the real world! Possible fixes:
1. Increase the order of the Markov process
2. Augment the state, e.g., add $\mathit{Temp}_t$, $\mathit{Pressure}_t$
Example: robot motion. Augment position and velocity with $\mathit{Battery}_t$.
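A minimal Python sketch of this stationary process (the dictionary and function names are illustrative, not from the slides) samples a rain/umbrella trajectory from the transition and sensor models above:

```python
# Sampling from the umbrella world as a stationary first-order Markov
# process: transition model P(Rain_t | Rain_{t-1}), sensor model
# P(Umbrella_t | Rain_t), with the CPT values from the slide.
import random

P_RAIN_GIVEN_RAIN = {True: 0.7, False: 0.3}   # P(R_t = true | R_{t-1})
P_UMB_GIVEN_RAIN  = {True: 0.9, False: 0.2}   # P(U_t = true | R_t)

def sample_trajectory(t_max, p_rain0=0.5):
    rain = random.random() < p_rain0          # sample the prior P(Rain_0)
    states, evidence = [], []
    for _ in range(t_max):
        rain = random.random() < P_RAIN_GIVEN_RAIN[rain]
        umbrella = random.random() < P_UMB_GIVEN_RAIN[rain]
        states.append(rain)
        evidence.append(umbrella)
    return states, evidence

print(sample_trajectory(5))
```

The same two dictionaries are all the filtering and smoothing algorithms below need, which is the point of the stationarity assumption.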

Inference tasks

1. Filtering: $P(X_t \mid e_{1:t})$, the belief state, input to the decision process of a rational agent
2. Prediction: $P(X_{t+k} \mid e_{1:t})$ for $k > 0$: evaluation of possible action sequences; like filtering without the evidence
3. Smoothing: $P(X_k \mid e_{1:t})$ for $0 \le k < t$: better estimates of past states, essential for learning
4. Most likely explanation: $\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$: speech recognition, decoding with a noisy channel

Filtering

Aim: devise a recursive state estimation algorithm: $P(X_{t+1} \mid e_{1:t+1}) = f(e_{t+1}, P(X_t \mid e_{1:t}))$

$P(X_{t+1} \mid e_{1:t+1}) = P(X_{t+1} \mid e_{1:t}, e_{t+1})$
$= \alpha P(e_{t+1} \mid X_{t+1}, e_{1:t})\, P(X_{t+1} \mid e_{1:t})$
$= \alpha P(e_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid e_{1:t})$

I.e., prediction + estimation. Prediction by summing out $X_t$:

$P(X_{t+1} \mid e_{1:t+1}) = \alpha P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t, e_{1:t})\, P(x_t \mid e_{1:t})$
$= \alpha P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$

$f_{1:t+1} = \mathrm{FORWARD}(f_{1:t}, e_{t+1})$ where $f_{1:t} = P(X_t \mid e_{1:t})$
Time and space are constant (independent of $t$) by keeping track of $f$.

Filtering example

In the umbrella world, with umbrellas observed on days 1 and 2 (a code sketch follows below):
Prior: $P(\mathit{Rain}_0) = \langle 0.5, 0.5 \rangle$
Day 1: $P(\mathit{Rain}_1 \mid u_1) = \langle 0.818, 0.182 \rangle$
Day 2: prediction $P(\mathit{Rain}_2 \mid u_1) = \langle 0.627, 0.373 \rangle$, update $P(\mathit{Rain}_2 \mid u_1, u_2) = \langle 0.883, 0.117 \rangle$

Smoothing

Divide the evidence $e_{1:t}$ into $e_{1:k}, e_{k+1:t}$:

$P(X_k \mid e_{1:t}) = P(X_k \mid e_{1:k}, e_{k+1:t})$
$= \alpha P(X_k \mid e_{1:k})\, P(e_{k+1:t} \mid X_k, e_{1:k})$
$= \alpha P(X_k \mid e_{1:k})\, P(e_{k+1:t} \mid X_k)$
$= \alpha f_{1:k}\, b_{k+1:t}$

The backward message is computed by a backwards recursion:

$P(e_{k+1:t} \mid X_k) = \sum_{x_{k+1}} P(e_{k+1:t} \mid X_k, x_{k+1})\, P(x_{k+1} \mid X_k)$
$= \sum_{x_{k+1}} P(e_{k+1:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k)$
$= \sum_{x_{k+1}} P(e_{k+1} \mid x_{k+1})\, P(e_{k+2:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k)$
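A short sketch of this FORWARD update, with the umbrella CPTs hard-coded (the function name and layout are mine), reproduces the example's numbers:

```python
# Forward (filtering) recursion for the umbrella world. f is the vector
# <P(Rain_t = true | e_{1:t}), P(Rain_t = false | e_{1:t})>.
def forward(f, umbrella):
    # Prediction: sum out the previous state.
    pred_true  = 0.7 * f[0] + 0.3 * f[1]
    pred_false = 0.3 * f[0] + 0.7 * f[1]
    # Estimation: weight by the sensor likelihood and normalize.
    like = (0.9, 0.2) if umbrella else (0.1, 0.8)
    g = (like[0] * pred_true, like[1] * pred_false)
    z = g[0] + g[1]
    return (g[0] / z, g[1] / z)

f = (0.5, 0.5)                  # prior P(Rain_0)
for u in (True, True):          # umbrella observed on days 1 and 2
    f = forward(f, u)
    print(tuple(round(x, 3) for x in f))
# (0.818, 0.182) then (0.883, 0.117), as on the slide
```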

Smoothing example

In the umbrella world with $u_1, u_2$:
Forward messages: $\langle 0.818, 0.182 \rangle$ (day 1), $\langle 0.883, 0.117 \rangle$ (day 2)
Backward messages: $b_{2:2} = \langle 0.690, 0.410 \rangle$, $b_{3:2} = \langle 1.000, 1.000 \rangle$
Smoothed estimate for day 1: $\langle 0.883, 0.117 \rangle$

If we want to smooth the whole sequence:
Forward-backward algorithm: cache the forward messages along the way.
Time linear in $t$ (polytree inference), space $O(t\,|f|)$.

Most likely explanation

Most likely sequence $\ne$ sequence of most likely states (it is a property of the joint distribution)!
Most likely path to each $x_{t+1}$ = most likely path to some $x_t$, plus one more step:

$\max_{x_1 \ldots x_t} P(x_1, \ldots, x_t, X_{t+1} \mid e_{1:t+1}) = P(e_{t+1} \mid X_{t+1}) \max_{x_t} \Big( P(X_{t+1} \mid x_t) \max_{x_1 \ldots x_{t-1}} P(x_1, \ldots, x_{t-1}, x_t \mid e_{1:t}) \Big)$

Identical to filtering, except $f_{1:t}$ is replaced by

$m_{1:t} = \max_{x_1 \ldots x_{t-1}} P(x_1, \ldots, x_{t-1}, X_t \mid e_{1:t})$,

i.e., $m_{1:t}(i)$ gives the probability of the most likely path to state $i$.
The update has the sum replaced by max, giving the Viterbi algorithm:

$m_{1:t+1} = P(e_{t+1} \mid X_{t+1}) \max_{x_t} \big( P(X_{t+1} \mid x_t)\, m_{1:t} \big)$

Viterbi example

Evidence: umbrella on days 1, 2, 4, and 5, no umbrella on day 3. The message values (rain, no rain), with arrows in the state-space figure marking the most likely path to each state (see the matrix-form sketch below):

$m_{1:1} = \langle .8182, .1818 \rangle$, $m_{1:2} = \langle .5155, .0491 \rangle$, $m_{1:3} = \langle .0361, .1237 \rangle$, $m_{1:4} = \langle .0334, .0173 \rangle$, $m_{1:5} = \langle .0210, .0024 \rangle$

Hidden Markov models

$X_t$ is a single, discrete variable (usually $E_t$ is too). The domain of $X_t$ is $\{1, \ldots, S\}$.

Transition matrix $T_{ij} = P(X_t = j \mid X_{t-1} = i)$, e.g., $\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}$

Sensor matrix $O_t$ for each time step, with diagonal elements $P(e_t \mid X_t = i)$, e.g., with $U_1 = \mathit{true}$, $O_1 = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix}$

Forward and backward messages as column vectors:
$f_{1:t+1} = \alpha\, O_{t+1} T^\top f_{1:t}$
$b_{k+1:t} = T O_{k+1} b_{k+2:t}$

The forward-backward algorithm needs time $O(S^2 t)$ and space $O(St)$.
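A sketch of the Viterbi update in this matrix form, using numpy (all variable names are mine), reproduces the $m_{1:t}$ columns of the example above and then backtracks the most likely rain sequence:

```python
# Viterbi for the umbrella HMM in matrix form: state 0 = rain.
import numpy as np

T = np.array([[0.7, 0.3],            # T[i, j] = P(X_t = j | X_{t-1} = i)
              [0.3, 0.7]])
O = {True:  np.array([0.9, 0.2]),    # diagonal of O_t, stored as a vector
     False: np.array([0.1, 0.8])}

evidence = [True, True, False, True, True]

# m_{1:1} is the normalized day-1 filtering estimate, as on the slide.
m = O[evidence[0]] * np.array([0.5, 0.5])
m /= m.sum()

path = []
for e in evidence[1:]:
    cand = T * m[:, None]            # cand[i, j] = P(X_t = j | x_{t-1} = i) * m(i)
    path.append(cand.argmax(axis=0)) # best predecessor of each state
    m = O[e] * cand.max(axis=0)      # max replaces the sum of filtering
    print(np.round(m, 4))
# [0.5155 0.0491] [0.0361 0.1237] [0.0334 0.0173] [0.021  0.0024]

best = [int(m.argmax())]             # backtrack from the best final state
for back in reversed(path):
    best.append(int(back[best[-1]]))
best.reverse()
print([s == 0 for s in best])        # [True, True, False, True, True]
```

The stored argmax columns play the role of the slide's bold arrows: each entry points to the best predecessor, and following them back recovers rain on days 1, 2, 4, 5 and no rain on day 3.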

Kalman filters

Modelling systems described by a set of continuous variables, e.g., tracking a bird flying: $X_t = X, Y, Z, \dot{X}, \dot{Y}, \dot{Z}$.
Airplanes, robots, ecosystems, economies, chemical plants, planets, ...

[Figure: the chain $X_t \to X_{t+1}$, and the same chain with observations $Z_t, Z_{t+1}$.]

Gaussian prior, linear Gaussian transition model and sensor model.

Updating Gaussian distributions

Prediction step: if $P(X_t \mid e_{1:t})$ is Gaussian, then the prediction

$P(X_{t+1} \mid e_{1:t}) = \int_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})\, dx_t$

is Gaussian. If $P(X_{t+1} \mid e_{1:t})$ is Gaussian, then the updated distribution

$P(X_{t+1} \mid e_{1:t+1}) = \alpha P(e_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid e_{1:t})$

is Gaussian. Hence $P(X_t \mid e_{1:t})$ is multivariate Gaussian $N(\mu_t, \Sigma_t)$ for all $t$ (a 1-D sketch follows below).
For a general (nonlinear, non-Gaussian) process, the description of the posterior grows unboundedly as $t \to \infty$.

2-D tracking example: filtering

[Figure: 2-D filtering, observed vs. filtered trajectory in the X-Y plane.]

2-D tracking example: smoothing

[Figure: 2-D smoothing, observed vs. smoothed trajectory in the X-Y plane.]
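A minimal 1-D sketch of this predict-update cycle. The random-walk model ($x_{t+1} = x_t + N(0, q)$, $z_t = x_t + N(0, r)$) is my illustrative choice, not the slides' tracking model; it shows the Gaussian belief $N(\mu, \sigma^2)$ staying Gaussian through both steps:

```python
# One step of a 1-D Kalman filter for a random-walk model with
# transition noise variance q and sensor noise variance r.
def kalman_step(mu, var, z, q=1.0, r=1.0):
    # Predict: convolving two Gaussians just adds their variances.
    var_pred = var + q
    # Update: multiply by the Gaussian sensor likelihood, renormalize.
    k = var_pred / (var_pred + r)        # Kalman gain
    mu_new  = mu + k * (z - mu)
    var_new = (1 - k) * var_pred
    return mu_new, var_new

mu, var = 0.0, 1.0                       # Gaussian prior
for z in [0.9, 1.1, 1.0]:                # a few noisy observations
    mu, var = kalman_step(mu, var, z)
    print(round(mu, 3), round(var, 3))   # belief tightens toward the data
```

Only the two parameters $(\mu, \sigma^2)$ are carried from step to step, which is the closed-form analogue of the constant-space forward message.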

Where it breaks

The Kalman filter cannot be applied if the transition model is nonlinear.
The Extended Kalman Filter models the transition as locally linear around $x_t = \mu_t$.
It fails if the system is locally unsmooth.

Dynamic Bayesian networks

$X_t$, $E_t$ contain arbitrarily many variables in a replicated Bayes net.

[Figure: the umbrella DBN extended with extra state variables: $\mathit{Rain}_0 \to \mathit{Rain}_1 \to \mathit{Umbrella}_1$ (with $P(R_0) = 0.7$, transition $P(R_1 \mid R_0)$ = 0.7 / 0.3, sensor $P(U_1 \mid R_1)$ = 0.9 / 0.2), plus $\mathit{Battery}_0 \to \mathit{Battery}_1 \to \mathit{BMeter}_1$ and position $X_0 \to X_1 \to Z_1$.]

DBNs vs. HMMs

Every HMM is a single-variable DBN; every discrete DBN is an HMM.
Sparse dependencies $\Rightarrow$ exponentially fewer parameters; e.g., with 20 Boolean state variables, three parents each, the DBN has $20 \times 2^3 = 160$ parameters, while the equivalent HMM has $2^{20} \times 2^{20} \approx 10^{12}$ (see the sketch below).

DBNs vs Kalman filters

Every Kalman filter model is a DBN, but few DBNs are KFs; the real world requires non-Gaussian posteriors.
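As a quick check of that count, a back-of-the-envelope sketch under the slide's assumptions (Boolean variables, one CPT column per parent configuration):

```python
# Parameter count: 20 Boolean state variables with 3 parents each in a
# DBN vs. one flat HMM state variable over the full joint state space.
n, parents = 20, 3
dbn_params = n * 2 ** parents            # 20 * 8 = 160 CPT entries
hmm_params = (2 ** n) * (2 ** n)         # full 2^20 x 2^20 transition matrix
print(dbn_params, f"{hmm_params:.1e}")   # 160 1.1e+12
```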

Summary

Temporal models use state and sensor variables replicated over time.
The Markov assumptions and the stationarity assumption mean we need only:
- a transition model $P(X_t \mid X_{t-1})$
- a sensor model $P(E_t \mid X_t)$
Tasks are filtering, prediction, smoothing, and most likely sequence; all are done recursively with constant cost per time step.
Hidden Markov models have a single discrete state variable; used for speech recognition.
Kalman filters allow $n$ state variables, linear Gaussian, $O(n^3)$ update.
Dynamic Bayes nets subsume HMMs and Kalman filters; exact update is intractable.

Outline

- Speech as probabilistic inference
- Speech sounds
- Word pronunciation
- Word sequences

Speech as probabilistic inference

Speech signals are noisy, variable, ambiguous.
What is the most likely word sequence, given the speech signal?
I.e., choose $\mathit{Words}$ to maximize $P(\mathit{Words} \mid \mathit{signal})$.
Use Bayes' rule: $P(\mathit{Words} \mid \mathit{signal}) = \alpha P(\mathit{signal} \mid \mathit{Words})\, P(\mathit{Words})$
I.e., the problem decomposes into an acoustic model + a language model.
Words are the hidden state sequence; the signal is the observation sequence.

Phones

All human speech is composed from 40-50 phones, determined by the configuration of the articulators (lips, teeth, tongue, vocal cords, air flow).
Phones form an intermediate level of hidden states between words and signal: acoustic model = pronunciation model + phone model.
The ARPAbet was designed for American English:

[iy] beat    [b] bet    [p] pet
[ih] bit     [ch] Chet  [r] rat
[ey] bait    [d] debt   [s] set
[ao] bought  [hh] hat   [th] thick
[ow] boat    [hv] high  [dh] that
[er] Bert    [l] let    [w] wet
[ix] roses   [ng] sing  [en] button
...

E.g., "ceiling" is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

Word pronunciation models

Each word is described as a distribution over phone sequences.
The distribution is represented as an HMM transition model, e.g., for "tomato": [t] branches to [ow] (probability 0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] [ow]. Hence:

$P([\mathrm{towmeytow}] \mid \text{"tomato"}) = P([\mathrm{towmaatow}] \mid \text{"tomato"}) = 0.1$
$P([\mathrm{tahmeytow}] \mid \text{"tomato"}) = P([\mathrm{tahmaatow}] \mid \text{"tomato"}) = 0.4$

The structure is created manually; the transition probabilities are learned from data (a path-probability sketch follows below).

Isolated words

Phone models + word models fix the likelihood $P(e_{1:t} \mid \mathit{word})$ for an isolated word:
$P(\mathit{word} \mid e_{1:t}) = \alpha P(e_{1:t} \mid \mathit{word})\, P(\mathit{word})$
The prior probability $P(\mathit{word})$ is obtained simply by counting word frequencies.
$P(e_{1:t} \mid \mathit{word})$ can be computed recursively: define $\ell_{1:t} = P(X_t, e_{1:t})$, use the recursive update $\ell_{1:t+1} = \mathrm{FORWARD}(\ell_{1:t}, e_{t+1})$, and then $P(e_{1:t} \mid \mathit{word}) = \sum_{x_t} \ell_{1:t}(x_t)$.

Continuous speech

Not just a sequence of isolated-word recognition problems!
- Adjacent words are highly correlated
- The sequence of most likely words $\ne$ the most likely sequence of words
- Segmentation: there are few gaps in speech
- Cross-word coarticulation, e.g., "next thing"
Continuous speech systems manage 60-80% accuracy on a good day.
Isolated-word dictation systems with training reach 95-99% accuracy.
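The pronunciation-model probabilities above can be reproduced by enumerating the HMM's branch choices; in this small sketch the dictionary layout is an illustrative assumption:

```python
# Path probabilities of the "tomato" pronunciation model: two binary
# branches, [ow]/[ah] with 0.2/0.8 and [ey]/[aa] with 0.5/0.5.
from itertools import product

branches = {
    1: {'ow': 0.2, 'ah': 0.8},   # second phone
    3: {'ey': 0.5, 'aa': 0.5},   # fourth phone
}

for (v1, p1), (v2, p2) in product(branches[1].items(), branches[3].items()):
    phones = ['t', v1, 'm', v2, 't', 'ow']
    print('[%s]' % ' '.join(phones), round(p1 * p2, 2))
# [t ow m ey t ow] 0.1, [t ow m aa t ow] 0.1,
# [t ah m ey t ow] 0.4, [t ah m aa t ow] 0.4
```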

Language model

The prior probability of a word sequence is given by the chain rule:

$P(w_1 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \cdots w_{i-1})$

Bigram model: $P(w_i \mid w_1 \cdots w_{i-1}) \approx P(w_i \mid w_{i-1})$
Train by counting all word pairs in a large text corpus (a counting sketch follows below).
More sophisticated models (trigrams, grammars, etc.) help a little bit.

Combined HMM

States of the combined language + word + phone model are labelled by the word we're in + the phone in that word + the phone state in that phone.
The Viterbi algorithm finds the most likely phone state sequence.
It does segmentation by considering all possible word sequences and boundaries.
It doesn't always give the most likely word sequence, because each word sequence is the sum over many state sequences.
Jelinek invented A* search in 1969 as a way to find the most likely word sequence, where the "step cost" is $-\log P(w_i \mid w_{i-1})$.
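A minimal sketch of training by counting, on a toy corpus of my own choosing (the names are illustrative):

```python
# Bigram language model: maximum-likelihood estimates from pair counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

pair_counts = defaultdict(Counter)
for w_prev, w in zip(corpus, corpus[1:]):
    pair_counts[w_prev][w] += 1

def p_bigram(w, w_prev):
    """Estimate P(w_i | w_{i-1}) from the pair counts."""
    c = pair_counts[w_prev]
    return c[w] / sum(c.values()) if c else 0.0

print(p_bigram("cat", "the"))   # 2/3: "the" is followed by "cat" twice, "mat" once
```

Real systems smooth these counts so unseen pairs do not get probability zero, which is one way "more sophisticated models help a little bit".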

Outline

1. Learning agents
2. Inductive learning
3. Decision tree learning
4. Measuring learning performance

Learning agents

Back to Turing's article:
- child mind program
- education

Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it all down.
Learning modifies the agent's decision mechanisms to improve performance.

[Figure: learning agent architecture. A performance standard feeds the critic, which receives reward and punishment via the sensors and sends feedback to the learning element; the learning element sets learning goals for the problem generator, which proposes experiments, and makes changes to the performance element using its knowledge; the agent acts on the environment through the effectors.]

Learning element

The design of the learning element is dictated by:
- what type of performance element is used
- which functional component is to be learned
- how that functional component is represented
- what kind of feedback is available

Example scenarios:

Performance element | Component          | Representation           | Feedback
Alpha-beta search   | Evaluation fn.     | Weighted linear function | Win/loss
Logical agent       | Transition model   | Successor-state axioms   | Outcome
Utility-based agent | Transition model   | Dynamic Bayes net        | Outcome
Simple reflex agent | Percept-action fn. | Neural net               | Correct action

Supervised learning: correct answers for each instance.
Reinforcement learning: occasional rewards.