CS 7180: Behavioral Modeling and Decision-making in AI
Learning Probabilistic Graphical Models
Prof. Amy Sliva
October 31, 2012

Hidden Markov model
A stochastic system represented by three matrices:
- N = number of states; Q = {q_1, ..., q_T} is the environmental context, the sequence of states from times 1 to T
- M = number of observations; O = {o_1, ..., o_T} is the sequence of evidence the agent observes at times 1 to T
- A = transition model, the state transition probability matrix: a_ij = P(q_{t+1} = j | q_t = i)
- B = observation model, the probability distribution over observations (the probability of seeing observation o in state q): b_j(k) = P(o_t = k | q_t = j)
- π = prior state probabilities, the probability distribution that state q is the start state
The full HMM is a triple λ = (A, B, π).
Assumes a first-order Markov transition model and stationary transition and observation models.

Graphical representation of HMMs
[Figure: a dynamic Bayesian network with hidden states Rain_{t-1}, Rain_t, Rain_{t+1} and observed evidence Umbrella_t, Umbrella_{t+1}. CPTs: P(R_t | R_{t-1}) = 0.7 if R_{t-1} = t, 0.3 if R_{t-1} = f; P(U_t | R_t) = 0.9 if R_t = t, 0.2 if R_t = f.]
- Assume each state is represented by a single random variable
- Same as Bayesian networks
- If states have more than one variable, use a state-space representation: a megavariable for the state, equal to the tuple of all values of the individual variables

Matrix representation of an HMM
An HMM is a triple λ = (A, B, π):
- Prior vector π (n × 1): entry i is π_i = P(q_i)
- Transition matrix A (n × n), rows and columns indexed by states q_1, ..., q_n: entry (i, j) is a_ij = P(q_j | q_i)
- Observation matrix B (m × n), rows indexed by observations o_1, ..., o_m and columns by states q_1, ..., q_n: entry (k, j) is b_j(k) = P(o_k | q_j)
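
As a concrete illustration (not from the slides), here is a minimal Python/NumPy sketch of the triple λ = (A, B, π), filled in with the rain/umbrella CPT values from the graphical-representation slide; the uniform prior is an assumption made for the example.

```python
import numpy as np

states = ["rain", "no_rain"]         # hidden states q
observations = ["umbrella", "none"]  # observation symbols o

# Transition model A: a_ij = P(q_{t+1} = j | q_t = i)
A = np.array([[0.7, 0.3],   # from rain:    P(rain next) = 0.7
              [0.3, 0.7]])  # from no_rain: P(rain next) = 0.3

# Observation model B: b_j(k) = P(o_t = k | q_t = j)
B = np.array([[0.9, 0.1],   # in rain:    P(umbrella) = 0.9
              [0.2, 0.8]])  # in no_rain: P(umbrella) = 0.2

# Prior π over start states (assumed uniform for this example)
pi = np.array([0.5, 0.5])
```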

Three basic HMM problems
- Evaluation: given an observation sequence O = {o_1, ..., o_T} and an HMM λ = (A, B, π), how do we compute the probability of the observations given the model, P(O | λ)? Forward and backward algorithms.
- Decoding: given an observation sequence O = {o_1, ..., o_T} and an HMM λ = (A, B, π), how do we find the state sequence Q = {q_1, ..., q_T} that best explains the observations, argmax_Q P(Q | O, λ)? Viterbi algorithm.
- Learning: how do we adjust the model parameters λ = (A, B, π) to best fit the observation sequence, argmax_λ P(O | λ)?
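
For the evaluation problem, here is a minimal, unscaled sketch of the forward algorithm (an illustration in Python, not the course's code); it reuses the A, B, pi arrays above. Without rescaling, α underflows on long sequences.

```python
def forward(A, B, pi, obs):
    """Compute forward probabilities and P(O | λ) for λ = (A, B, π).
    obs is a sequence of observation indices into the columns of B."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]               # α_1(i) = π_i b_i(o_1)
    for t in range(1, T):
        # α_t(j) = [Σ_i α_{t-1}(i) a_ij] b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()              # P(O | λ) = Σ_i α_T(i)

# e.g., P(umbrella, umbrella, none | λ):
alpha, likelihood = forward(A, B, pi, [0, 0, 1])
```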

Learning the parameters of an HMM
Previously we assumed an underlying model λ = (A, B, π). Where do these parameters come from?
- Estimate the transition, observation, and prior probabilities from annotated training data
Drawbacks of this parameter estimation:
- Annotation is difficult and/or expensive
- The training data may differ from the current data
Instead, we want to maximize the parameters with respect to the current data, i.e., we are looking for the model λ' such that λ' = argmax_λ P(O | λ). We want to train the HMM to encode the observation sequence O so that a similar sequence will be identified in the future.

How can we find the maximal model?
Unfortunately, there is no known way to analytically find a global maximum, i.e., the model λ' such that λ' = argmax_λ P(O | λ).
But it is possible to find a local maximum: given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ).
This process is called parameter re-estimation.

Parameter re-estimation in HMMs
Three parameters need to be re-estimated:
- Initial state distribution: π_i
- Transition probabilities: a_ij
- Emission (observation) probabilities: b_i(o_t)
Use the forward-backward (or Baum-Welch) algorithm, a hill-climbing algorithm:
- Choose an arbitrary initial parameter instantiation
- The forward-backward algorithm iteratively re-estimates the parameters
- Each iteration improves the probability that the given observations are generated by the new parameters

General Baum-Welch algorithm
1. Initialize: λ_0
2. Compute the new model λ, using λ_0 and the observed sequence O
3. Set λ_0 ← λ
4. Repeat steps 2 and 3 until: log P(O | λ) − log P(O | λ_0) < d
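
In code, the outer loop might look like the following sketch (an illustration, not the course's implementation). It assumes the forward() sketch above and a reestimate() function implementing the update rules derived on the next slides; a matching sketch of reestimate() appears after the "Updating the model" slide below.

```python
def baum_welch(A, B, pi, obs, d=1e-4, max_iters=100):
    """Re-estimate λ = (A, B, π) until log P(O | λ) improves by less than d."""
    _, likelihood = forward(A, B, pi, obs)          # step 1: initial model λ_0
    for _ in range(max_iters):
        A, B, pi = reestimate(A, B, pi, obs)        # step 2: new λ; step 3: λ_0 ← λ
        _, new_likelihood = forward(A, B, pi, obs)
        if np.log(new_likelihood) - np.log(likelihood) < d:   # step 4: converged
            break
        likelihood = new_likelihood
    return A, B, pi
```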

Re-estimating the transition probabilities
What is the probability of being in state i at time t and going to state j, given the current model and parameters?
ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ)
Use the forward and backward probabilities to compute the re-estimated transition probability.

Re-estimating the transition probabilities
Estimated transition probability:
ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ)
         = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
Reading the numerator:
- α_t(i) a_ij is the forward probability of seeing the observations so far, being in state i, and transitioning to state j
- b_j(o_{t+1}) is the probability of seeing observation o_{t+1} in state j, and β_{t+1}(j) is the backward probability of seeing the rest of the sequence from j to the end
The denominator sums, over all states i and j, the probability of seeing the observations so far, transitioning to j, and seeing the rest of the observations to the end.
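
A sketch of ξ_t(i,j) in code (again illustrative; it reuses forward() from above and adds a matching backward pass):

```python
def backward(A, B, obs):
    """Backward probabilities β_t(i) = P(o_{t+1}, ..., o_T | q_t = i, λ)."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                         # β_T(i) = 1
    for t in range(T - 2, -1, -1):
        # β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def transition_posteriors(A, B, pi, obs):
    """ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ), per the formula above."""
    alpha, _ = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    T, N = len(obs), A.shape[0]
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator: α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()                    # denominator: sum over all i, j
    return xi
```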

Estimating each transition probability
Intuition behind the re-estimation of transition probabilities:
â_ij = (expected number of transitions from state i to j) / (expected number of transitions from state i)
Formally:
â_ij = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} Σ_{j'=1}^N ξ_t(i,j')

Estimating each transition probability
Define the state probability, the probability of being in state i given the complete observation sequence O:
γ_t(i) = Σ_{j=1}^N ξ_t(i,j)
This sums, over all possible transitions, the forward probability of i and the backward probabilities.
It simplifies the transition re-estimation:
â_ij = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)

Review of probabilities
- Forward probability α_t(i): probability of being in state i, given the partial observation o_1, ..., o_t
- Backward probability β_t(i): probability of being in state i, given the partial observation o_{t+1}, ..., o_T
- Transition probability ξ_t(i,j): probability of going from state i to state j, given the complete observation o_1, ..., o_T
- State probability γ_t(i): probability of being in state i, given the complete observation o_1, ..., o_T

Re-estimating the observation probabilities
The observation/emission probabilities are re-estimated as:
b̂_i(k) = (expected number of times in state i where the observation is k) / (expected number of times in state i)
Formally:
b̂_i(k) = Σ_{t=1}^T δ(o_t, k) γ_t(i) / Σ_{t=1}^T γ_t(i)
where δ(o_t, k) = 1 iff o_t = k, and 0 otherwise.
Note: δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!

Re-estimating the initial state probabilities
The initial state distribution π_i is the probability that i is the start state.
Re-estimation is easy and intuitive:
π̂_i = expected number of times in state i at time 1
Formally: π̂_i = γ_1(i)

Updating the model
Coming from λ = (A, B, π), we get to λ' = (A', B', π') using the following update rules:
- Transitions: â_ij = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)
- Observations: b̂_i(k) = Σ_{t=1}^T δ(o_t, k) γ_t(i) / Σ_{t=1}^T γ_t(i)
- Priors: π̂_i = γ_1(i)
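
Putting the three rules together gives the reestimate() step used by the Baum-Welch loop sketched earlier (again an illustration; it assumes the forward, backward, and transition_posteriors sketches above):

```python
def reestimate(A, B, pi, obs):
    """One Baum-Welch update λ → λ' via the three rules on this slide."""
    obs = np.asarray(obs)
    xi = transition_posteriors(A, B, pi, obs)      # shape (T-1, N, N)
    alpha, likelihood = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    # γ_t(i) = Σ_j ξ_t(i,j) for t < T; the last row is α_T(i) β_T(i) / P(O | λ)
    gamma = np.vstack([xi.sum(axis=2),
                       (alpha[-1] * beta[-1] / likelihood)[None, :]])
    # Transitions: â_ij = Σ_t ξ_t(i,j) / Σ_t γ_t(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # Observations: b̂_i(k) = Σ_t δ(o_t, k) γ_t(i) / Σ_t γ_t(i)
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    # Priors: π̂_i = γ_1(i)
    return A_new, B_new, gamma[0]
```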

Expectation maximization
Baum-Welch is an example of expectation maximization (EM). Expectation maximization has two steps:
1. E step: use the current λ to estimate the state sequence
   - Compute the expected value of the state at time t: γ_t(i)
   - Compute the expected value of state transitions from i to j: ξ_t(i,j)
   - Both are computed using the forward and backward probabilities
2. M step: compute the new parameters λ' as expected values, given the state sequence estimated in the E step
   - Update â_ij, b̂_i(k), and π̂_i using the rules above
Maximum likelihood (ML) estimation maximizes P(O | λ').

What about the structure of the model?
Baum-Welch assumes we know the structure of the HMM:
- Hidden states
- Observations
- Conditional independence relationships
How can we learn the structure?
- Fully observable domains: find the maximum likelihood model for the data
- Partially observable domains: use EM
This is applicable to all probabilistic graphical models.

Structure learning of graphical models
Search through the space of possible network structures! (For now, assume we observe all variables.)
- For each structure, learn the parameters using EM
- Pick the one that fits the observed data best
Caveat: won't we end up fully connected? Add a penalty for model complexity.
Problems? There is an exponential number of networks, and we need to learn parameters for each! Exhaustive search is out of the question. So now what?

Learn structure with local search
- Start with some network structure λ
- Make a change to the structure, i.e., add, delete, or reverse an edge
- See if the new network is any better, i.e., is P(O | λ') ≥ P(O | λ)? (A sketch of this loop follows below.)
This is the same principle as using EM for parameter learning.
What should the initial state be?
- A uniform distribution over initial networks
- A network that reflects domain expertise
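
A minimal hill-climbing sketch of this loop (illustrative only; variables, score, and is_acyclic are hypothetical helpers standing in for the domain's variable list, a penalized-likelihood scoring function, and an acyclicity check):

```python
import random

def local_search_structure(edges, variables, data, score, steps=1000):
    """Propose single-edge changes; keep a candidate when its score is no worse."""
    current = set(edges)
    best = score(current, data)                  # placeholder scoring function
    for _ in range(steps):
        u, v = random.sample(variables, 2)
        candidate = set(current)
        if (u, v) in candidate:
            candidate.remove((u, v))             # delete ...
            if random.random() < 0.5:
                candidate.add((v, u))            # ... or reverse the edge
        else:
            candidate.add((u, v))                # add a new edge
        if is_acyclic(candidate):                # hypothetical helper
            s = score(candidate, data)
            if s >= best:                        # accept if no worse
                current, best = candidate, s
    return current
```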

Process of learning network structure
[Figure: a prior network over variables X_1, ..., X_9 is combined with data to produce improved network(s).]

Unknown structure, full observability
Local search over structures:
- Search space: possible states and connections
- Scoring: maximum likelihood, but penalize complex models
Maximum likelihood computation uses Bayes' rule:
P(G | D) = P(D | G) P(G) / P(D)
Give higher priors to simple models. For parameters Θ (i.e., CPTs in Bayes nets, λ in HMMs):
P(D | G) = ∫_Θ P(D | G, Θ) P(Θ | G) dΘ
Conditional independence and the Markov property allow decomposition over the parameters:
P(D | G) = Π_i ∫_{Θ_i} P(X_i | Parents(X_i), Θ_i) P(Θ_i) dΘ_i
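
One common concrete instance of "maximum likelihood with a penalty for complex models" is the BIC approximation (my example; it is not named on the slides): log P(D | G) ≈ log P(D | G, Θ̂) − (d/2) log N, which also decomposes per family, mirroring the product over X_i above.

```python
import numpy as np

def bic_score(loglik, num_params, num_records):
    """BIC-penalized structure score: higher is better."""
    return loglik - 0.5 * num_params * np.log(num_records)
```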

Learning structure with hidden variables
[Figure: a Bayesian network in which Smoking, Diet, and Exercise are parents of Heart Disease, which in turn is the parent of Symptom 1, Symptom 2, and Symptom 3.]
The disease variable is hidden. Can we just learn without it?

End up with a fully connected network
Sure! But that is not the best network for our domain.
[Figure: with Heart Disease removed, Smoking, Diet, and Exercise connect directly to Symptom 1, Symptom 2, and Symptom 3.]
With 708 parameters? Much harder to learn.

Chicken-or-the-egg problem
- If we had fully observable data (i.e., knew whether a training instance (patient) had the disease), it would be easy to learn P(symptom | disease). But we can't observe the disease, so we don't have this information.
- If we knew the parameters (i.e., P(symptom | disease)), it would be easy to estimate whether a patient had the disease. But we don't know these parameters!
[Figure: the same Smoking/Diet/Exercise → Heart Disease → Symptoms network.]

Use EM to learn the parameters
Assume we DO know the parameters, and initialize them randomly.
- E step: compute the probability of each instance (in the data) having each possible value of the hidden variable; approximate the likelihood by computing expected values
- M step: treating each instance as fractionally having both values, compute the new parameter values
Iterate until convergence!
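
A compact sketch of this loop for the hidden-disease picture, specialized to a naive Bayes model with a binary hidden variable (a modeling assumption of mine, not the course's exact formulation):

```python
import numpy as np

def em_hidden_disease(X, iters=50, seed=0):
    """EM with a binary hidden variable D and per-symptom CPTs P(symptom_j | D).
    X is an (n_patients, n_symptoms) array of 0/1 symptom indicators."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    p_d = 0.5                                    # P(D = 1), arbitrary start
    theta = rng.uniform(0.2, 0.8, size=(2, m))   # theta[d, j] = P(X_j = 1 | D = d)
    for _ in range(iters):
        # E step: posterior P(D = 1 | symptoms) for each patient
        lik1 = p_d * np.prod(theta[1] ** X * (1 - theta[1]) ** (1 - X), axis=1)
        lik0 = (1 - p_d) * np.prod(theta[0] ** X * (1 - theta[0]) ** (1 - X), axis=1)
        w = lik1 / (lik1 + lik0)                 # fractional "has disease" weights
        # M step: re-estimate parameters from the fractional counts
        p_d = w.mean()
        theta[1] = (w @ X) / w.sum()
        theta[0] = ((1 - w) @ X) / (1 - w).sum()
    return p_d, theta
```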