A.I. in health informatics lecture 8 structured learning. kevin small & byron wallace


today: models for structured learning, HMMs and CRFs. structured learning is particularly useful in biomedical applications: parsing (clinical) text, genetic data, etc. we'll cover these in more detail, but we need the basics first.

unstructured learning assumptions: we're given a set of i.i.d. instances {x_1, x_2, ..., x_N} and their (univariate) labels {y_1, y_2, ..., y_N}; there is no order or sequence to the data.

unstructured learning: graphically [graphical model: each label y_1, y_2, y_3 connected only to its own instance x_1, x_2, x_3; no edges between labels]

structured learning assumptions: there is some correlation between a label y_i and the preceding labels; in other words, y is structured, i.e., y_{i+1} is affected by the previous labels y_i, y_{i-1}, ...

structured learning: consider the task of part-of-speech (POS) tagging sentences - nouns tend to follow verbs. image from: http://www.cnts.ua.ac.be/pages/mbsp

structured learning: graphically [graphical model: labels y_1 → y_2 → y_3 now connected in a chain, with each y_i also connected to its observation x_i]

probability & structured learning: p(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i-1}). intractable!

(first-order) HMM [graphical model: latent chain y_1 → y_2 → y_3, each y_i emitting its observation x_i] * usually assume states (ys) are latent, but not always (see weather example, upcoming)

(first-order) MM: p(x_1, x_2, ..., x_n) = p(x_1) ∏_{i=2}^{n} p(x_i | x_{i-1}). tractable!

markov model [state-transition diagram: three states k = 1, 2, 3 with transition probabilities A_ij between every pair of states, including self-transitions A_11, A_22, A_33] from Bishop PRML

markov model: unfolded [trellis diagram: the same three states k = 1, 2, 3 unrolled over time steps n-2, n-1, n, n+1, with transitions such as A_11 and A_33 repeated at each step] from Bishop PRML

markov model: a_ij is the probability of transitioning from state i to state j, with a_ij ≥ 0 and ∑_j a_ij = 1 (we have to go somewhere!)

let's talk about the weather. the world has three states: 1 rainy, 2 cloudy, 3 sunny (in the weather case the markov model is not hidden). our transition matrix is: [transition matrix figure] note that this system likes to stay where it is!

the weather: what's the probability of observing the weather {sunny, sunny, rainy}, given today is sunny? = 1.0 * P(S|S) * P(R|S) = 0.8 * 0.1 = 0.08

the weather: it's sunny today. what's the probability that it remains so for the next k days? = P(S|S)^k = (0.8)^k
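a minimal sketch of these two calculations in Python. the transition matrix values below are the classic ones from Rabiner's HMM tutorial and are an assumption here, since the slide's matrix isn't reproduced in this transcription; only P(S|S) = 0.8 and P(R|S) = 0.1 are confirmed by the slides above.

```python
import numpy as np

# assumed weather transition matrix (Rabiner-style values, illustrative only)
# rows = current state, columns = next state; order: [rainy, cloudy, sunny]
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
RAINY, CLOUDY, SUNNY = 0, 1, 2

# P({sunny, sunny, rainy} | today is sunny) = 1.0 * P(S|S) * P(R|S)
print(A[SUNNY, SUNNY] * A[SUNNY, RAINY])   # 0.8 * 0.1 = 0.08

# P(it stays sunny for the next k days) = P(S|S)^k
k = 5
print(A[SUNNY, SUNNY] ** k)                # 0.8^5 ≈ 0.33
```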

hidden markov models: in the weather example we only cared about transitions; the (weather) states were the observations. often we need to model data in which an observation is generated conditional on some latent, underlying state, e.g., the biased coin example or the urn-genie example. enter the HMM.

the urn example: a genie (?!?) is in the room choosing urns to draw balls from; each urn contains different proportions of the various colored balls. the actual urn the genie draws from is unobserved; we only know the sequence of draws!

hidden markov models: sufficient parameters
- N latent states and M observed symbols; symbols are observed conditioned on the current state
- A: transition probabilities (from urn to urn)
- B: symbol emission probabilities (color proportions in each urn)
- π: initial state distribution (initial urn likelihood)
λ = (A, B, π) specifies our model
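as a concrete sketch of λ = (A, B, π), here is a hypothetical two-urn, three-color instantiation in numpy; the specific numbers are made up purely for illustration and are reused by the sketches below.

```python
import numpy as np

# hypothetical urn HMM with N = 2 latent states (urns), M = 3 symbols (colors)
A = np.array([[0.7, 0.3],        # A[i, j] = P(next urn = j | current urn = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # B[i, k] = P(draw color k | urn i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # pi[i] = P(first urn = i)

model = (A, B, pi)               # lambda = (A, B, pi) specifies the HMM
```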

hidden markov models: the three problems
1. given a set of observations o = {o_1, o_2, ..., o_T} and a model λ, compute P(o | λ)
2. given o and λ, calculate the most likely latent states (usually thought of as labels) q = {q_1, q_2, ..., q_T}
3. given o, calculate λ

hidden markov models: problem 1: given a set of observations o = {o_1, o_2, ..., o_T} and a model λ, compute P(o | λ). P(o | λ) = ∑_{all q} P(o | q, λ) P(q | λ). cool, so we're done? not quite: this will require O(N^T) calculations.
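the brute-force sum over all state sequences can be written directly; this sketch (reusing the hypothetical (A, B, pi) above) makes the O(N^T) blow-up explicit, since it enumerates every possible q.

```python
from itertools import product
import numpy as np

def brute_force_likelihood(obs, A, B, pi):
    """P(o | lambda) by summing P(o | q, lambda) P(q | lambda) over all N^T paths."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):            # enumerates N^T state sequences
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

# e.g. brute_force_likelihood([0, 2, 1], A, B, pi) with the urn model above
```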

dynamic programming to the rescue! [trellis diagram: forward probabilities α computed column by column over time t = 1, ..., T]
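a sketch of the forward (α) recursion that the trellis illustrates: α_t(i) accumulates the probability of the observations up to time t ending in state i. on small examples this agrees with brute_force_likelihood above, but costs O(N^2 T) instead of O(N^T).

```python
import numpy as np

def forward(obs, A, B, pi):
    """alpha[t, i] = P(o_1..o_t, q_t = i | lambda); returns alpha and P(o | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction: O(N^2) work per step
    return alpha, alpha[-1].sum()                     # termination: sum over final states
```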

about that runtime: N states transitioning to the (N-1) other states, repeated T times, i.e., on the order of N^2 T operations. we can save a bit using the backward direction, too (see the paper); this will be referred to with a β and is analogous to α.
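and the analogous backward (β) pass mentioned above, sketched the same way (it will also be needed for EM later):

```python
import numpy as np

def backward(obs, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # base case at t = T
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # recurse right to left
    return beta
```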

hidden markov models: the three problems

hidden markov models: problem 2: given a set of observations o = {o_1, o_2, ..., o_T} and a model λ, find the most likely latent states (urns, say) q* = {q_1, q_2, ..., q_T}. a unique solution need not exist! what are we even optimizing here? the individually most likely states q_i? the joint probability of q?

hidden markov models: problem 2: optimizing for the most likely state at each individual time step can lead to impossible sequences under λ (e.g., adjacent states linked by zero-probability transitions), so let's consider the joint instead.

solving problem 2: the joint solution. we want to solve for q*: q* = argmax_q P(q, o | λ). obviously enumerating all possible q sequences is infeasible; dynamic programming to the rescue, again!

the viterbi algorithm. define δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = i, o_1, ..., o_t | λ) (the probability of the most likely sequence up to time t ending in state i). the inductive step: δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(o_{t+1}). that's it! we just need to keep track of the states we visit, i.e., from the best path so far we find the most likely transition/emission probability.
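a sketch of the viterbi recursion with the backpointer bookkeeping the slide refers to ("keep track of the states we visit"), in the same style as the forward pass above:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """q* = argmax_q P(q, o | lambda), via the delta recursion with backpointers."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))      # delta[t, j] = best path score ending in state j at t
    psi = np.zeros((T, N), int)   # psi[t, j]   = backpointer to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    q = [int(delta[-1].argmax())]                 # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t, q[-1]]))
    return list(reversed(q)), float(delta[-1].max())
```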

hidden markov models: the three problems

EM for λ: define ξ_t(i, j) = P(q_t = i, q_{t+1} = j | o, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(o | λ), the probability of being in state i at time t and state j at time t+1 (forward to state i, transition from i to j, emit the observation in state j, backward from state j).

EM for λ: the probability of being in state i at time t is γ_t(i) = ∑_j ξ_t(i, j) (we have to transition somewhere).

EM for λ: all of these estimates can be computed from the observed data! (by counting!)

EM for λ: at a given iteration we have current estimates λ = (A, B, π). E-step: calculate the likelihood of our observations o using the current estimates. M-step: re-estimate the parameters (the left-hand sides of the equations on the previous slide) using the current estimates.
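putting the previous sketches together, one Baum-Welch (EM) iteration for a single observation sequence might look roughly like this; it reuses the hedged forward() and backward() functions defined above, and γ and ξ are the quantities from the previous slides.

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM iteration: compute gamma/xi (E-step) and re-estimate (A, B, pi) (M-step)."""
    N, T = A.shape[0], len(obs)
    alpha, likelihood = forward(obs, A, B, pi)
    beta = backward(obs, A, B)

    gamma = alpha * beta / likelihood            # gamma[t, i] = P(q_t = i | o, lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :] *  # xi[t, i, j] = P(q_t = i, q_{t+1} = j | o, lambda)
          B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / likelihood

    new_pi = gamma[0]                                          # expected initial counts
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transition counts
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                                # expected emission counts
        new_B[:, k] = gamma[np.asarray(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood
```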

shortcomings of HMMs: HMMs model the joint probability of the observations (x) and the labels (y), i.e., the HMM is a generative model. but what we really care about is the conditional probability of a label sequence y given x. enter conditional markov models.

conditional models [graphical models: HMM, MEMM, and CRF over labels Y_{i-1}, Y_i, Y_{i+1} and observations X_{i-1}, X_i, X_{i+1}]. conditional models don't waste time modeling the observed data.

MEMMs (McCallum et al., 2000). MEMM: Maximum Entropy Markov Model. outputs a probability vector of transitioning from one state to all other states, given an input observation x - conditional in that we estimate the probability of the state at time t given an observation and the preceding state. [diagrams: HMM vs. MEMM dependencies between states s_{t-1}, s_t and observation o_t]

MEMMs vs. HMM. α_t(s) is the probability of emitting o and being in state s at time t. HMM: α_{t+1}(s) = ∑_{s' ∈ S} α_t(s') P(s | s') P(o_{t+1} | s). MEMM: α_{t+1}(s) = ∑_{s' ∈ S} α_t(s') P_{s'}(s | o_{t+1}).
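to make the contrast concrete, a sketch of one step of each recursion; in the MEMM version, P_{s'}(s | o) would come from a trained per-state maximum-entropy classifier, which is stubbed out here as a lookup table (an assumption for illustration only).

```python
import numpy as np

def hmm_alpha_step(alpha_prev, A, B, o_next):
    # HMM: alpha_{t+1}(s) = sum_{s'} alpha_t(s') P(s | s') P(o_{t+1} | s)
    return (alpha_prev @ A) * B[:, o_next]

def memm_alpha_step(alpha_prev, P_cond, o_next):
    # MEMM: alpha_{t+1}(s) = sum_{s'} alpha_t(s') P_{s'}(s | o_{t+1})
    # P_cond[o][s_prev, s] stands in for the output of a trained maxent model
    return alpha_prev @ P_cond[o_next]
```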

CRFs (Lafferty et al., 2001). CRF: Conditional Random Field. a single model for the joint probability of the entire sequence of labels given the observations. mitigates the label bias problem, in which states with low-entropy (almost certain) transition probability vectors effectively ignore the observation. [diagrams: HMM, MEMM, and CRF over labels Y_{i-1}, Y_i, Y_{i+1} and observations X_{i-1}, X_i, X_{i+1}]
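for reference, the standard linear-chain CRF form from Lafferty et al., with feature functions f_k and weights θ_k (the weights are written θ here only to avoid a clash with the HMM's λ above):

```latex
p(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```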