Structured Output Prediction: Generative Models

Structured Output Prediction: Generative Models. CS6780 Advanced Machine Learning, Spring 2015. Thorsten Joachims, Cornell University. Reading: Murphy 17.3, 17.4, 17.5.1

Structured Output Prediction. Supervised learning from examples: find a function from input space X to output space Y, h: X -> Y, such that the prediction error is low. In the typical case the output space is just a single number (classification: {-1, +1}; regression: some real number). In the general case we predict outputs that are complex objects.

Examples of Complex Output Spaces: Natural Language Parsing. Given a sequence of words x, predict the parse tree y. Dependencies arise from structural constraints, since y has to be a tree. [Figure: parse tree y with nodes S, NP, VP, Det, N, V over the sentence x = "The dog chased the cat".]

Examples of Complex Output Spaces: Multi-Label Classification. Given a (bag-of-words) document x, predict a set of labels y. Dependencies between labels come from correlations between labels (e.g. "iraq" and "oil" in a newswire corpus). Example: x = "Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with ..."; y = (antarctica: -1, benelux: -1, germany: -1, iraq: +1, oil: +1, coal: -1, trade: -1, acquisitions: -1).

Examples of Complex Output Spaces: Noun-Phrase Co-reference. Given a set of noun phrases x, predict a clustering y. Structural dependencies, since the prediction has to be an equivalence relation; correlation dependencies from interactions. Example: x = the noun phrases in "The policeman fed the cat. He did not know that he was late. The cat is called Peter."; y is a clustering of these noun phrases into co-referring sets (e.g. "the cat", "The cat" and "Peter" refer to the same entity).

Examples of Complex Output Spaces: Scene Recognition. Given a 3D point cloud with RGB from a Kinect camera, segment it into volumes. Geometric dependencies between segments (e.g. a monitor is usually close to a keyboard).

Part-of-Speech Tagging. Task: assign the correct part of speech (word class) to each word in a document: The/DT planet/NN Jupiter/NNP and/CC its/PRP moons/NNS are/VBP in/IN effect/NN a/DT mini-solar/JJ system/NN ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/NN that/IN never/RB caught/VBN fire/NN ./. POS tagging is needed as an initial processing step for a number of language technology applications: information extraction, answer extraction in QA, as a base step in identifying syntactic phrases for IR systems, and it is critical for word-sense disambiguation (WordNet apps).

Why is POS Tagging Hard? Ambiguity: "He will race/VB the car." vs. "When will the race/NN end?"; "I bank/VB at CFCU." vs. "Go to the bank/NN!". On average each word has ~2 possible parts of speech. The number of tags used by different systems also varies a lot: some systems use < 20 tags, while others use > 400.

Example The POS Learning Problem

Markov Model Definition. Set of states: s_1, ..., s_k. Start probabilities: P(S_1 = s). Transition probabilities: P(S_i = s | S_{i-1} = s'). A Markov model describes a random walk on a graph: start in state s with probability P(S_1 = s), then move to the next state with probability P(S_i = s | S_{i-1} = s'). Assumptions: limited dependence, i.e. the next state depends only on the previous state and on no other state (a first-order Markov model); stationarity, i.e. P(S_i = s | S_{i-1} = s') is the same for all i.
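To make the random-walk view concrete, here is a minimal sketch in Python (not from the slides): it samples a walk from a first-order, stationary Markov model. The start and transition tables are the POS ones from the Viterbi example at the end of this lecture; the function name sample_walk is an assumption.

import random

# Start distribution P(S_1 = s) and transition table P(S_i = s | S_{i-1} = s'),
# taken from the Viterbi example slide below.
start_prob = {"DET": 0.3, "PRP": 0.3, "N": 0.1, "PREP": 0.1, "V": 0.2}
trans_prob = {
    "DET":  {"DET": 0.01, "PRP": 0.01, "N": 0.96, "PREP": 0.01, "V": 0.01},
    "PRP":  {"DET": 0.01, "PRP": 0.01, "N": 0.01, "PREP": 0.20, "V": 0.77},
    "N":    {"DET": 0.01, "PRP": 0.20, "N": 0.30, "PREP": 0.30, "V": 0.19},
    "PREP": {"DET": 0.30, "PRP": 0.20, "N": 0.30, "PREP": 0.19, "V": 0.01},
    "V":    {"DET": 0.20, "PRP": 0.19, "N": 0.30, "PREP": 0.30, "V": 0.01},
}

def sample_walk(length):
    # Start in a state drawn from P(S_1), then repeatedly move according to
    # P(S_i | S_{i-1}): the next state depends only on the current one.
    states = list(start_prob)
    s = random.choices(states, weights=[start_prob[v] for v in states])[0]
    walk = [s]
    for _ in range(length - 1):
        s = random.choices(states, weights=[trans_prob[s][v] for v in states])[0]
        walk.append(s)
    return walk

print(sample_walk(5))  # e.g. ['PRP', 'V', 'N', 'PREP', 'DET']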

Hidden Markov Model for POS Tagging. States: think of them as the nodes of a graph, one for each POS tag, plus a special start state (and maybe an end state). Transitions: think of them as directed edges in the graph; edges have transition probabilities. Output: each state also produces a word of the sequence, so a sentence is generated by a walk through the graph.

Hidden Markov Model. States: y ∈ {s_1, ..., s_k}. Output symbols: x ∈ {o_1, ..., o_m}. Starting probability P(Y_1 = y_1): specifies where the sequence starts. Transition probability P(Y_i = y_i | Y_{i-1} = y_{i-1}): probability that one state succeeds another. Output/emission probability P(X_i = x_i | Y_i = y_i): probability that a word is generated in this state. => Every output+state sequence has a probability: P(x, y) = P(x_1, ..., x_l, y_1, ..., y_l) = P(y_1) P(x_1 | y_1) ∏_{i=2}^{l} P(x_i | y_i) P(y_i | y_{i-1}).
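As a minimal sketch (the helper name joint_prob and the dictionary-of-dictionaries table format are assumptions, not from the slides), the factorization above translates directly into code:

def joint_prob(x, y, start_p, trans_p, emit_p):
    # P(x, y) = P(y_1) P(x_1 | y_1) * prod_{i=2}^{l} P(x_i | y_i) P(y_i | y_{i-1})
    p = start_p[y[0]] * emit_p[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= emit_p[y[i]][x[i]] * trans_p[y[i - 1]][y[i]]
    return p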

Estimating the Probabilities. Fully observed data (input/output sequence pairs): P(Y_i = a | Y_{i-1} = b) = (# of times state a follows state b) / (# of times state b occurs), and P(X_i = a | Y_i = b) = (# of times output a is observed in state b) / (# of times state b occurs). Smoothing the estimates: see Naïve Bayes for text classification. Partially observed data (Y_i unknown): Expectation-Maximization (EM).
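For the fully observed case, these counting estimates amount to a few passes over the tagged corpus. The sketch below makes assumptions about representation (a corpus as a list of sentences, each a list of (word, tag) pairs) and uses add-one smoothing as a stand-in for the Naïve Bayes-style smoothing the slide refers to:

from collections import Counter

def estimate_hmm(tagged_sentences, tagset, vocab):
    start_c, trans_c, emit_c, state_c = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        start_c[tags[0]] += 1
        for w, t in sent:
            emit_c[(t, w)] += 1
            state_c[t] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans_c[(prev, cur)] += 1
    # Relative-frequency estimates with add-one smoothing.
    n_sent = len(tagged_sentences)
    start_p = {t: (start_c[t] + 1) / (n_sent + len(tagset)) for t in tagset}
    trans_p = {b: {a: (trans_c[(b, a)] + 1) / (state_c[b] + len(tagset))
                   for a in tagset} for b in tagset}
    emit_p = {b: {w: (emit_c[(b, w)] + 1) / (state_c[b] + len(vocab))
                  for w in vocab} for b in tagset}
    return start_p, trans_p, emit_p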

HMM Prediction (Decoding). Question: what is the most likely state sequence given an output sequence? ŷ = argmax_{y_1, ..., y_l} P(x_1, ..., x_l, y_1, ..., y_l) = argmax_{y_1, ..., y_l} P(y_1) P(x_1 | y_1) ∏_{i=2}^{l} P(x_i | y_i) P(y_i | y_{i-1}).
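Taken literally, this argmax ranges over all k^l possible state sequences. A brute-force sketch (for contrast only, reusing the joint_prob helper assumed above) makes that cost explicit; the trip analogy and the Viterbi algorithm below show how to avoid it:

from itertools import product

def brute_force_decode(x, states, start_p, trans_p, emit_p):
    # Enumerate every one of the k^l state sequences and keep the most probable.
    return max((list(y) for y in product(states, repeat=len(x))),
               key=lambda y: joint_prob(x, y, start_p, trans_p, emit_p))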

Going on a trip. Deal: trip to any 3 cities in Germany -> Italy -> Spain for one low, low price.

Going on a trip. Deal: trip to any 3 cities in Germany -> Italy -> Spain for one low, low price. City options: Germany: Berlin / Munich / Witten; Italy: Rome / Venice / Milan; Spain: Madrid / Barcelona / Malaga.

Going on a trip. Deal: each city i has an attractiveness score c_i ∈ [0, 10], and each flight has a comfort score f_{i,j} ∈ [0, 10]. Find the best trip!

Going on a trip. [Figure: graph of the trip from Home through one city in each of Germany (Berlin / Munich / Witten), Italy (Rome / Venice / Milan) and Spain (Madrid / Barcelona / Malaga) and back Home, annotated with city scores such as c_Be = 6 and flight comfort scores such as f_H,Be = 2, f_H,Mu = 1, f_H,Wi = 1.]

Going on a trip. [Figure: first step of the dynamic program. For each German city the best score achievable on a trip from Home to that city is recorded, e.g. best_Be = 8, best_Mu = 5, best_Wi = 3.]

Going on a trip. [Figure: second step of the dynamic program. For each Italian city the best score achievable up to that city is computed from the best German-city scores, e.g. best_Ro = 16, best_Ve = 12, best_Mi = 9.]

Viterbi Algorithm for Decoding. Efficiently compute the most likely sequence ŷ = argmax_{y_1, ..., y_l} P(y_1) P(x_1 | y_1) ∏_{i=2}^{l} P(x_i | y_i) P(y_i | y_{i-1}). Viterbi algorithm: δ_y(1) = P(Y_1 = y) P(X_1 = x_1 | Y_1 = y); δ_y(i+1) = max_{v ∈ {s_1, ..., s_k}} δ_v(i) P(Y_{i+1} = y | Y_i = v) P(X_{i+1} = x_{i+1} | Y_{i+1} = y).
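Here is a minimal sketch of this recursion (the function name and the dictionary table format are assumptions; the probability tables are expected in the form used in the example on the next slide), with backpointers so that the best sequence can be read off at the end:

def viterbi(x, states, start_p, trans_p, emit_p):
    # delta[i][y]: probability of the best state sequence that ends in state y
    # after generating x[0..i]; back[i][y] remembers the best predecessor.
    delta = [{y: start_p[y] * emit_p[y][x[0]] for y in states}]
    back = [{}]
    for i in range(1, len(x)):
        delta.append({})
        back.append({})
        for y in states:
            best_v, best_p = max(
                ((v, delta[i - 1][v] * trans_p[v][y]) for v in states),
                key=lambda item: item[1])
            delta[i][y] = best_p * emit_p[y][x[i]]
            back[i][y] = best_v
    # Backtrace from the most likely final state.
    last = max(states, key=lambda y: delta[-1][y])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))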

Viterbi Example

Emission probabilities P(X_i | Y_i):

        I      bank   at     CFCU   go     to     the
DET     0.01   0.01   0.01   0.01   0.01   0.01   0.94
PRP     0.94   0.01   0.01   0.01   0.01   0.01   0.01
N       0.01   0.40   0.01   0.40   0.16   0.01   0.01
PREP    0.01   0.01   0.48   0.01   0.01   0.47   0.01
V       0.01   0.40   0.01   0.01   0.55   0.01   0.01

Start probabilities P(Y_1):

DET 0.3   PRP 0.3   N 0.1   PREP 0.1   V 0.2

Transition probabilities P(Y_i | Y_{i-1}) (row = Y_{i-1}, column = Y_i):

        DET    PRP    N      PREP   V
DET     0.01   0.01   0.96   0.01   0.01
PRP     0.01   0.01   0.01   0.20   0.77
N       0.01   0.20   0.30   0.30   0.19
PREP    0.30   0.20   0.30   0.19   0.01
V       0.20   0.19   0.30   0.30   0.01
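Running the viterbi sketch from above on the sentence "I bank at CFCU" with these tables (only the emission columns needed for that sentence are spelled out) recovers the intended reading:

states = ["DET", "PRP", "N", "PREP", "V"]
start_p = {"DET": 0.3, "PRP": 0.3, "N": 0.1, "PREP": 0.1, "V": 0.2}
trans_p = {
    "DET":  {"DET": 0.01, "PRP": 0.01, "N": 0.96, "PREP": 0.01, "V": 0.01},
    "PRP":  {"DET": 0.01, "PRP": 0.01, "N": 0.01, "PREP": 0.20, "V": 0.77},
    "N":    {"DET": 0.01, "PRP": 0.20, "N": 0.30, "PREP": 0.30, "V": 0.19},
    "PREP": {"DET": 0.30, "PRP": 0.20, "N": 0.30, "PREP": 0.19, "V": 0.01},
    "V":    {"DET": 0.20, "PRP": 0.19, "N": 0.30, "PREP": 0.30, "V": 0.01},
}
emit_p = {
    "DET":  {"I": 0.01, "bank": 0.01, "at": 0.01, "CFCU": 0.01},
    "PRP":  {"I": 0.94, "bank": 0.01, "at": 0.01, "CFCU": 0.01},
    "N":    {"I": 0.01, "bank": 0.40, "at": 0.01, "CFCU": 0.40},
    "PREP": {"I": 0.01, "bank": 0.01, "at": 0.48, "CFCU": 0.01},
    "V":    {"I": 0.01, "bank": 0.40, "at": 0.01, "CFCU": 0.01},
}
print(viterbi("I bank at CFCU".split(), states, start_p, trans_p, emit_p))
# -> ['PRP', 'V', 'PREP', 'N'], i.e. "I/PRP bank/V at/PREP CFCU/N"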