MACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING


Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: Conditional Random Fields, Structured Perceptron; Discussion

MOTIVATING STRUCTURED-OUTPUT PREDICTION FOR NLP

Types of machine learning [Andrew W. Moore]: a Classifier maps input attributes to a prediction of a categorical output or class (one of a few discrete values); a Density Estimator maps input attributes to a probability; a Regressor maps input attributes to a prediction of a real-valued output.

Outline: Some Sample NLP Tasks [Noah Smith] (what do these have in common? We're making many related predictions...); Structured Prediction for NLP; Structured Prediction Methods: Structured Perceptron, Conditional Random Fields; Deep Learning and NLP

Via begin-inside-outside token tagging: beginX inX* delimits entity X. Example tag sequence: beginGPE out out out out beginGF inGF

Via begin-inside-outside token tagging: beginX inX* delimits entity X. Features for one token: thistoken = english, thistokenshape = X+, prevtoken = the, prevprevtoken = across, nexttoken = channel, nextnexttoken = Monday, nexttokenshape = X+, prevtokenshape = +. As indicator features: thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1. Predicting a tag such as beginGF or inGF for each token: a classification task!

Via begin-inside-outside token tagging. A problem: the instances are not i.i.d., nor are the classes: inX usually happens after beginX or inX, and beginX usually happens after out. [Feature vectors for three adjacent tokens: thistoken_the = 1, thistokenshape_+ = 1, prevtoken_across = 1 (tag out); thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1 (tag beginGF); thistoken_channel = 1, thistokenshape_X+ = 1, prevtoken_english = 1, prevprevtoken_the = 1 (tag inGF).]

Structured Output Prediction: definition: a classifier where the output is structured, and the input might be structured. Example: predict a sequence of begin/in/out tags from a sequence of tokens (represented by a sequence of feature vectors).
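To make the token-classification framing concrete, here is a minimal sketch (not from the slides) of turning a sentence into per-token indicator features of the thistoken / prevtoken / nexttoken / shape kind listed above; the exact feature names and the crude shape rule are illustrative assumptions.

```python
# Minimal sketch: per-token indicator features for begin/in/out tagging,
# mirroring the thistoken / prevtoken / nexttoken / shape features above.
# Feature names and the shape rule are illustrative, not the lecture's.
import re

def token_shape(tok):
    # Crude word-shape feature: "X+" for Capitalized, "+" for lowercase.
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return "X+"
    if re.fullmatch(r"[a-z]+", tok):
        return "+"
    return "other"

def token_features(tokens, i):
    feats = {
        "thistoken_" + tokens[i].lower(): 1,
        "thistokenshape_" + token_shape(tokens[i]): 1,
    }
    if i > 0:
        feats["prevtoken_" + tokens[i - 1].lower()] = 1
        feats["prevtokenshape_" + token_shape(tokens[i - 1])] = 1
    if i > 1:
        feats["prevprevtoken_" + tokens[i - 2].lower()] = 1
    if i + 1 < len(tokens):
        feats["nexttoken_" + tokens[i + 1].lower()] = 1
    return feats

tokens = "across the English Channel on Monday".split()
print(token_features(tokens, 2))   # indicator features for "English"
```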

NER via begin-inside-outside token tagging. A problem: with K begin-inside-outside tags and N words, we have K^N possible structured labels. How can we learn to choose among so many possible outputs? Example: y = beginGPE out out out out beginGF inGF

HMMS AND NER

e.g., HMMs map a sequence of tokens x to a sequence of outputs y (the hidden states). [HMM state diagram with states out, beginGPE, inGPE, beginGF, inGF, beginPER, inPER.] Example: y = beginGPE out out out out beginGF inGF

Borkar et al.'s HMMs for segmentation. Example: addresses, bibliography records. Problem: some DBs may split records up differently (e.g., no mail-stop field, combine address and apt #, ...) or not at all. Solution: learn to segment the textual form of records. Example record with fields Author, Year, Title, Journal, Volume, Page: P. P. Wangikar, T. P. Graycar, D. A. Estell, D. S. Clark, J. S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

IE with Hidden Markov Models: must learn transition and emission probabilities. [HMM diagram with states Author, Year, Title, Journal; example transition probabilities (0.9, 0.8, 0.5, 0.2, ...) and emission probabilities (0.06, 0.03, 0.04, 0.02, 0.004, ...) over tokens such as Smith, Cohen, Jordan, dddd, dd, Learning, Convex, Comm., Trans., Chemical.]
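As a toy illustration of what "learn transition and emission probabilities" buys you, here is a sketch of scoring a (label sequence, token sequence) pair under an HMM. The state names echo the slide, but every probability value is a made-up placeholder, not read off the figure.

```python
# Toy sketch: score a label sequence under an HMM as a product of transition
# and emission probabilities.  All numbers are made-up placeholders.
transition = {("Author", "Year"): 0.9, ("Year", "Title"): 0.8,
              ("Title", "Journal"): 0.5}
emission = {("Author", "wangikar"): 0.05, ("Year", "(1993)"): 0.8,
            ("Title", "protein"): 0.03, ("Journal", "chem."): 0.02}

def hmm_joint_prob(states, tokens, start_prob=1.0, smooth=1e-6):
    """P(tokens, states) = P(s_1) P(t_1|s_1) * prod_k P(s_k|s_{k-1}) P(t_k|s_k)."""
    p = start_prob * emission.get((states[0], tokens[0]), smooth)
    for k in range(1, len(tokens)):
        p *= transition.get((states[k - 1], states[k]), smooth)
        p *= emission.get((states[k], tokens[k]), smooth)
    return p

print(hmm_joint_prob(["Author", "Year"], ["wangikar", "(1993)"]))
```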

HMMs' limitation: not well-suited to modeling a set of features for each token, e.g. thistoken = english, thistokenshape = X+, prevtoken = the, prevprevtoken = across, nexttoken = channel, nextnexttoken = Monday, nexttokenshape = X+, prevtokenshape = +; as indicator features: thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1, each token labeled beginGF, inGF, etc.: a classification task!

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word, ends in -ski, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ... [Graphical model with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; example observation "Wisniewski" with features: part of noun phrase, ends in -ski.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...

Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: HMMs, Conditional Random Fields, Structured Perceptron; Discussion

CONDITIONAL RANDOM FIELDS

What is a symbol? (Same slide as above.) Ideally we would like to use many, arbitrary, overlapping features of words; lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...

Inference for linear-chain MRFs (example sentence: "When will prof Cohen post the notes"). Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them, e.g. (y(i)==B or y(i)==I) and (token(i) is capitalized); (y(i)==I and y(i-1)==B) and (token(i) is hyphenated); (y(i)==B and y(i-1)==B), e.g. "tell Ziv William is on the way". Idea 2: construct a graph where each path is a possible sequence labeling.

Inference for a linear-chain MRF. [Trellis diagram: one column per token of "When will prof Cohen post the notes", each column containing nodes B, I, O.] Inference: find the highest-weight path. Learning: optimize the feature weights so that this highest-weight path is correct relative to the data.

Inference for a linear-chain MRF. [Trellis over "When will prof Cohen post the notes" with nodes B, I, O per token.] Features plus weights define edge weights: weight(y_i, y_{i+1}) = 2*[(y_i = B or I) and iscap(x_i)] + 1*[(y_i = B) and isfirstname(x_i)] - 5*[(y_{i+1} != B) and islower(x_i) and isupper(x_{i+1})]
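Here is a sketch of the "find the highest-weight path" inference over that trellis, using Viterbi with a stand-in edge_weight function shaped like the feature-weighted score above. The particular rules and weights are illustrative assumptions, not the lecture's.

```python
# Sketch: Viterbi search for the highest-weight B/I/O path through the trellis.
# edge_weight stands in for the feature-weighted score weight(y_i, y_{i+1})
# described above; the rules and weights are illustrative.
TAGS = ["B", "I", "O"]

def edge_weight(prev_tag, tag, prev_tok, tok):
    w = 0.0
    if tag in ("B", "I") and tok[:1].isupper():
        w += 2.0        # reward entity tags on capitalized tokens
    if tag != "B" and prev_tok is not None and prev_tok.islower() and tok[:1].isupper():
        w -= 5.0        # discourage not starting an entity at a lowercase->uppercase boundary
    return w

def viterbi(tokens):
    # best[t][y]: score of the best path ending with tag y at position t
    best = [{y: edge_weight(None, y, None, tokens[0]) for y in TAGS}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for y in TAGS:
            scores = {yp: best[t - 1][yp] + edge_weight(yp, y, tokens[t - 1], tokens[t])
                      for yp in TAGS}
            back[t][y] = max(scores, key=scores.get)
            best[t][y] = scores[back[t][y]]
    tag = max(best[-1], key=best[-1].get)       # best final tag
    path = [tag]
    for t in range(len(tokens) - 1, 0, -1):     # follow back-pointers
        tag = back[t][tag]
        path.append(tag)
    return list(reversed(path))

print(viterbi("When will prof Cohen post the notes".split()))
```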

CRF learning from Sha & Pereira

CRF learning from Sha & Pereira: something like forward-backward. Idea: define a matrix of y, y' affinities at stage i: M_i[y, y'] = unnormalized probability of a transition from y to y' at stage i; then M_i * M_{i+1} = unnormalized probability of any path through stages i and i+1.
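A small numerical sketch of that matrix view: with per-position affinity matrices M_i, the forward pass is just repeated matrix products, and summing the final vector gives the total unnormalized weight of all label paths (the partition function a forward-backward-style CRF trainer needs). The matrices below are random placeholders.

```python
# Sketch of the matrix view: M_i[y, y'] is the unnormalized affinity of moving
# from tag y to tag y' at position i.  A forward pass of matrix products sums
# the weight of every label path; the brute-force loop checks the same number.
from itertools import product
import numpy as np

K = 3                                               # number of tags, e.g. B, I, O
rng = np.random.default_rng(0)
Ms = [np.exp(rng.normal(size=(K, K))) for _ in range(5)]   # one matrix per position

start = np.ones(K)                                  # uniform unnormalized start weights
alpha = start.copy()
for M in Ms:                                        # alpha_i = alpha_{i-1} @ M_i
    alpha = alpha @ M
Z = alpha.sum()                                     # total weight of all paths

Z_brute = sum(
    start[path[0]] * np.prod([Ms[i][path[i], path[i + 1]] for i in range(len(Ms))])
    for path in product(range(K), repeat=len(Ms) + 1)
)
print(Z, Z_brute)                                   # agree up to floating-point error
```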

Sha & Pereira results: CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron

Sha & Pereira results: [timing comparison] in minutes, 375k examples

THE STRUCTURED PERCEPTRON

Review: The perceptron. [A sends an instance x_i; B returns a prediction ŷ_i; A reveals the true label y_i.] Compute: ŷ_i = sign(v_k · x_i). If mistake: v_{k+1} = v_k + y_i x_i
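A minimal runnable version of that update rule on a tiny made-up dataset, just to pin down the notation (v_k is the weight vector, x_i the instance, y_i in {-1, +1}).

```python
# Minimal sketch of the perceptron above: predict sign(v . x_i); on a mistake,
# v <- v + y_i * x_i.  The tiny dataset is made up.
import numpy as np

def perceptron(X, y, epochs=10):
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if v @ x_i >= 0 else -1
            if y_hat != y_i:              # mistake: v_{k+1} = v_k + y_i x_i
                v = v + y_i * x_i
    return v

X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```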

(3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2. (3b) The guess v_2 after the one positive and one negative example: v_2 = v_1 - x_2. [Figures: the updated guess v_2 relative to the target vector u, -u, and margins γ and 2γ.]

Mistake bound: the number of mistakes is at most R^2/γ^2, where R bounds ||x_i|| and γ is the margin.

The voted perceptron for ranking. [A sends instances x_1, x_2, x_3, x_4, ...; B computes v_k · x_i for each and returns the index b* of the best x_i; A reveals the correct index b.] If mistake: v_{k+1} = v_k + x_b - x_{b*}
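The ranking variant in code form: one round in which B scores the candidate vectors, picks b*, and applies the update if it disagrees with the revealed b. The candidate values below are arbitrary.

```python
# Sketch of one round of the ranking perceptron above: B returns the index b*
# of the highest-scoring candidate; on a mistake, v <- v + x_b - x_{b*}.
import numpy as np

def rank_and_update(v, candidates, b):
    """candidates: (n, d) array of instance vectors; b: index of the right answer."""
    b_star = int(np.argmax(candidates @ v))   # learner's top-ranked candidate
    if b_star != b:                           # mistake: v_{k+1} = v_k + x_b - x_{b*}
        v = v + candidates[b] - candidates[b_star]
    return v, b_star

v = np.zeros(3)
candidates = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [2.0, 2.0, 1.0]])
v, guess = rank_and_update(v, candidates, b=1)
print(v, guess)
```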

Ranking some x's with the target vector u. [Figure: instances ordered by their projection onto u, with margin γ; u and -u shown.]

Ranking some x's with some guess vector v, part 1. [Figure: u, -u, margin γ, and the guess vector v.]

Ranking some x's with some guess vector v, part 2. [Figure.] The purple-circled x is x_{b*}, the one the learner has chosen to rank highest. The green-circled x is x_b, the right answer.

Correcting v by adding x_b - x_{b*}. [Figure.]

Correcting v by adding x_b - x_{b*} (part 2). [Figure: v_k and the updated v_{k+1}.]

(3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2. [Figure, repeated over several slides: v_1, v_2, the target u, -u, and margins γ and 2γ.]

Notice this doesn't depend at all on the number of x's being ranked. [Same figure: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2.] Neither proof depends on the dimension of the x's.

The voted perceptron for ranking. [A sends instances x_1, x_2, x_3, x_4, ...; B computes v_k · x_i for each and returns the index b* of the best x_i; A reveals the correct index b.] If mistake: v_{k+1} = v_k + x_b - x_{b*}. Change number one is notation: replace x with g.

The voted perceptron for NER. [A sends instances g_1, g_2, g_3, g_4, ...; B computes v_k · g_i and returns the index b* of the best g_i; A reveals the correct index b; if mistake: v_{k+1} = v_k + g_b - g_{b*}.]
1. A sends B the Sha & Pereira paper, some feature functions, and instructions for creating the instances g: A sends a word vector x_i. Then B could create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), ..., but instead B just returns the y* that gives the best score for the dot product v_k · F(x_i, y*), using Viterbi.
2. A sends B the correct label sequence y_i.
3. On errors, B sets v_{k+1} = v_k + g_b - g_{b*} = v_k + F(x_i, y_i) - F(x_i, y*)

The voted perceptron for NER. [A sends instances z_1, z_2, z_3, z_4, ...; B computes v_k · z_i and returns the index b* of the best z_i; A reveals the correct index b; if mistake: v_{k+1} = v_k + z_b - z_{b*}.]
1. A sends a word vector x_i.
2. B just returns the y* that gives the best score for v_k · F(x_i, y*).
3. A sends B the correct label sequence y_i.
4. On errors, B sets v_{k+1} = v_k + z_b - z_{b*} = v_k + F(x_i, y_i) - F(x_i, y*).
So, this algorithm can also be viewed as an approximation to the CRF learning algorithm, where we're using a Viterbi approximation to the expectations and stochastic gradient descent to optimize the likelihood.
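Putting the pieces together, here is a sketch of the structured perceptron loop just described: a toy global feature map F(x, y) of emission and transition indicators, an argmax over tag sequences (brute force here, standing in for Viterbi), and the update v += F(x, y_i) - F(x, y*). The feature templates and the tiny dataset are illustrative assumptions.

```python
# Sketch of the structured perceptron loop above: y* = argmax_y v . F(x, y),
# and on errors v <- v + F(x, y_i) - F(x, y*).  The feature map F and the tiny
# dataset are toys; the brute-force argmax stands in for Viterbi.
from itertools import product
from collections import Counter

TAGS = ["B", "I", "O"]

def F(tokens, tags):
    feats = Counter()
    prev = "<s>"
    for tok, tag in zip(tokens, tags):
        feats["emit_%s_%s" % (tag, tok.lower())] += 1
        feats["emit_%s_isupper_%s" % (tag, tok[:1].isupper())] += 1
        feats["trans_%s_%s" % (prev, tag)] += 1
        prev = tag
    return feats

def score(v, feats):
    return sum(v.get(f, 0.0) * c for f, c in feats.items())

def argmax_y(v, tokens):
    # Brute force over all tag sequences; a real implementation would use Viterbi.
    return max(product(TAGS, repeat=len(tokens)),
               key=lambda tags: score(v, F(tokens, tags)))

def structured_perceptron(data, epochs=5):
    v = {}
    for _ in range(epochs):
        for tokens, gold in data:
            pred = argmax_y(v, tokens)
            if list(pred) != list(gold):             # v <- v + F(x, y_i) - F(x, y*)
                for f, c in F(tokens, gold).items():
                    v[f] = v.get(f, 0.0) + c
                for f, c in F(tokens, pred).items():
                    v[f] = v.get(f, 0.0) - c
    return v

data = [("prof Cohen posted notes".split(), ["O", "B", "O", "O"])]
v = structured_perceptron(data)
print(argmax_y(v, "prof Cohen posted notes".split()))
```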

And back to the paper... Collins, EMNLP 2002, Best Paper.

Collins' experiments: POS tagging (with MXPOST features); NP chunking (words and POS tags from Brill's tagger as features, with BIO output tags). Compared MaxEnt tagging / MEMMs (trained with iterative scaling) against voted-perceptron-trained HMMs, with and without averaging, and with and without feature selection (count > 5).

Collins results

Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: Conditional Random Fields, Structured Perceptron; Discussion

Discussion. Structured prediction is just one part of NLP: there is lots of work on learning and NLP (it would be much easier to summarize non-learning-related NLP). CRFs, HMMs, and structured perceptrons are just one part of structured output prediction (see Chris Dyer's course). Hot topic now: representation learning and deep learning for NLP (NAACL 2013: Deep Learning for NLP, Manning).