Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation: Generative vs. Discriminative Models (Reading: J+M Ch. 6)

Introduction So far we've looked at generative models: language models, Naive Bayes. But there is now much use of conditional or discriminative probabilistic models in NLP, speech, and IR (and in ML generally), because: they give high-accuracy performance, and they make it easy to incorporate lots of linguistically important features.

Joint vs. Conditional Models We have some data {(d, c)} of paired observations d and hidden classes c. Joint (generative) models place probabilities over both the observed data and the hidden stuff (they generate the observed data from the hidden stuff): P(c, d). All the classic StatNLP models are of this kind: n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars.

Joint vs. Conditional Models Discriminative (conditional) models take the data as given and put a probability over hidden structure given the data: P(c|d). Logistic regression, conditional loglinear or maximum entropy models, and conditional random fields are discriminative classifiers.

Bayes Net/Graphical Models Bayes net diagrams draw circles for random variables and lines for direct dependencies. Some variables are observed; some are hidden. Each node is a little classifier based on its incoming arcs. [Diagram: Naive Bayes (generative) has the class c generating the observations d1, d2, d3, with factors P(c) and P(d1|c); logistic regression (discriminative) has d1, d2, d3 feeding into c, with factor P(c|d1, d2, d3).]

Conditional vs. Joint Likelihood A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood. It turns out to be trivial to choose the weights: they are just relative frequencies. A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class. We seek to maximize conditional likelihood. This is harder to do, but more closely related to classification error.

Conditional models work well: Word Sense Disambiguation (Klein and Manning 2002, using Senseval-1 data). Even with exactly the same features, changing from joint to conditional estimation increases performance. The same smoothing and the same word-class features; only the numbers (parameters) change.

Objective     Training Set Accuracy   Test Set Accuracy
Joint Like.   86.8                    73.6
Cond. Like.   98.5                    76.1

Discriminative Model Features Making features from text for discriminative NLP models Christopher Manning

Features In these slides and in most maxent work, features f are elementary pieces of evidence that link aspects of what we observe, d, with a category c that we want to predict. A feature is a function with a bounded real value: f : C × D → ℝ.

Example features (with their weights):
1.8   f1(c, d) ≡ [c = LOCATION ∧ w−1 = "in" ∧ isCapitalized(w)]
−0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
Example datapoints: LOCATION (in Arcadia), LOCATION (in Québec), DRUG (taking Zantac), PERSON (saw Sue). Models will assign to each feature a weight: a positive weight votes that this configuration is likely correct; a negative weight votes that this configuration is likely incorrect.

Features In NLP uses, a feature usually specifies (1) an indicator function (a yes/no Boolean matching function of properties of the input) and (2) a particular class: f_i(c, d) ≡ [Φ(d) ∧ c = c_j]  (the value is 0 or 1). Each feature picks out a data subset and suggests a label for it.
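As an illustration (not part of the original slides), here is a minimal Python sketch of the three example features above as 0/1 indicator functions with their weights. The helper names (is_capitalized, has_accented_latin_char) and the dict representation of the context d are assumptions made for this sketch.

```python
# Minimal sketch of the indicator features f1-f3 from the example above.
# The context d is assumed to be a dict holding the current and previous word;
# this representation and the helper names are illustrative, not from the slides.

def is_capitalized(word):
    return word[:1].isupper()

def has_accented_latin_char(word):
    return any(ch in "áéíóúàèìòùâêîôûäëïöüçñ" for ch in word.lower())

def f1(c, d):  # [c = LOCATION AND w-1 = "in" AND isCapitalized(w)]
    return int(c == "LOCATION" and d["prev_word"] == "in" and is_capitalized(d["word"]))

def f2(c, d):  # [c = LOCATION AND hasAccentedLatinChar(w)]
    return int(c == "LOCATION" and has_accented_latin_char(d["word"]))

def f3(c, d):  # [c = DRUG AND w ends with "c"]
    return int(c == "DRUG" and d["word"].lower().endswith("c"))

FEATURES = [f1, f2, f3]
WEIGHTS = [1.8, -0.6, 0.3]  # the example weights from the slide
```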

Feature-Based Models The decision about a data point is based only on the features active at that point. Three examples:
Text categorization. Data: "... Stocks hit a yearly low ..."; Label: BUSINESS; Features: {..., stocks, hit, a, yearly, low, ...}.
Word-sense disambiguation. Data: "... to restructure bank:MONEY debt ..."; Label: MONEY; Features: {..., w−1 = restructure, w+1 = debt, L = 12, ...}.
POS tagging. Data: "DT JJ NN ... The previous fall ..."; Label: NN; Features: {w = fall, t−1 = JJ, w−1 = previous}.

Example: Text Categorization (Zhang and Oles 2001) Features are the presence of each word in a document, paired with the document class (they do feature selection to use reliable indicator words). Tests on the classic Reuters data set (and others): Naïve Bayes 77.0% F1; linear regression 86.0%; logistic regression 86.4%; support vector machine 86.5%. The paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods, which was not used in much early NLP/IR work.
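To make the regularization point concrete, here is a small sketch (not from the slides) of training an L2-regularized logistic regression text classifier with scikit-learn alongside a Naive Bayes baseline; the toy documents and labels are invented for illustration, and C is the inverse regularization strength.

```python
# Sketch: bag-of-words text categorization with a regularized discriminative model
# (logistic regression) next to a generative baseline (multinomial Naive Bayes).
# The toy corpus below is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

docs = ["stocks hit a yearly low",
        "bank agrees to restructure its debt",
        "quarterly earnings beat estimates",
        "team wins the championship game"]
labels = ["BUSINESS", "MONEY", "BUSINESS", "SPORTS"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                       # word-count features

nb = MultinomialNB().fit(X, labels)               # joint / generative model
maxent = LogisticRegression(C=1.0, max_iter=1000).fit(X, labels)  # C = inverse L2 strength

test = vec.transform(["stocks fall to a new low"])
print(nb.predict(test), maxent.predict(test))
```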

Other Maxent Classifier Examples You can use a maxent classifier whenever you want to assign data points to one of a number of classes: sentence boundary detection (Mikheev 2000): is a period the end of a sentence or an abbreviation?; sentiment analysis (Pang and Lee 2002): word unigrams, bigrams, POS counts, ...; PP attachment (Ratnaparkhi 1998): attach to the verb or the noun? features of the head noun, preposition, etc.; parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999; etc.).

Feature-Based Linear Classifiers How to put features into a classifier

Feature-Based Linear Classifiers Linear classifiers at classification time: a linear function from feature sets {f_i} to classes {c}. Assign a weight λ_i to each feature f_i. We consider each class c for an observed datum d. For a pair (c, d), features vote with their weights: vote(c) = Σ_i λ_i f_i(c, d). For example, with candidates PERSON (in Québec), LOCATION (in Québec), and DRUG (in Québec), choose the class c which maximizes Σ_i λ_i f_i(c, d).

Feature-Based Linear Classifiers Exponential (log-linear, maxent, logistic) models make a probabilistic model from the linear combination Σ_i λ_i f_i(c, d):

P(c | d, λ) = exp(Σ_i λ_i f_i(c, d)) / Σ_c′ exp(Σ_i λ_i f_i(c′, d))

The exponential makes the votes positive; the sum in the denominator normalizes the votes. For the "in Québec" example with the weights above:
P(LOCATION | in Québec) = e^1.8 · e^−0.6 / (e^1.8 · e^−0.6 + e^0.3 + e^0) = 0.586
P(DRUG | in Québec) = e^0.3 / (e^1.8 · e^−0.6 + e^0.3 + e^0) = 0.238
P(PERSON | in Québec) = e^0 / (e^1.8 · e^−0.6 + e^0.3 + e^0) = 0.176
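The following Python sketch (illustrative only, building on the hypothetical FEATURES and WEIGHTS lists from the earlier feature sketch) computes the votes and the normalized probabilities for the "in Québec" example; it should reproduce the 0.586 / 0.238 / 0.176 figures above.

```python
import math

# Reuses the FEATURES and WEIGHTS lists from the earlier illustrative sketch.

def vote(c, d):
    # vote(c) = sum_i lambda_i * f_i(c, d)
    return sum(w * f(c, d) for w, f in zip(WEIGHTS, FEATURES))

def maxent_probs(d, classes):
    # P(c | d, lambda) = exp(vote(c)) / sum_c' exp(vote(c'))
    scores = {c: math.exp(vote(c, d)) for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

d = {"word": "Québec", "prev_word": "in"}
print(maxent_probs(d, ["LOCATION", "DRUG", "PERSON"]))
# -> approximately {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176}
```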

Naive Bayes vs. Maxent models Generative vs. Discriminative models: The problem of overcounting evidence Christopher Manning

Text classification: Asia or Europe [Training-data figure: example documents for the two classes, Europe and Asia; the Asia documents contain "Hong Kong" repeatedly.] NB model using only the feature X1 = M. NB factors: P(A) = P(E) = ½; P(M|A) = ¼; P(M|E) = ¾. Predictions: P(A, M) = ½ · ¼ = 1/8; P(E, M) = ½ · ¾ = 3/8; so P(A|M) = ¼ and P(E|M) = ¾.

Text classification: Asia or Europe (same training data) NB model using the two features X1 = H and X2 = K. NB factors: P(A) = P(E) = ½; P(H|A) = P(K|A) = 3/8; P(H|E) = P(K|E) = 1/8. Predictions: P(A, H, K) = ½ · 3/8 · 3/8 ∝ 9; P(E, H, K) = ½ · 1/8 · 1/8 ∝ 1; so P(A|H, K) = 9/10 and P(E|H, K) = 1/10.

Text classification: Asia or Europe (same training data) NB model using all three features H, K, and M. NB factors: P(A) = P(E) = ½; P(M|A) = ¼; P(M|E) = ¾; P(H|A) = P(K|A) = 3/8; P(H|E) = P(K|E) = 1/8. Predictions: P(A, H, K, M) = ½ · 3/8 · 3/8 · ¼ ∝ 3; P(E, H, K, M) = ½ · 1/8 · 1/8 · ¾ ∝ 1; so P(A|H, K, M) = 3/4 and P(E|H, K, M) = 1/4.
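A short Python sketch (illustrative, not from the slides) that reproduces the Naive Bayes arithmetic above with exact fractions; it shows how the correlated H and K features are each multiplied in separately, dragging the posterior toward Asia even though M alone favours Europe.

```python
from fractions import Fraction as F

# Naive Bayes factors from the slide: class priors and per-feature likelihoods.
prior = {"A": F(1, 2), "E": F(1, 2)}
lik = {
    "M": {"A": F(1, 4), "E": F(3, 4)},
    "H": {"A": F(3, 8), "E": F(1, 8)},
    "K": {"A": F(3, 8), "E": F(1, 8)},
}

def nb_posterior(features):
    # P(c | features) is proportional to P(c) * prod_f P(f | c)
    joint = {c: prior[c] for c in prior}
    for f in features:
        for c in joint:
            joint[c] *= lik[f][c]
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

print(nb_posterior(["M"]))            # {'A': 1/4, 'E': 3/4}
print(nb_posterior(["H", "K"]))       # {'A': 9/10, 'E': 1/10}
print(nb_posterior(["H", "K", "M"]))  # {'A': 3/4, 'E': 1/4} -- H and K are each counted separately
```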

Naive Bayes vs. Maxent Models Naive Bayes models multi-count correlated evidence: each feature is multiplied in, even when you have multiple features telling you the same thing. Maximum entropy models (pretty much) solve this problem. This is done by weighting features so that model expectations match the observed (empirical) expectations.
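For reference (standard maxent background rather than text from the slide), the expectation-matching condition that the fitted weights satisfy for every feature f_i can be written as:

```latex
% Conditional maxent / loglinear estimation: at the optimum, for every feature f_i,
% the empirical count of the feature equals its expected count under the model.
\sum_{(d,c) \in \mathcal{T}} f_i(c, d)
  \;=\;
\sum_{d \in \mathcal{T}} \sum_{c'} P_\lambda(c' \mid d)\, f_i(c', d)
```

where \mathcal{T} denotes the training set.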