
Phrase Finding; Stream-and-Sort vs Request-and-Answer William W. Cohen

Announcement Two-day extension for all on HW1B: now due Thursday 1/29

Correction

Using Large-vocabulary Naïve Bayes

Train:
- For each example id, y, x1,...,xd in train: output the event-counter update messages
- Sort the event-counter update messages
- Scan and add the sorted messages and output the final counter values
  (Model size: O(|V|))

Test:
- Initialize a HashSet NEEDED and a hashtable C
- For each example id, y, x1,...,xd in test: add x1,...,xd to NEEDED
  (Time: O(n2), n2 = size of the test set; Memory: same)
- For each event, C(event), in the summed counters: [for the assignment] if the event involves a NEEDED term x, read it into C
  (Time: O(n2); Memory: same)
- For each example id, y, x1,...,xd in test: for each y' in dom(Y), compute log Pr(y', x1,...,xd) = ...
  (Time: O(n2); Memory: same)

Review of NB algorithms

  HW    Train events    Test events              Suggested data
  1A    HashMap         HashMap                  RCV1
  1B    Msgs → Disk     HashMap (for subset)     ---
  ---   Msgs → Disk     Msgs on Disk (coming)    Wikipedia

Outline Even more on stream-and-sort and naïve Bayes Another problem: meaningful phrase finding Implementing phrase finding efficiently Some other phrase-related problems

Last Week
- How to implement Naïve Bayes
  - Time is linear in the size of the data (one scan!)
  - We need to count, e.g., C(X=word ^ Y=label)
- How to implement Naïve Bayes with a large vocabulary and small memory
  - General technique: stream and sort
    - Very little seeking (only in the merges of merge sort)
    - Memory-efficient

Flaw: Large-vocabulary Naïve Bayes is Expensive to Use

Train:
- For each example id, y, x1,...,xd in train: output the event-counter update messages
- Sort the event-counter update messages
- Scan and add the sorted messages and output the final counter values

Model size: max O(n), O(|V| |dom(Y)|)

Test:
- For each example id, y, x1,...,xd in test:
  - For each y' in dom(Y), compute

    log Pr(y', x1,...,xd) = Σj log [ (C(X=xj ^ Y=y') + m qx) / (C(X=ANY ^ Y=y') + m) ] + log [ (C(Y=y') + m qy) / (C(Y=ANY) + m) ]

The workaround I suggested

Train:
- For each example id, y, x1,...,xd in train: output the event-counter update messages
- Sort the event-counter update messages
- Scan and add the sorted messages and output the final counter values

Test:
- Initialize a HashSet NEEDED and a hashtable C
- For each example id, y, x1,...,xd in test: add x1,...,xd to NEEDED
- For each event, C(event), in the summed counters: if the event involves a NEEDED term x, read it into C
- For each example id, y, x1,...,xd in test: for each y' in dom(Y), compute log Pr(y', x1,...,xd) = ...
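To make the two test-side passes concrete, here is a minimal Python sketch (an illustration only, not the course's Java implementation; the file formats and the exact event-string layout are assumptions):

```python
# A minimal sketch of the workaround. Assumed formats: summed counters as
# "event<TAB>count" lines, test examples as "id<TAB>label<TAB>w1 w2 ..."
# lines; events look like "X=word^Y=label", "X=ANY^Y=label", or "Y=label".

def load_needed_counters(test_path, counter_path):
    # Pass 1: collect the test-set vocabulary (the NEEDED set).
    needed, tests = set(), []
    with open(test_path) as f:
        for line in f:
            doc_id, label, text = line.rstrip("\n").split("\t")
            words = text.split()
            tests.append((doc_id, label, words))
            needed.update(words)

    # Pass 2: scan the summed counters, keeping only NEEDED events
    # (plus the Y=... and X=ANY^... totals used in the denominators).
    C = {}
    with open(counter_path) as f:
        for line in f:
            event, count = line.rstrip("\n").split("\t")
            if event.startswith("Y="):
                C[event] = int(count)
            else:
                word = event.split("^")[0][len("X="):]
                if word in needed or word == "ANY":
                    C[event] = int(count)
    return tests, C
```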

Can we do better?

Can we do better?

Test data:
  id1  w1,1 w1,2 w1,3 ... w1,k1
  id2  w2,1 w2,2 w2,3 ...
  id3  w3,1 w3,2 ...
  id4  w4,1 w4,2 ...
  id5  w5,1 w5,2 ...

Event counts:
  X=w1 ^ Y=sports      5245
  X=w1 ^ Y=worldNews   1054
  X=...                2120
  X=w2 ^ Y=...         37
  X=...                3

What we'd like: for each test example, the counters it needs:
  id1: C[X=w1,1 ^ Y=sports]=5245, C[X=w1,1 ^ Y=...], C[X=w1,2 ^ ...], ...
  id2: C[X=w2,1 ^ Y=...]=1054, ..., C[X=w2,k2 ^ ...]
  id3: C[X=w3,1 ^ Y=...]= ...

Can we do better?

Step 1: group the counters by word w.

How: stream and sort. For each counter C[X=w ^ Y=y]=n in the event counts, print the message "w  C[Y=y]=n"; sort, and build a list of the values associated with each key w, like an inverted index:

  w          Counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  ...
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464
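A small Python sketch of this grouping step (the event-string format is an assumption, and an in-memory sorted list stands in for the external unix sort):

```python
# Sketch of Step 1: turn event counts like ("X=aardvark^Y=sports", 2) into
# one record per word listing all of its counters (the inverted-index-like
# table above). An in-memory sort stands in for unix sort.
from itertools import groupby

def group_counters_by_word(event_counts):
    # event_counts: iterable of (event, count), event = "X=word^Y=label"
    messages = []
    for event, n in event_counts:
        word_part, label_part = event.split("^")        # "X=word", "Y=label"
        word = word_part[len("X="):]
        messages.append((word, "C[w^%s]=%d" % (label_part, n)))

    messages.sort()                                     # the "sort" step
    records = {}
    for word, group in groupby(messages, key=lambda m: m[0]):
        records[word] = ",".join(val for _, val in group)
    return records

print(group_counters_by_word([("X=aardvark^Y=sports", 2),
                              ("X=agent^Y=sports", 1027),
                              ("X=agent^Y=worldNews", 564)]))
# {'aardvark': 'C[w^Y=sports]=2',
#  'agent': 'C[w^Y=sports]=1027,C[w^Y=worldNews]=564'}
```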

If these records were in a key-value DB, we would know what to do...

Test data:
  id1  w1,1 w1,2 w1,3 ... w1,k1
  id2  w2,1 w2,2 w2,3 ...
  ...

Record of all event counts for each word:
  w          Counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=world...]
  ...
  zynga      C[w^Y=sports]=21, C[w^Y=worldN...]

Step 2: stream through the test data, and for each test case (idi, wi,1 wi,2 wi,3 ... wi,ki) the classification logic requests the event counters needed to classify idi from the event-count DB, then classifies using the answers.

Is there a stream-and-sort analog of this request-and-answer pattern?

(Same picture as above: the test data, the per-word record of event counts, and classification logic that requests the counters it needs for each test case idi and classifies using the answers.)

Recall: stream-and-sort counting: sort the messages so the recipient can stream through them.

  examples (example 1, example 2, example 3, ...)
    → counting logic ("C[x] += D"), emitting update messages (C[x1] += D1, C[x1] += D2, ...)
    → sort
    → logic to combine counter updates

(the counting, sorting, and combining stages can run on different machines: A, B, C)
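A toy Python sketch of the sort-then-stream counting idea (the message format is an assumption; sorted() stands in for unix sort):

```python
# Toy sketch of stream-and-sort counting: counter-update messages are sorted
# so that all updates to the same counter are adjacent, then a single
# streaming pass sums them. sorted() stands in for the external unix sort.
from itertools import groupby

def sum_reduce(messages):
    # messages: iterable of (counter_name, delta)
    for key, group in groupby(sorted(messages), key=lambda m: m[0]):
        yield key, sum(delta for _, delta in group)

updates = [("C[x1]", 1), ("C[x2]", 1), ("C[x1]", 1)]
print(list(sum_reduce(updates)))   # [('C[x1]', 2), ('C[x2]', 1)]
```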

Is there a stream-and-sort analog of this request-and-answer pattern?

Instead of looking counters up in a DB, the classification logic emits request messages keyed by word:
  w1,1  counters to id1
  w1,2  counters to id2
  ...
  wi,j  counters to idi

Is there a stream-and-sort analog of this request-and-answer pattern?

Test data: id1 "found an aardvark in zynga's farmville today!", id2 ..., id3 ..., id4 ..., id5 ...

Record of all event counts for each word:
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=world...]
  ...
  zynga      C[w^Y=sports]=21, C[w^Y=worldN...]

The classification logic emits requests: "found ctrs to id1", "aardvark ctrs to id1", ..., "today ctrs to id1"

Is there a stream-and-sort analog of this request-and-answer pattern?

Same picture, but the requests are now prefixed with "~": "found ~ctrs to id1", "aardvark ~ctrs to id1", ..., "today ~ctrs to id1".

"~" is the last printable ASCII character, and with "export LC_COLLATE=C" unix sort will place these request messages after anything else with the same key.

Is there a stream-and-sort analog of this request-and-answer pattern?

The classification logic streams over the test data and emits request messages:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  ...
  today     ~ctr to idi

These requests are then combined and sorted with the counter records (the per-word table of event counts).

A stream-and-sort analog of the request-and-answer pattern

The word table (the record of all event counts for each word) and the request messages are combined and sorted, producing a stream like:

  w          Counts
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=...
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id345
  agent      ~ctr to id34742
  ...
  zynga      C[...]
  zynga      ~ctr to id1

which is fed to the request-handling logic.

A stream-and-sort analog of the request-and-answer pattern

Request-handling logic (stream through the combined, sorted (key, val) pairs):

  previousKey = somethingImpossible
  for each (key, val) in input:
      if key == previousKey:
          Answer(recordForPrevKey, val)
      else:
          previousKey = key
          recordForPrevKey = val

  define Answer(record, request):
      find id where request = "~ctr to id"
      print "id ~ctr for request is record"

A stream-and-sort analog of the request-and-answer pattern

Running this request-handling logic over the combined, sorted stream produces output like:

  id1  ~ctr for aardvark is C[w^Y=sports]=2
  ...
  id1  ~ctr for zynga is ...
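The same request-handling logic as a runnable Python sketch (a toy stand-in for the streaming program; the message formats follow the figures above, everything else is an assumption):

```python
# Runnable toy version of the request-handling logic. Input: the combined,
# sorted stream of (word, value) pairs. For each word, the counter record
# sorts before any "~ctr to <id>" requests, because "~" is the last
# printable ASCII character under LC_COLLATE=C.

def handle_requests(sorted_pairs):
    previous_key, record_for_previous_key = None, None
    for key, val in sorted_pairs:
        if key == previous_key and val.startswith("~ctr to "):
            doc_id = val[len("~ctr to "):]
            print("%s\t~ctr for %s is %s" % (doc_id, key, record_for_previous_key))
        else:
            previous_key, record_for_previous_key = key, val

handle_requests([
    ("aardvark", "C[w^Y=sports]=2"),
    ("aardvark", "~ctr to id1"),
    ("agent",    "C[w^Y=sports]=1027,C[w^Y=worldNews]=564"),
    ("zynga",    "C[w^Y=sports]=21,C[w^Y=worldNews]=4464"),
    ("zynga",    "~ctr to id1"),
])
# id1    ~ctr for aardvark is C[w^Y=sports]=2
# id1    ~ctr for zynga is C[w^Y=sports]=21,C[w^Y=worldNews]=4464
```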

A stream-and-sort analog of the request-and-answer pattern

The request-handling logic turns the combined, sorted stream (aardvark: record, ~ctr to id1; agent: record, ~ctr to id345, ~ctr to id9854, ...; zynga: record, ~ctr to id1) into output like:

  id1  ~ctr for aardvark is C[w^Y=sports]=2
  id1  ~ctr for zynga is ...

Next: combine and sort this output with the test data (id1 "found an aardvark in zynga's farmville today!", id2 ..., id3 ..., id4 ..., id5 ...) — ????

What we'd wanted: for each test example (id1: w1,1 w1,2 w1,3 ... w1,k1; id2: w2,1 w2,2 w2,3 ...; ...), the counters it needs:
  id1: C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...], ...
  id2: C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]
  id3: C[X=w3,1^Y=...]= ...

What we ended up with: for each test example id (the key), a value holding its words plus the answers to its counter requests:
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
        ...
  id2   w2,1 w2,2 w2,3 ...
        ~ctr for w2,1 is ...

Implementation summary

  java CountForNB train.dat > eventcounts.dat
  java CountsByWord eventcounts.dat | sort | java CollectRecords > words.dat
  java requestwordcounts test.dat | cat - words.dat | sort | java answerwordcountrequests | cat - test.dat | sort | testnbusingrequests

train.dat:
  id1  w1,1 w1,2 w1,3 ... w1,k1
  id2  w2,1 w2,2 w2,3 ...
  id3  w3,1 w3,2 ...
  id4  w4,1 w4,2
  id5  w5,1 w5,2
  ...

eventcounts.dat:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=...              2120
  X=w2^Y=...         37
  X=...              3

Implementation summary (the same pipeline, annotated stage by stage)

words.dat looks like this:
  w          Counts associated with w
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  ...
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

The output of requestwordcounts looks like this:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  ...
  today     ~ctr to idi

After "cat - words.dat | sort", the input to answerwordcountrequests looks like this:
  w          Counts
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=...
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id345
  ...

The output of answerwordcountrequests looks like this:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  id1  ~ctr for zynga is ...
  ...

test.dat:
  id1  found an aardvark in zynga's farmville today!
  id2  ...
  id3  ...
  id4  ...
  id5  ...

After "cat - test.dat | sort", the input to testnbusingrequests looks like this:
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
        ...
  id2   w2,1 w2,2 w2,3 ...
        ~ctr for w2,1 is ...

Outline Even more on stream-and-sort and naïve Bayes Another problem: meaningful phrase finding Implementing phrase finding efficiently Some other phrase-related problems

ACL Workshop 2003

Why phrase-finding?
- There are lots of phrases
- There's no supervised data
- It's hard to articulate
  - What makes a phrase a phrase, vs. just an n-gram? A phrase is independently meaningful ("test drive", "red meat") or not ("are interesting", "are lots")
  - What makes a phrase interesting?

The breakdown: what makes a good phrase

Two properties:
- Phraseness: the degree to which a given word sequence is considered to be a phrase
  - Statistics: how often the words co-occur together vs. separately
- Informativeness: how well a phrase captures or illustrates the key ideas in a set of documents; something novel and important relative to a domain
  - Background corpus and foreground corpus; how often phrases occur in each

Phraseness 1, based on BLRT

Binomial Ratio Likelihood Test (BLRT):
- Draw samples: n1 draws with k1 successes, and n2 draws with k2 successes.
- Are they from one binomial (i.e., do k1/n1 and k2/n2 differ only by chance) or from two distinct binomials?
- Define pi = ki/ni, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k). Then

  BLRT(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1) L(p2,k2,n2) / ( L(p,k1,n1) L(p,k2,n2) ) ]

Phraseness 1, based on BLRT

Define pi = ki/ni, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k). For the phrase "x y" (W1=x ^ W2=y):

  φp(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1) L(p2,k2,n2) / ( L(p,k1,n1) L(p,k2,n2) ) ]

  k1 = C(W1=x ^ W2=y)   how often bigram "x y" occurs in corpus C
  n1 = C(W1=x)          how often word x occurs in corpus C
  k2 = C(W1≠x ^ W2=y)   how often y occurs in C after a non-x
  n2 = C(W1≠x)          how often a non-x occurs in C

Does y occur at the same frequency after x as in other positions?

Informativeness 1, based on BLRT

Define pi = ki/ni, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k). For the phrase "x y" (W1=x ^ W2=y) and two corpora, C and B:

  φi(n1, k1, n2, k2) = 2 log [ L(p1,k1,n1) L(p2,k2,n2) / ( L(p,k1,n1) L(p,k2,n2) ) ]

  k1 = C(W1=x ^ W2=y)   how often bigram "x y" occurs in corpus C
  n1 = C(W1=* ^ W2=*)   how many bigrams are in corpus C
  k2 = B(W1=x ^ W2=y)   how often "x y" occurs in the background corpus
  n2 = B(W1=* ^ W2=*)   how many bigrams are in the background corpus

Does "x y" occur at the same frequency in both corpora?
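A Python sketch of both BLRT scores, plugging in the count definitions from the two tables above (the counts in the example calls are invented toy values, and the likelihoods are computed in log space to avoid underflow):

```python
# Sketch of the BLRT-based scores, using the count definitions from the
# phraseness and informativeness tables. Toy counts; log space avoids
# underflow on realistic corpus sizes.
import math

def log_L(p, k, n):
    # log of L(p,k,n) = p^k (1-p)^(n-k), with guards for p = 0 or 1
    if p <= 0.0 or p >= 1.0:
        return 0.0 if k in (0, n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def blrt(k1, n1, k2, n2):
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                  - log_L(p, k1, n1) - log_L(p, k2, n2))

# Phraseness of "x y" in corpus C:
#   k1 = C(W1=x ^ W2=y), n1 = C(W1=x), k2 = C(W1!=x ^ W2=y), n2 = C(W1!=x)
print(blrt(k1=50, n1=200, k2=100, n2=100000))

# Informativeness of "x y", foreground C vs. background B:
#   k1 = C(x y), n1 = #bigrams in C, k2 = B(x y), n2 = #bigrams in B
print(blrt(k1=50, n1=1000000, k2=10, n2=7400000))
```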

The breakdown: what makes a good phrase

Phraseness and informativeness are then combined with a tiny classifier, tuned on labeled data:

  log( p / (1-p) ) = s,  i.e.,  p = 1 / (1 + e^(-s))

Background corpus: 20 newsgroups dataset (20k messages, 7.4M words)
Foreground corpus: rec.arts.movies.current-films, June-Sep 2002 (4M words)

Results?

The breakdown: what makes a good phrase Two properties: Phraseness: the degree to which a given word sequence is considered to be a phrase Statistics: how often words co-occur together vs separately Informativeness: how well a phrase captures or illustrates the key ideas in a set of documents something novel and important relative to a domain Background corpus and foreground corpus; how often phrases occur in each Another intuition: our goal is to compare distributions and see how different they are: Phraseness: estimate x y with bigram model or unigram model Informativeness: estimate with foreground vs background corpus

The breakdown: what makes a good phrase Another intuition: our goal is to compare distributions and see how different they are: Phraseness: estimate x y with bigram model or unigram model Informativeness: estimate with foreground vs background corpus To compare distributions, use KL-divergence Pointwise KL divergence

The breakdown: what makes a good phrase

To compare distributions, use KL-divergence (pointwise KL divergence).

Phraseness: the difference between the bigram and unigram language models in the foreground
- Bigram model: P(x y) = P(x) P(y|x)
- Unigram model: P(x y) = P(x) P(y)

The breakdown: what makes a good phrase

To compare distributions, use KL-divergence (pointwise KL divergence).

Informativeness: the difference between the foreground and background models
- Bigram model: P(x y) = P(x) P(y|x)
- Unigram model: P(x y) = P(x) P(y)

The breakdown: what makes a good phrase

To compare distributions, use KL-divergence (pointwise KL divergence).

Combined: the difference between the foreground bigram model and the background unigram model
- Bigram model: P(x y) = P(x) P(y|x)
- Unigram model: P(x y) = P(x) P(y)
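A Python sketch of the three pointwise-KL scores (in practice the probabilities would come from smoothed language models estimated on the foreground and background corpora; the numbers in the example calls are placeholders):

```python
# Sketch of the pointwise-KL scores. The probabilities passed in would come
# from (smoothed) foreground/background language models; the numbers in the
# example calls are placeholders.
import math

def pointwise_kl(p_xy, q_xy):
    # contribution of the single point "x y" to KL(p || q)
    return p_xy * math.log(p_xy / q_xy)

def phraseness(p_fg_bigram_xy, p_fg_x, p_fg_y):
    # foreground bigram model vs. foreground unigram model
    return pointwise_kl(p_fg_bigram_xy, p_fg_x * p_fg_y)

def informativeness(p_fg_bigram_xy, p_bg_bigram_xy):
    # foreground model vs. background model
    return pointwise_kl(p_fg_bigram_xy, p_bg_bigram_xy)

def combined(p_fg_bigram_xy, p_bg_x, p_bg_y):
    # foreground bigram model vs. background unigram model
    return pointwise_kl(p_fg_bigram_xy, p_bg_x * p_bg_y)

print(phraseness(3e-5, 1e-3, 2e-3))     # "x y" sticks together in C
print(informativeness(3e-5, 1e-6))      # "x y" is over-represented vs. B
print(combined(3e-5, 8e-4, 1.5e-3))
```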

The breakdown: what makes a good phrase

To compare distributions, use KL-divergence. Subtle advantages of pointwise KL over BLRT:
- BLRT scores "more frequent in foreground" and "more frequent in background" symmetrically; pointwise KL does not.
- Phraseness and informativeness scores are more comparable, so a straightforward combination without a classifier is reasonable.
- Language modeling is well-studied (extensions to n-grams, smoothing methods, ...), and we can build on this work in a modular way.

Combined: the difference between the foreground bigram model and the background unigram model.

Pointwise KL, combined

Why phrase-finding?
- Phrases are where the standard supervised bag-of-words representation starts to break down.
- There's no supervised data, so it's hard to see what's right and why.
- It's a nice example of using unsupervised signals to solve a task that could be formulated as supervised learning.
- It's a nice level of complexity, if you want to do it in a scalable way.

Implementation
- Request-and-answer pattern
- Main data structure: tables of key-value pairs
  - key is a phrase "x y"
  - value is a mapping from attribute names (like phraseness, freq-in-B, ...) to numeric values
- Keys and values are just strings
- We'll operate mostly by sending messages to this data structure and getting results back, or else by streaming through the whole table
- For really big data: we'd also need tables where the key is a word w and the value is the set of attributes of the word (freq-in-B, freq-in-C, ...)

Generating and scoring phrases: 1
- Stream through the foreground corpus and count events W1=x ^ W2=y the same way we do in training naive Bayes: stream-and-sort and accumulate deltas (a "sum-reduce").
  - Don't bother generating boring phrases (e.g., ones that cross a sentence boundary or contain a stopword, ...)
- Then stream through the output and convert it to phrase, attributes-of-phrase records with one attribute: freq-in-C=n
- Stream through the foreground corpus and count events W1=x in a (memory-based) hashtable.
- This is enough* to compute phraseness:
    ψp(x y) = f( freq-in-C(x), freq-in-C(y), freq-in-C(x y) )
  so you can do that with a scan through the phrase table that adds an extra attribute (holding the word frequencies in memory).

* actually you also need the total number of words and the total number of phrases.
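A toy Python sketch of the phrase-counting step (the tokenisation, the stopword list, and the record format are all assumptions; an in-memory sort again stands in for unix sort):

```python
# Toy sketch of the phrase-counting step: emit one (bigram, 1) message per
# adjacent word pair that is not "boring", sum-reduce the messages, and
# write "phrase<TAB>freq-in-C=n" records.
from itertools import groupby

STOPWORDS = {"a", "an", "the", "of", "in", "and", "are", "was"}   # toy list

def bigram_messages(sentences):
    for sent in sentences:
        words = sent.lower().split()
        for x, y in zip(words, words[1:]):
            if x in STOPWORDS or y in STOPWORDS:   # skip boring phrases
                continue
            yield ("%s %s" % (x, y), 1)

def phrase_records(sentences):
    msgs = sorted(bigram_messages(sentences))      # the "sort" step
    for phrase, grp in groupby(msgs, key=lambda m: m[0]):
        yield "%s\tfreq-in-C=%d" % (phrase, sum(c for _, c in grp))

for rec in phrase_records(["I took a test drive today",
                           "the test drive was fun"]):
    print(rec)   # includes "test drive<TAB>freq-in-C=2"
```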

Generating and scoring phrases: 2
- Stream through the background corpus, count events W1=x ^ W2=y, and convert them to phrase, attributes-of-phrase records with one attribute: freq-in-B=n
- Sort the two phrase tables (freq-in-B and freq-in-C) together and run the output through another reducer that appends together all the attributes associated with the same key, so we now have records carrying both freq-in-B and freq-in-C for each phrase.

Generating and scoring phrases: 3
- Scan through the phrase table one more time and add the informativeness attribute and the overall quality attribute.

Summary, assuming the word vocabulary nW is small:
- Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
- Compute phraseness: O(mC)
  - assumes the word counts fit in memory
- Scan background corpus B for phrases: O(nB), producing mB phrase records
- Sort together and combine records: O(m log m), m = mB + mC
- Compute informativeness and combined quality: O(m)

Ramping it up: keeping word counts out of memory
- Goal: records for "x y" with attributes freq-in-B, freq-in-C, freq-of-x-in-C, freq-of-y-in-C, ...
- Assume I have built phrase tables and word tables ... how do I incorporate the word attributes into the phrase records?
- For each phrase "x y", request the necessary word frequencies:
  - Print "x ~request=freq-in-C,from=xy"
  - Print "y ~request=freq-in-C,from=xy"
- Sort all the word requests in with the word tables
- Scan through the result and generate the answers: for each word w, a1=n1, a2=n2, ...
  - Print "xy ~request=freq-in-C,from=w"
- Sort the answers in with the "x y" records
- Scan through and augment the "x y" records appropriately
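A Python sketch of this word-frequency join via request-and-answer (record formats are assumptions; only the request-generation and answering steps are shown, and each unix sort is simulated in memory):

```python
# Sketch of the word-frequency join via request-and-answer. Record formats
# are assumptions; an in-memory sort plays the role of each unix sort.
from itertools import groupby

def word_requests(phrases):
    # one request message per word of each phrase "x y"
    for xy in phrases:
        for w in xy.split():
            yield (w, "~request=freq-in-C,from=%s" % xy)

def answer_requests(word_table, requests):
    # word_table: {word: "freq-in-C=n"}. Merge word records with requests;
    # for each word the plain record sorts before its "~request" messages.
    merged = sorted(list(word_table.items()) + list(requests))
    for word, grp in groupby(merged, key=lambda kv: kv[0]):
        vals = [v for _, v in grp]
        record = vals[0]
        for v in vals[1:]:
            xy = v.split("from=")[1]
            yield (xy, "freq-of-%s-in-C=%s" % (word, record.split("=")[1]))

word_table = {"test": "freq-in-C=120", "drive": "freq-in-C=45"}
for answer in answer_requests(word_table, word_requests(["test drive"])):
    print(answer)
# ('test drive', 'freq-of-drive-in-C=45')
# ('test drive', 'freq-of-test-in-C=120')
```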

Generating and scoring phrases: 3 — Summary
1. Scan foreground corpus C for phrases and words: O(nC), producing mC phrase records and vC word records
2. Scan phrase records, producing word-freq requests: O(mC), producing 2mC requests
3. Sort requests with word records: O((2mC + vC) log(2mC + vC)) = O(mC log mC), since vC < mC
4. Scan through and answer requests: O(mC)
5. Sort answers with phrase records: O(mC log mC)
6. Repeat 1-5 for the background corpus: O(nB + mB log mB)
7. Combine the two phrase tables: O(m log m), m = mB + mC
8. Compute all the statistics: O(m)

More cool work with phrases

Turney: "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." ACL '02.
- Task: review classification (65-85% accurate, depending on domain)
- Identify candidate phrases (e.g., adjective-noun bigrams, using POS tags)
- Figure out the semantic orientation of each phrase using pointwise mutual information, and aggregate:

  SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')
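A Python sketch of Turney's semantic-orientation score (the paper estimates PMI from web search hit counts near the anchor words; the counts below are placeholders supplied by the caller):

```python
# Sketch of Turney-style semantic orientation from co-occurrence counts.
# The hit counts below are placeholders; in the paper they come from web
# search queries pairing the phrase with the anchor words.
import math

def pmi(hits_both, hits_phrase, hits_anchor, total):
    # PMI(phrase, anchor) = log2( P(phrase & anchor) / (P(phrase) P(anchor)) )
    return math.log2((hits_both / total) /
                     ((hits_phrase / total) * (hits_anchor / total)))

def semantic_orientation(c, total):
    # c: hit counts for the phrase alone and near "excellent" / "poor"
    return (pmi(c["with_excellent"], c["phrase"], c["excellent"], total)
            - pmi(c["with_poor"], c["phrase"], c["poor"], total))

counts = {"phrase": 2000, "excellent": 500000, "poor": 400000,
          "with_excellent": 120, "with_poor": 15}
print(semantic_orientation(counts, total=10**9))   # positive => "thumbs up"
```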

Answering Subcognitive Turing Test Questions: A Reply to French - Turney

More from Turney (figure: example phrases grouped by semantic orientation — LOW, HIGHER, HIGHEST)

More cool work with phrases

"Locating Complex Named Entities in Web Text." Doug Downey, Matthew Broadhead, and Oren Etzioni, IJCAI 2007.
- Task: identify complex named entities like "Proctor and Gamble", "War of 1812", "Dumb and Dumber", "Secretary of State William Cohen", ...
- Formulation: decide whether or not to merge nearby sequences of capitalized words "a x b", using a variant of the statistic ck: for k=1, ck is PMI (without the log); for k=2, ck is Symmetric Conditional Probability

Downey et al results

Outline Even more on stream-and-sort and naïve Bayes Request-answer pattern Another problem: meaningful phrase finding Statistics for identifying phrases (or more generally correlations and differences) Also using foreground and background corpora Implementing phrase finding efficiently Using request-answer Some other phrase-related problems Semantic orientation Complex named entity recognition