Words vs. Terms. Information Retrieval cares about terms: you search for 'em, Google indexes 'em.

Transcription:

Words vs. Terms
(600.465 Intro to NLP, slides by J. Eisner)

Information Retrieval cares about terms. You search for 'em, Google indexes 'em.
Query: What kind of monkeys live in Costa Rica?

Words vs. Terms
What kind of monkeys live in Costa Rica? Words? Content words? Word stems? Word clusters? Multi-word phrases? Thematic content (e.g., it's a "habitat" question)?

Finding Phrases ("collocations")
kick the bucket, directed graph, iambic pentameter, Osama bin Laden, United Nations, real estate, quality control, international best practice
These have their own meanings, translations, etc.

Finding Phrases ("collocations")
Just use common bigrams? Doesn't work:
  80871  of the
  58841  in the
  26430  to the
  15494  to be
  12622  from the
  11428  New York
  10007  he said
Possible correction: just drop function words!

Finding Phrases ("collocations")
Just use common bigrams? Better correction: filter by tag patterns such as A N, N N, N P N (adjective/noun/preposition sequences):
  11487  New York
   7261  United States
   5412  Los Angeles
   3301  last year
   1074  chief executive
   1073  real estate
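The bigram-counting-plus-tag-filter idea above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: `tagged_tokens` is a hypothetical list of (word, POS-tag) pairs, and the pattern set is the A N / N N filter from the slide (the three-tag N P N pattern would need trigram handling).

```python
from collections import Counter

def collocation_candidates(tagged_tokens, patterns=(("A", "N"), ("N", "N"))):
    """Count adjacent word pairs, keeping only those whose tag sequence
    matches one of the allowed patterns (e.g. adjective+noun, noun+noun)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return counts.most_common()   # e.g. [(("New", "York"), 11487), ...] on the NYT data
```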

Finding Phrases ("collocations")
(data from the Manning & Schütze textbook: 14 million words of the NY Times)
Still want to filter out "new companies": these words occur together reasonably often, but only because both are frequent. Do they occur more often (among the N word pairs) than you would expect by chance?
Expected by chance: p(new) p(companies). Actually observed: p(new companies) = p(new) p(companies | new).
Two ways to compare the two: (pointwise) mutual information, or a binomial significance test.

(Pointwise) Mutual Information
Contingency table of bigram counts (N = 14,307,676):

              companies                   not companies                  TOTAL
  new         8 ("new companies")         15,820 ("new machines")        15,828
  not new     4,667 ("old companies")     14,287,181 ("old machines")    14,291,848
  TOTAL       4,675                       14,303,001                     14,307,676

Is p(new companies) = p(new) p(companies)?
  I = log2 [ p(new companies) / (p(new) p(companies)) ]
    = log2 [ (8/N) / ((15828/N)(4675/N)) ] = log2 1.55 = 0.63
I > 0 if and only if p(companies | new) > p(companies) > p(companies | not new).
Here I is positive but small; it would be larger for stronger collocations.

Significance Tests
Sparse data: in fact, suppose we divided all the counts by 8:

              companies    not companies    TOTAL
  new         1            1,978            1,979
  not new     583          1,785,898        1,786,481
  TOTAL       584          1,787,876        1,788,460

Would I change? No, yet we should be less confident that it's a real collocation. (Extreme case: what happens if 2 novel words appear next to each other?) So do a significance test! It takes sample size into account.

Assume 2 coins were used when generating the text. Following "new," we flip coin A to decide whether "companies" is next; following any other word, we flip coin B to decide whether "companies" is next.
We can see that A was flipped 15,828 times and got 8 heads. The probability of this is p^8 (1-p)^15820 * 15828! / (8! 15820!).
We can see that B was flipped 14,291,848 times and got 4,667 heads.
Our question: do the two coins have different weights? (Equivalently, are there really two separate coins, or just one?)

Null hypothesis: same coin.
  Assume p_null(companies | new) = p_null(companies | not new) = p_null(companies) = 4675/14307676.
  p_null(data) = p_null(8 out of 15,828) * p_null(4,667 out of 14,291,848) = .00042
Collocation hypothesis: different coins.
  Assume p_coll(companies | new) = 8/15828 and p_coll(companies | not new) = 4667/14291848.
  p_coll(data) = p_coll(8 out of 15,828) * p_coll(4,667 out of 14,291,848) = .00081
So the collocation hypothesis doubles p(data). We can sort bigrams by the log-likelihood ratio, log p_coll(data)/p_null(data): i.e., how sure are we that "companies" is more likely after "new"?

Now the same test on the counts divided by 8:
Null hypothesis: same coin.
  Assume p_null(companies | new) = p_null(companies | not new) = p_null(companies) = 584/1788460.
  p_null(data) = p_null(1 out of 1,979) * p_null(583 out of 1,786,481) = .0056
Collocation hypothesis: different coins.
  Assume p_coll(companies | new) = 1/1979 and p_coll(companies | not new) = 583/1786481.
  p_coll(data) = p_coll(1 out of 1,979) * p_coll(583 out of 1,786,481) = .0061
The collocation hypothesis still increases p(data), but only slightly now. If we don't have much data, the 2-coin model can't be much better at explaining it. The pointwise mutual information is as strong as before, but it is based on much less data, so it's now reasonable to believe the null hypothesis that it's a coincidence.
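The numbers above can be reproduced with a short script. A minimal sketch, assuming the usual inputs (bigram count c12, unigram counts c1 and c2, corpus size N); the binomial likelihoods are computed in log space, which the slide's small example doesn't need but real corpora do.

```python
import math

def pmi(c12, c1, c2, N):
    """Pointwise mutual information: log2 of p(w1 w2) / (p(w1) p(w2))."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

def log_likelihood_ratio(c12, c1, c2, N):
    """log p_coll(data) - log p_null(data) under the two-coin story above."""
    def log_binom(k, n, p):
        # log of C(n, k) * p^k * (1-p)^(n-k)
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    p_null = c2 / N                   # one coin for every context
    p_a = c12 / c1                    # coin A: p(w2 | w1)
    p_b = (c2 - c12) / (N - c1)       # coin B: p(w2 | any other word)
    log_null = log_binom(c12, c1, p_null) + log_binom(c2 - c12, N - c1, p_null)
    log_coll = log_binom(c12, c1, p_a) + log_binom(c2 - c12, N - c1, p_b)
    return log_coll - log_null

# The "new companies" table: PMI is about 0.63 bits, and the likelihood ratio
# corresponds to the slide's .00081 vs. .00042 (roughly a factor of 2).
print(pmi(8, 15828, 4675, 14307676))
print(log_likelihood_ratio(8, 15828, 4675, 14307676))
```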

Back on the full data: p_null(data) = .00042 and p_coll(data) = .00081. Does this mean the collocation hypothesis is twice as likely? No, because it is far less probable a priori (most bigrams ain't collocations). By Bayes' rule, p(coll | data) = p(coll) * p(data | coll) / p(data), which isn't twice p(null | data).

Function vs. Content Words
Might want to eliminate function words, or reduce their influence on a search. Tests for whether a word is a content word (a small sketch follows this block):
Does it appear rarely? No: c(beneath) < c(Kennedy) and c(aside) << c(oil) in the WSJ.
Does it appear in only a few documents? Better: Kennedy tokens are concentrated in a few docs. This is the traditional solution in IR.
Does its frequency vary a lot among documents? Best: content words come in bursts (when it rains, it pours?). The probability of Kennedy is increased if Kennedy appeared in the preceding text (it is a "self-trigger"), whereas beneath isn't.

A trick from Information Retrieval
Each document in the corpus is a length-k vector (or each paragraph, or whatever):
  (0, 3, 3, 1, 0, 7, ..., 1, 0)   -- a single document
Plot all the documents in the corpus. Each vector is pretty sparse, so pretty noisy! Wish we could smooth it: what would the document look like if it rambled on forever?

The reduced plot is a perspective drawing of the true plot: it projects the true plot onto a few axes, and the best choice of axes shows the most variation in the data. It is found by linear algebra: Principal Components Analysis (PCA).
The SVD plot allows the best possible reconstruction of the true plot (i.e., it can recover the 3-D coordinates with minimal distortion). It ignores variation along the axes that it didn't pick, in the hope that that variation is just noise we want to ignore.
[Plot: documents drawn in a reduced 2-D space whose axes are theme A and theme B.]
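A minimal sketch of the three content-word diagnostics listed above, assuming a hypothetical corpus `docs` given as a list of token lists; "burstiness" is measured here as the variance of the word's per-document relative frequency, one simple proxy for "frequency varies a lot among documents."

```python
def content_word_stats(docs, word):
    """Raw frequency, document frequency, and a burstiness proxy for one word."""
    per_doc_freq = []
    total = 0
    doc_freq = 0
    for doc in docs:                          # each doc is a list of tokens
        c = doc.count(word)
        total += c
        doc_freq += (c > 0)
        per_doc_freq.append(c / max(len(doc), 1))
    mean = sum(per_doc_freq) / len(per_doc_freq)
    variance = sum((f - mean) ** 2 for f in per_doc_freq) / len(per_doc_freq)
    return {"count": total, "doc_freq": doc_freq, "burstiness": variance}

# "Kennedy" should show a lower doc_freq and a higher burstiness than an
# equally frequent function-like word such as "beneath".
```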

SVD finds a small number of "theme" vectors and approximates each doc as a linear combination of themes. The coordinates in the reduced plot are the linear coefficients: how much of theme A is in this document? How much of theme B? Each theme is a collection of words that tend to appear together.

The new coordinates might actually be useful for Information Retrieval. To compare 2 documents, or a query and a document: project both into the reduced space. Do they have themes in common? Even if they have no words in common!

Themes extracted for IR might help sense disambiguation. Each word is like a tiny document: (0, 0, 0, 1, 0, 0, ...). Express the word as a linear combination of themes; each theme corresponds to a sense? E.g., "Jordan" has Mideast and Sports themes (plus an Advertising theme, alas, which is the same sense as Sports). A word's sense in a document: which of its themes are strongest in the document? This groups senses as well as splitting them: one word has several themes, and many words have the same theme.

A perspective on Principal Components Analysis (PCA): imagine an electrical circuit that connects terms to docs, where each connection has a weight given by the matrix of strengths (how strong is each term in each document?).
Which docs is term 5 strong in? Which docs are terms 5 and 8 strong in? Docs 2, 5, and 6 light up strongest: this answers a query consisting of terms 5 and 8!
It's really just matrix multiplication: term vector (the query) x strength matrix = doc vector.
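A sketch of that "really just matrix multiplication" point, with a small made-up term-by-document strength matrix (rows = terms, columns = docs); the specific numbers are illustrative, not the lecture's.

```python
import numpy as np

k_terms, n_docs = 10, 8
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(k_terms, n_docs)).astype(float)  # term-by-doc strengths

# A query consisting of terms 5 and 8 is just a term vector with 1s in those slots.
query = np.zeros(k_terms)
query[[5, 8]] = 1.0

doc_scores = query @ M              # term vector (query) x strength matrix = doc vector
print(np.argsort(-doc_scores)[:3])  # the docs that "light up" strongest
```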

Conversely, what terms are strong in document 5? Multiplying in the other direction gives doc 5's coordinates!

SVD approximates the matrix M by a smaller 3-layer network: it forces the sparse data through a bottleneck of a few "theme" units, smoothing it.
That is, smooth the sparse data M by the approximation M ≈ AB': A encodes the "camera angle," and B gives each doc's new coordinates.

This view is completely symmetric: regard A and B as projecting terms and docs into a low-dimensional theme space where their similarity can be judged. That lets us cluster documents (helps the sparsity problem!), cluster words, compare a word with a doc, identify a word's themes with its senses (sense disambiguation by looking at the document's senses), and identify a document's themes with its topics (topic categorization).

If you've seen SVD before: SVD actually decomposes M = A D B' exactly, where A is the "camera angle" (orthonormal), D is diagonal, and B is orthonormal.
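A sketch of the decomposition and the theme-space projection described above, using numpy's SVD on a small random term-by-document matrix; exactly how the diagonal D is split between the term side and the doc side is a modeling choice, not something the slides fix.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.poisson(0.5, size=(12, 9)).astype(float)   # term-by-document counts

A, d, Bt = np.linalg.svd(M, full_matrices=False)   # M = A @ np.diag(d) @ Bt exactly
j = 3                                              # keep only j "theme" units
A_j, d_j, Bt_j = A[:, :j], d[:j], Bt[:j, :]

term_coords = A_j * d_j                            # each term as a point in theme space
doc_coords = (d_j[:, None] * Bt_j).T               # each doc as a point in theme space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# A word and a document can be similar in theme space even if the word
# never occurs in that document's raw count vector.
print(cosine(term_coords[0], doc_coords[4]))
```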

If you've seen SVD before: keep only the largest j < k diagonal elements of D. This gives the best possible approximation to M using only j blue (theme) units.
To simplify the picture, we can absorb the diagonal D into B (redefining B as BD), so that M ≈ A(DB') = AB'.
How should you pick j (the number of blue units)? Just like picking the number of clusters: see how well the system works with each j (on held-out data).

[Figure-only example slides: "Political Dimensions" and "Dimensions of Personality," illustrating interpretable latent dimensions.]
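One concrete way (an assumption on my part, not spelled out on the slides) to "pick j on held-out data": hide a random subset of matrix entries, truncate the SVD of the rest at each candidate j, and keep the j that reconstructs the hidden entries best. Retrieval accuracy on held-out queries would be the more task-faithful criterion; this is just the simplest stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.poisson(0.5, size=(20, 15)).astype(float)

mask = rng.random(M.shape) < 0.1          # ~10% of entries held out
M_train = M.copy()
M_train[mask] = 0.0                       # crude imputation for the hidden cells

A, d, Bt = np.linalg.svd(M_train, full_matrices=False)
for j in range(1, 8):
    M_hat = (A[:, :j] * d[:j]) @ Bt[:j, :]                 # rank-j reconstruction
    err = np.sqrt(np.mean((M_hat[mask] - M[mask]) ** 2))   # error on held-out cells only
    print(j, round(err, 3))               # pick the j with the lowest held-out error
```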