Inverse Document Frequency (IDF): A Measure of Deviations from Poisson


Kenneth W. Church and William A. Gale
AT&T Bell Laboratories
Murray Hill, NJ, USA 07974
kwc@research.att.com

Abstract

Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like somewhat and boycott. Both somewhat and boycott appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).

1. Document frequency is similar to word frequency, but different

Word frequency is commonly used in all sorts of natural language applications. The practice implicitly assumes that words (and ngrams) are distributed by a single-parameter distribution such as a Poisson or a Binomial. But we find that these distributions do not fit the data very well. Both the Poisson and the Binomial assume that the variance over documents is no larger than the mean, and yet we find that it can be quite a bit larger, especially for interesting words such as boycott, where hidden variables such as topic conspire to undermine the independence assumption behind the Poisson and the Binomial. Much better fits are obtained by introducing a second parameter such as inverse document frequency (IDF).

Inverse document frequency (IDF) is commonly used in Information Retrieval (Sparck Jones, 1972). IDF is defined as −log2(df_w / D), where D is the number of documents in the collection and df_w is the document frequency, the number of documents that contain w. Obviously, there is a strong relationship between document frequency, df_w, and word frequency, f_w. The relationship is shown in Figure 1, a plot of log10 f_w and IDF for 193 words selected from a 50 million word corpus of 1989 Associated Press (AP) Newswire stories (D = 85,432 stories). Although log10 f_w is highly correlated with IDF (ρ = −0.994), it would be a mistake to assume that the two variables are completely predictable from one another. Indeed, the experience of the Information Retrieval community has indicated that IDF is a very useful quantity. Attempts to replace IDF with f_w (or some simple transform of f_w) have not been very successful. Figure 2 shows one such attempt. It compares the observed IDF with ÎDF, an estimate based on f. Assume that a document is merely a bag of words with no interesting structure (content). Words are randomly generated by a Poisson process, π. The probability of k instances of a word w is π(θ, k), where θ = f_w / D:

    π(θ, k) = e^(−θ) θ^k / k!   for k = 0, 1, ...   (Poisson)

In particular, the probability that w will not be found in a document is π(θ, 0). Conversely, the probability of at least one w is 1 − π(θ, 0). And therefore, IDF ought to be:

    ÎDF = −log2(1 − π(θ, 0)) = −log2(1 − e^(−θ))   (Predicted IDF)

Figure 2 compares IDF with ÎDF. Note that ÎDF is systematically too low, indicating that the predictions are missing crucial generalizations. Documents are more than just a bag of words. The prediction errors are shown in more detail in Figure 3, which plots the residual IDF (the difference between observed and predicted IDF) as a function of log10 f_w for the same 193 words shown in Figure 2. The prediction errors are relatively large in the middle of the frequency range, and smaller at both ends. Unfortunately, we believe the words in the middle are often the most important words for Information Retrieval purposes.

[Figure 1: IDF is highly correlated with log frequency (ρ = −0.994). The circles show log10 f and IDF for 193 words selected from a corpus of 1989 Associated Press Newswire stories (D = 85,432).]

2. A Good Keyword is far from Poisson

To get a better look at the crucial differences between IDF and f in the middle of the frequency range (f ≈ 10^3), we selected a set of 53 words with 1000 < f < 1020 in the 1989 AP corpus for further investigation. The 53 words are shown in Table 1, sorted by df. Note that the words near the top of the list tend to be more appropriate for use in an information retrieval system than the words toward the bottom of the list. Stories that mention the word boycott, for example, are likely to be about boycotts. In contrast, stories that mention the word somewhat could be about practically anything. [1]
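Before turning to those 53 words, it may help to make the quantities above concrete. The following is a minimal sketch (an illustration added here, not code from the paper), using D = 85,432 and the frequencies and document frequencies for boycott (f = 1009, df = 676) and somewhat (f = 1013, df = 979) quoted in this paper:

    import math

    D = 85_432  # number of documents in the 1989 AP corpus

    def idf(df, D=D):
        # Observed IDF: -log2(df / D), where df is the number of
        # documents containing the word.
        return -math.log2(df / D)

    def predicted_idf(f, D=D):
        # IDF predicted under a Poisson with theta = f / D.
        theta = f / D
        return -math.log2(1.0 - math.exp(-theta))

    # Counts quoted in the text: (word, frequency f, document frequency df).
    words = [("boycott", 1009, 676), ("somewhat", 1013, 979)]

    for word, f, df in words:
        expected_df = D * (1.0 - math.exp(-f / D))  # documents expected by chance
        print(f"{word}: observed IDF {idf(df):.1f} bits, "
              f"predicted IDF {predicted_idf(f):.1f} bits, "
              f"expected df {expected_df:.0f} vs observed df {df}")
    # -> boycott:  7.0 vs 6.4 bits; ~1003 documents expected vs 676 observed
    # -> somewhat: 6.4 vs 6.4 bits; ~1007 documents expected vs 979 observed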

[Figure 2: The observed IDF is systematically higher than what would be expected under a Poisson, −log2(1 − e^(−f/D)). All but 6 of the circles fall below the x = y line. The data are the same as in Figure 1.]

Why is IDF such a useful quantity? One might try to answer the question in terms of information theory (Shannon, 1948). IDF can be thought of as the usefulness, in bits, of a keyword to a keyword retrieval system. If we tell you that the document we are looking for has the keyword boycott, then we have narrowed the search space down to just 676 of the D documents. But this answer doesn't explain the fundamental difference between boycott and somewhat. boycott has an IDF of −log2(676/D) = 7.0 bits, only a little more than somewhat, which has an IDF of −log2(979/D) = 6.4. And yet boycott is a reasonable keyword and somewhat is not.

A good keyword, like boycott, picks out a very specific set of documents. The problem with somewhat is that it behaves almost like chance (Poisson). Under a Poisson, the 1013 instances of somewhat should be found in approximately D(1 − π(θ, 0)) = D(1 − π(1013/85432, 0)) ≈ 1007 documents. In fact, somewhat was found in 979 documents, only a little less than what would have been expected by chance. Good keywords tend to bunch up into many fewer documents. boycott, for example, bunches up into only 676 documents, much less than chance (D(1 − π(1009/85432, 0)) ≈ 1003). Almost all words are more interesting in this sense than Poisson, but good keywords like boycott are a lot more interesting than Poisson, and crummy ones like somewhat are only a little more interesting than Poisson.

[1] There is a weak tendency for nouns to appear higher on the list than non-nouns, though the tendency is too weak to explain the pattern of systematic deviations from Poisson. In addition, there are plenty of exceptions in both directions: rape, poll, grants, code and premier are not necessarily nouns, and sweeping, leads, bound and worry are not necessarily non-nouns.

[Figure 3: The prediction errors are systematically positive. The errors (observed IDF minus predicted IDF, ranging from 0.0 to about 1.2) tend to be larger in the middle of the frequency range (Germans) and smaller at both ends (Fromm, which). The data are the same as in Figures 1-2.]

On this account, a good keyword is one that behaves very differently from the null hypothesis (Poisson). We conjecture that the best keywords tend to be found toward the middle of the frequency range, where there are relatively large deviations from Poisson, as illustrated in Figure 3. This hypothesis runs counter to the standard practice in Information Retrieval of weighting words by IDF, favoring extremely rare words, no matter how they are distributed.

Of course, IDF is but one of many ways to show deviations from chance. Figure 4 shows the distributions for boycott and somewhat. Note that somewhat is much closer to Poisson in almost any sense of closeness that one might consider. Three measures of closeness are presented in Table 2: IDF, variance (σ²), and entropy (H). Table 2 compares the top 10 words in Table 1 (labeled "better keywords") with the bottom 10 words in Table 1 (labeled "worse keywords"). The better keywords have more IDF, more variance and less entropy than what would be expected under a Poisson with θ = f/D ≈ 1000/85,432 ≈ 0.012.

3. How robust are these deviations from chance?

We were concerned that the crucial deviations from Poisson behavior might not hold up if we looked at another corpus of similar material. Figure 5 shows the word boycott in five different years of the AP news. The fat tails show up in each of the five years. Clearly, the non-Poisson phenomenon is robust. Figures 6 and 7 compare IDF and log10 σ² for the 53 words in Table 1, and find that IDF and log10 σ² are reasonably stable across years. The correlations of IDF and log10 σ² across years are presented in Tables 3-4. All of the correlations are quite large. The correlations for IDF are perhaps somewhat larger than those for log10 σ², suggesting that IDF may be somewhat more robust, which is not surprising given that empirical estimates of variance are notoriously subject to outliers.

Table 1: More IDF (less df), more content

  df_w  word          df_w  word           df_w  word          df_w  word
  435   governors     724   poll           827   unity         937   worry
  506   festival      740   restaurants    845   bed           940   containing
  551   gang          745   grants         847   coastal       946   explained
  553   bullion       752   scheme         851   educational   951   bound
  563   attendants    754   code           853   lying         953   leads
  623   rape          761   premier        853   neighbor      955   happens
  639   palace        775   wire           863   tragedy       960   improving
  676   boycott       781   customer       867   acquire       960   welcomed
  687   routes        783   rooms          874   restored      961   triggered
  690   incentives    786   engineering    905   legitimate    966   sweeping
  695   poverty       803   color          910   deliver       968   fairly
  718   donations     811   possession     914   types         969   heading
  722   lawsuits      815   projected      929   reject        979   somewhat
                                                               986   noting

[Figure 4: Most words have a fatter tail than Poisson (solid line). The deviations from Poisson are more salient for good keywords like boycott than for crummy keywords like somewhat. (y-axis: number of documents = D · Pr(k), log scale from 1 to 10,000; x-axis: k from 0 to 8.)]

None of the correlations in Tables 3 and 4 can be attributed to word frequency effects, since the 53 words were all chosen with almost the same 1989 frequency. In general, the correlations in Tables 3-4 are larger near the diagonal, suggesting that estimates degrade over time. If you want to predict next year's IDF, it is better to use this year's estimate than a ten-year-old estimate.

Table 2: Good keywords have more IDF, more variance and less entropy than Poisson

  Better Keywords                       Worse Keywords
  IDF   var    entropy  word            IDF   var    entropy  word
  7.6   0.060  0.057    governors       6.5   0.013  0.092    leads
  7.4   0.044  0.064    festival        6.5   0.013  0.092    happens
  7.3   0.043  0.067    gang            6.5   0.013  0.092    improving
  7.3   0.028  0.068    bullion         6.5   0.013  0.092    welcomed
  7.2   0.042  0.068    attendants      6.5   0.013  0.092    triggered
  7.1   0.032  0.073    rape            6.5   0.013  0.093    sweeping
  7.1   0.028  0.074    palace          6.5   0.013  0.093    fairly
  7.0   0.027  0.077    boycott         6.5   0.013  0.093    heading
  7.0   0.026  0.078    routes          6.4   0.013  0.093    somewhat
  7.0   0.025  0.078    incentives      6.4   0.012  0.092    noting
  6.4   0.012  0.092    Poisson         6.4   0.012  0.092    Poisson

[Figure 5: The strong deviations from Poisson for the word boycott show up very clearly in the AP in 1988, 1989, 1990, 1991 and 1992 (dotted lines). Katz's K-mixture (Katz, personal communication), the solid line labelled K, fits the data better than the Poisson. (y-axis: number of documents = D · Pr(k), log scale; x-axis: k from 0 to 6.)]

Another way to confirm that our measurements of IDF, variance and H have consequences across years in the AP data is to note that measurements of IDF, variance and H in 1989 can be used to predict word frequency in some other year. The correlations are shown in Table 5. They may not be large, but they are too large to be due to chance, and they all point in the same direction. The correlations cannot be attributed to variations in frequency in 1989, since all 53 words have almost the same 1989 frequency. Clearly, there are some interesting systematic relationships between IDF/variance/H and f that hold up to replication across multiple years in the AP, measurement errors, and other sources of noise.
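All three measures (IDF, variance, entropy) are functions of a word's count distribution over documents, Pr(k). As a minimal sketch (an illustration, not the paper's code), the Poisson baseline row of Table 2 can be reproduced as follows, taking θ = 1000/85,432 ≈ 0.012 and measuring entropy in bits:

    import math

    D = 85_432
    theta = 1000 / D  # about 0.012

    def poisson_pmf(theta, k):
        return math.exp(-theta) * theta**k / math.factorial(k)

    pr = {k: poisson_pmf(theta, k) for k in range(10)}  # Pr(k), truncated at k = 9

    mean = sum(k * p for k, p in pr.items())
    var = sum((k - mean) ** 2 * p for k, p in pr.items())      # variance
    H = -sum(p * math.log2(p) for p in pr.values() if p > 0)   # entropy in bits
    idf = -math.log2(1.0 - pr[0])                              # IDF from Pr(0)

    print(f"IDF = {idf:.1f}, var = {var:.3f}, H = {H:.3f}")
    # -> IDF = 6.4, var = 0.012, H = 0.092, matching the Poisson rows of Table 2

For an observed word, the same measures would be computed from its empirical Pr(k), i.e., the fraction of documents containing exactly k instances.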

[Figure 6: IDF in one year of the AP is very predictive of IDF in another (for the 53 words in Table 1). Each scatter plot compares IDF in one year with IDF in another. The fact that most of the points line up fairly well indicates that IDF values are strongly correlated across years. The correlations are shown in Table 3.]

4. Katz K-mixture

Clearly, the Poisson does not fit our data very well, especially for good keywords like boycott. This is, however, a negative result. Can we say something more constructive? Katz (personal communication) proposed the following alternative to the Poisson, where Pr_K(k) is the probability of k instances of w in a document:

    Pr_K(k) = (1 − α) δ_{k,0} + (α / (β + 1)) (β / (β + 1))^k   (K-mixture)

δ_{k,0} is 1 when k = 0, and 0 otherwise. The Katz K-mixture distribution can be thought of as a mixture of Poissons. Suppose that, within documents, boycott is distributed by a Poisson process, but, across documents, the Poisson parameter θ is allowed to vary from one document to another depending on how much the document is about boycotts. In other words, Pr_K(k) can be expressed as a mixture of Poissons with a density function φ:

    Pr(k) = ∫₀^∞ φ(θ) π(θ, k) dθ   for k = 0, 1, ...   (Poisson Mixture)

In this way, the θs can depend on an infinite number of unknowable hidden variables, e.g., what the documents are about, who wrote them, when they were written, what was going on in the world when they were written, etc., but we don't need to know these dependencies for any particular document. All we need to know is φ, the density of θs, aggregated over all possible combinations of hidden variables.
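As a sketch (again an added illustration), the closed form is a one-liner; the α and β values below are illustrative placeholders, with fitting from data shown in the next section:

    def pr_k(k, alpha, beta):
        # Katz K-mixture: (1 - alpha) * delta_{k,0}
        #                 + alpha/(beta + 1) * (beta/(beta + 1))^k
        delta = 1.0 if k == 0 else 0.0
        return (1.0 - alpha) * delta + (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k

    # Sanity check: the geometric series sums to 1.
    alpha, beta = 0.024, 0.49  # illustrative values
    print(round(sum(pr_k(k, alpha, beta) for k in range(200)), 6))  # -> 1.0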

[Figure 7: log10 σ² is also predictable from one year to the next, though maybe not as predictable as IDF (for the 53 words in Table 1). The correlations are shown in Table 4.]

Table 3: Correlations of IDF across years

          1988   1989   1990   1991   1992
  1988      -    0.80   0.76   0.68   0.60
  1989    0.80     -    0.75   0.67   0.48
  1990    0.76   0.75     -    0.85   0.76
  1991    0.68   0.67   0.85     -    0.84
  1992    0.60   0.48   0.76   0.84     -

Table 4: Correlations of log var across years

          1988   1989   1990   1991   1992
  1988      -    0.74   0.61   0.25   0.67
  1989    0.74     -    0.73   0.42   0.51
  1990    0.61   0.73     -    0.50   0.61
  1991    0.25   0.42   0.50     -    0.62
  1992    0.67   0.51   0.61   0.62     -

Table 5: Correlations of IDF, log var and H in 1989 with log f in other years

                 1988 log f   1990 log f   1991 log f   1992 log f
  1989 IDF          0.18         0.14         0.20         0.17
  1989 log var      0.13         0.11         0.14         0.12
  1989 H            0.17         0.15         0.20         0.16

In the case of the Katz K-mixture, φ(θ) is assumed to be (1 − α) δ(θ) + (α/β) e^(−θ/β), where δ(θ) is Dirac's delta function, a point mass at θ = 0.
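One can check numerically that mixing Poissons by this φ reproduces the closed form above. A sketch using scipy's quadrature (reusing pr_k from the previous sketch; parameter values again illustrative):

    import math
    from scipy.integrate import quad

    def pr_mixture(k, alpha, beta):
        # phi(theta) = (1 - alpha) delta(theta) + (alpha/beta) exp(-theta/beta).
        # The point mass at theta = 0 contributes only to k = 0; the exponential
        # part is integrated against the Poisson pi(theta, k) numerically.
        poisson = lambda th: math.exp(-th) * th**k / math.factorial(k)
        integrand = lambda th: (alpha / beta) * math.exp(-th / beta) * poisson(th)
        tail, _ = quad(integrand, 0.0, math.inf)
        return ((1.0 - alpha) if k == 0 else 0.0) + tail

    alpha, beta = 0.024, 0.49  # illustrative values, as before
    for k in range(4):
        print(k, f"{pr_mixture(k, alpha, beta):.6f}", f"{pr_k(k, alpha, beta):.6f}")
    # The two columns agree: the integral reproduces the closed form.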

The Katz K-mixture has two parameters, α and β. The α parameter determines the fraction of relevant and irrelevant documents: 1 − α of the documents have no chance of mentioning boycott (θ = 0) because they are totally irrelevant to boycotts. The β parameter determines the average θ among the relevant documents. The two parameters, α and β, can be fit from almost any pair of variables considered thus far, e.g., f, IDF, σ², H. We have found that f and IDF are particularly easy to work with, and are more robust than some others such as σ²:

    β = (f/D) · 2^IDF − 1
    α = (f/D) / β

(Since 2^IDF = D/df_w, the first equation simplifies to β = f/df_w − 1.)

It has been our experience that the Katz K-mixture fits the data much better than the Poisson, as can be seen in Figure 5. Unlike the Poisson, the K-mixture has two parameters, α and β, and can therefore account for the fact that IDF and f are not completely predictable from one another. In related work (Church and Gale, submitted), we looked at a number of different Poisson mixtures, and found that our data can also be fit by a negative binomial, which can be viewed as a Poisson mixture where φ_NB(θ) is a Gamma distribution (Johnson and Kotz, 1969). See Mosteller and Wallace (1964) for an example of how to use the negative binomial in a Bayesian discrimination task. It is straightforward to generalize the Mosteller and Wallace approach to use the Katz K-mixture or any other mixture of Poissons.
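To make the fit concrete, here is a sketch (an added illustration, using the boycott and somewhat counts quoted earlier) that derives α and β from f and IDF and then shows the fat tail the Poisson misses:

    import math

    D = 85_432

    def fit_k_mixture(f, df, D=D):
        # beta = (f/D) * 2^IDF - 1, which reduces to f/df - 1 since 2^IDF = D/df;
        # alpha = (f/D) / beta.
        idf = -math.log2(df / D)
        beta = (f / D) * 2 ** idf - 1.0
        alpha = (f / D) / beta
        return alpha, beta

    for word, f, df in [("boycott", 1009, 676), ("somewhat", 1013, 979)]:
        alpha, beta = fit_k_mixture(f, df)
        theta = f / D
        k2_mix = (alpha / (beta + 1)) * (beta / (beta + 1)) ** 2  # K-mixture Pr(2)
        k2_poi = math.exp(-theta) * theta ** 2 / 2                # Poisson Pr(2)
        print(f"{word}: alpha = {alpha:.4f}, beta = {beta:.3f}; "
              f"documents with k = 2: K-mixture {D * k2_mix:.0f} vs Poisson {D * k2_poi:.0f}")
    # -> boycott:  ~149 vs ~6; somewhat: ~32 vs ~6
    # Both tails are fatter than Poisson, but boycott's is far fatter.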

5. Conclusions

Documents are much more than just a bag of words. The Poisson distribution predicts that lightning is unlikely to strike twice in a single document: we shouldn't expect to see two or more instances of boycott in the same document (unless there is some sort of hidden dependency that goes beyond the Poisson). But when it rains, it pours. If a document is about boycotts, we shouldn't be surprised to find two boycotts, or even a half dozen, in a single document. The standard use of the Poisson in modeling the distribution of words and ngrams fails to fit the data except where there are almost no interesting hidden dependencies, as in the case of somewhat.

Why are the deviations from Poisson more salient for interesting words like boycott than for boring words like somewhat? Many applications such as information retrieval, text categorization, author identification and word-sense disambiguation attempt to discriminate documents on the basis of certain hidden variables such as topic, author, genre, style, etc. The more a keyword (or ngram) deviates from Poisson, the stronger the dependence on hidden variables, and the more useful the keyword (or ngram) is for discriminating documents on the basis of these hidden dependencies. Similar arguments apply in a host of other important applications, such as text compression and language modeling for speech recognition, where it is desirable for word and ngram probabilities to adapt appropriately to frequency changes due to various hidden dependencies.

We have used document frequency, df, a concept borrowed from Information Retrieval, to find deviations from Poisson behavior. Document frequency is similar to word frequency, but different in a subtle but crucial way. Although inverse document frequency (IDF) and log10 f are extremely highly correlated (ρ = −0.994), it would be a mistake to try to model one with a simple transform of the other. Figure 2 showed one such attempt, where f was transformed into a predicted IDF by introducing a Poisson assumption: ÎDF = −log2(1 − e^(−θ)), with θ = f_w/D. Unfortunately, the prediction errors were relatively large for the most important keywords, words with moderate frequencies such as Germans.

To get a better look at the subtle differences between document frequency and word frequency, we focused our attention on a set of 53 words that all had approximately the same word frequency in a corpus of 1989 AP stories. Table 1 showed that words with larger IDF tend to have more content. boycott, for example, is a better keyword than somewhat because it bunches up into a relatively small set of documents. Table 2 showed that variance and entropy can also be used as measures of content (at least among a set of words with more or less the same word frequency). A good keyword like boycott is farther from Poisson (chance) than a crummy keyword like somewhat by almost any sense of closeness that one might consider, e.g., IDF, variance, entropy. These crucial deviations from Poisson are robust. We showed in section 3 that deviations from Poisson in one year of the AP can be used to predict deviations in another year of the AP.

Acknowledgments

This work benefited considerably from extensive discussions with Slava Katz.

References

Church, K., and Gale, W. (submitted) Poisson Mixtures.
Johnson, N., and Kotz, S. (1969) Discrete Distributions, Houghton Mifflin, Boston.
Katz, S. (in preparation).
Mosteller, F., and Wallace, D. (1964) Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, Massachusetts.
Salton, G. (1989) Automatic Text Processing, Addison-Wesley.
Shannon, C. (1948) The Mathematical Theory of Communication, Bell System Technical Journal.
Sparck Jones, K. (1972) A Statistical Interpretation of Term Specificity and its Application in Retrieval, Journal of Documentation, 28:1, pp. 11-21.
van Rijsbergen, C. (1979) Information Retrieval, Second Edition, Butterworths, London.