Assessing the Consequences of Text Preprocessing Decisions

Size: px

Start display at page:

Download "Assessing the Consequences of Text Preprocessing Decisions"

Tobias Barton
6 years ago
Views:

1 Assessing the Consequences of Text Preprocessing Decisions Matthew J. Denny 1 Penn State University Arthur Spirling New York University October 15, Work supported by NSF Grant: DGE

2 Common Preprocessing Decisions P Punctuation Removal N Number Removal L Lowercasing S Stemming W Stopword Removal I Infrequent Term Removal 3 n-gram Inclusion 7 binary choices 2 7 = 128 specifications.

3 Supervised Learning

4 Unsupervised Learning

5 What Could Possibly Go Wrong?

6 Motivating Example UK Manifestos Corpus ( ) Labour, Liberal, Conservative Parties Wordfish Place documents in ideological space Process: 1. Select preprocessing specification 2. Run Wordfish

7 A-Priori Rankings Focus on 8 Manifestos. 1. Four general elections ( ) 2. Labour and Conservative parties Lab 1983 : longest suicide note in history, extremely left wing. Lab 1983 < Lab 1987 < Lab 1992 < Lab 1997 < Con 1992 < Con 1997 < Con 1987 < Con 1983

8 Wordfish Rankings Lab1983 Lab1987 Lab1992 Lab1997 Con1992 Con1997 Con1987 Con1983

9 Forking Paths 12 unique document rankings Substantially different conclusions. Specification Most Left Most Right P-N-S-W-I-3 Lab 1983 Cons 1983 N-S-W-3 Lab 1987 Cons 1987 N-L-3 Lab 1992 Cons 1987 N-L-S Lab 1983 Cons 1992

10 Another Example: Topic Models Senate Press Releases (Grimmer, 2010) Sample of 1,000 documents Senators. Note: no n-grams (computational cost). Procedure: 1. Find optimal number of topics for each specification (perplexity). 2. Run topic model (LDA)

11 Perplexity to Select Number of Topics Split data into train/test sets (80/20). Find minimum perplexity over num. topics. topics = {25, 50, 75, 100, 125, 150, 175, 200} 10-fold cross validation.

12 Optimal Number of Topics L S W P N S W W P W N S W P S W N L S W N L W P L W P N S L W L S P N W P L S 200 N W S W N S P N P S P N L W P L S W P N L S W N S P L N L P L P N L Optimal Number of Topics S I N L S N L I L S I P N L S N L S I P L S I P L S W I I L I L W I P S I P N S I L S W I P N L I P N L S I P N S W I N L S W I P N L S W I 50 P I W I N I N W I S W I P L I P W I P N I N S I P N W I N L W I P L W I P S W I N S W I P N L W I Number of Preprocessing Steps

13 Key Terms Example Select five key terms. How many topic top-terms are they in? iraq terror(ism) (al) qaeda insur(ance) stem (cell)

14 Key Terms in Topic Top-Terms

15 Key Terms: Average of 40 Initializations P N L S W I N L S W I P L S W I L S W I P N S W I N S W I P S W I S W I P N L W I N L W I P L W I L W I P N W I N W I P W I W I P N L S I N L S I P L S I L S I P N S I N S I P S I S I P N L I N L I P L I L I P N I N I P I I Iraq Terrorism Al Qaeda Insurance Stem Cell P N L S W N L S W P L S W L S W P N S W N S W P S W S W P N L W N L W P L W L W P N W N W P W W P N L S N L S P L S L S P N S N S P S S P N L N L P L L P N N P Iraq Terrorism Al Qaeda Insurance Stem Cell 0% <1% 1 2% 2 3% 3 4% 4 5% 5 6% 6 7% 7 8% 8 9% 9 10% 10%+

16 Forking Paths Different preprocessing different conclusions. Are we doomed?

17 Our Solution: pretext Assess consequences of preprocessing choices. Characterize a number of corpora. Easy to use R package!

18 Overview: Movements in Pairwise Document Distances No preprocessing as base case. Compare how pairwise document distances change with preprocessing. Measure how unusual these changes are.

19 Example With Three Documents Preprocessing Specification 1 2 Doc 1 Doc 2 6 Doc 1 Doc 3 Original DTM 4 1 Doc Doc 1 Doc 2 Doc Doc 1 Doc 3 Preprocessing Specification 2 2 Doc 2 Doc 3 Doc 4 1 Doc 2 Doc 1 1 Doc 3 6 Doc 2 Doc 3

20 Ranking Distance Changes Original DTM Preprocessing Specification 2 Doc 1 1 Doc 2 Doc 4 1 Doc 2 Doc 3 1 Doc 3 Doc 1 1 Doc 3 2 Doc 2 Doc 6 3 Doc 2 Doc 3 Original DTM Preproc. Spec. 2 Abs. Difference d(1,3) = 3 d(2, 3) = 2 d(1, 2) = 1 d(2,3) = 6 d(1,2) = 4 d(1, 3) = 1 d(1,3) = 2 d(2, 3) = 1 d(1, 2) = 1

21 Comparing Preprocessing Specifications Each specification will have a largest mover. Rank in other specifications (M 1,..., M 127 )? v M1 = (2 M2, 14 M3, 2 M4, 3 M5,..., 15 M127 ). Average of v Mi how unusual.

22 pretext Scores Consider top k largest moving doc pairs. Average across v Mi v Mi (k) Normalize by n(n 1) 2 (n = num docs) pretext score i = 2v M i (k) n(n 1)

23 Interpreting pretext Scores pretext scores range between 0 and 1. Lower score typical changes in document distances. Higher score atypical changes in document distances.

24 pretext Scores for Press Releases Preprocessing Combination pretext Score

25 Which Steps Matter? UK Manifestos State Of The Union Speeches Indian Treaties Top 100 Pairs Use NGrams Stemming Remove Stopwords Remove Punctuation Remove Numbers Remove Infrequent Terms Lowercase Regression Coefficient Death Row Statements Press Releases Top 100 Pairs Use NGrams Stemming Remove Stopwords Remove Punctuation Remove Numbers Remove Infrequent Terms Lowercase Regression Coefficient

26 Common Trends? (Danger!) Stopping, punctuation: highly variable. Stemming, numbers, lowercasing: not much effect. Including n-grams: potentially good. Infrequent terms: potentially bad.

27 Summary Preprocessing matters. Forking paths of inference. Our solution: pretext. General Advice: Some steps seem innocuous. Always check tell reader!

28 Happy Sloths Love R Packages! install.packages("pretext") ssrn.com/abstract= github.com/matthewjdenny/pretext

29 Wordfish and pretext pretext Score Wordfish Rank Use NGrams Stemming Remove Stopwords Remove Punctuation Remove Numbers Remove Infrequent Terms Lowercase Regression Coefficient

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Understanding Comments Submitted to FCC on Net Neutrality Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014 Abstract We aim to understand and summarize themes in the 1.65 million