Introduction to NLP. Text Classification


1 NLP

2 Introduction to NLP Text Classification

3 Classification Assigning documents to predefined categories: topics, languages, users. Given a set of classes C and an input x, determine its class in C. Hierarchical vs. flat; overlapping (soft) vs. non-overlapping (hard).

4 Classification Ideas: manual classification using rules, e.g., Columbia AND University → Education; Columbia AND South Carolina → Geography. Popular techniques: generative (k-NN, Naïve Bayes) vs. discriminative (SVM, regression). Generative methods model the joint probability p(x, y) and use Bayesian prediction to compute p(y|x); discriminative methods model p(y|x) directly.
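
As a concrete illustration of the rule-based approach, here is a minimal sketch in Python; the two rules and category names come from the slide above, while the function name and the fallback category are invented for the example.

```python
import re

# Hand-written keyword rules, in the spirit of the slide's examples.
def rule_classify(text: str) -> str:
    words = set(re.findall(r"[a-z]+", text.lower()))
    if {"columbia", "university"} <= words:          # Columbia AND University
        return "Education"
    if {"columbia", "south", "carolina"} <= words:   # Columbia AND South Carolina
        return "Geography"
    return "Other"                                   # fallback category (invented)

print(rule_classify("Columbia University admissions office"))   # Education
print(rule_classify("Weather in Columbia, South Carolina"))      # Geography
```

Such rules are precise but brittle, which is one reason the statistical techniques below are usually preferred.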

5 Representations for Document Classification (and Clustering) Typically vector-based. Words: cat, dog, etc. Features: document length, author name, etc. Each document is represented as a vector in an n-dimensional space; similar documents appear nearby in the vector space (distance measures are needed).
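
To make the vector-space picture concrete, here is a small sketch of a bag-of-words representation with cosine similarity as the distance measure; the vocabulary and toy documents are invented for the example.

```python
import math
from collections import Counter

# Represent a document as a term-count vector over a fixed vocabulary.
def bow_vector(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

# Cosine similarity: nearby documents in the vector space score close to 1.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["cat", "dog", "food", "stock", "market"]
d1 = bow_vector("cat food and dog food", vocab)
d2 = bow_vector("my dog chased a cat", vocab)
d3 = bow_vector("the stock market fell", vocab)
print(cosine(d1, d2), cosine(d1, d3))   # d1 is much closer to d2 than to d3
```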

6 Naïve Bayesian Classifiers Assuming statistical independence of the features (typically words or phrases):
P(C_j | d_1, ..., d_n) = P(C_j) P(d_1, ..., d_n | C_j) / P(d_1, ..., d_n) = P(C_j) ∏_i P(d_i | C_j) / P(d_1, ..., d_n)

7 Issues with Naïve Bayes Where do we get the values P(C_j) and P(d_i | C_j)? Use maximum likelihood estimation: for the priors, P(C_j) = N_j / N. The conditionals are based on a multinomial generator, and the MLE estimate is T_ji / Σ_i T_ji, where T_ji is the count of term i in documents of class j. Smoothing is needed (why?): Laplace smoothing gives (T_ji + 1) / Σ_i (T_ji + 1). Implementation: how to avoid floating-point underflow? (Sum log probabilities instead of multiplying.)
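
Pulling the last two slides together, here is a minimal multinomial Naïve Bayes sketch with Laplace smoothing that works in log space to avoid floating-point underflow; the toy training documents and class labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):                          # docs: list of (tokens, label) pairs
    class_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)       # term_counts[c][w] ~ T_cw
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocab)     # Laplace denominator
        log_cond[c] = {w: math.log((term_counts[c][w] + 1) / denom) for w in vocab}
        log_cond[c][None] = math.log(1 / denom)               # unseen-term estimate
    return log_priors, log_cond

def classify_nb(tokens, log_priors, log_cond):
    def score(c):                            # sum of logs instead of a product
        return log_priors[c] + sum(log_cond[c].get(w, log_cond[c][None]) for w in tokens)
    return max(log_priors, key=score)

docs = [("buy cheap meds now".split(), "spam"),
        ("cheap offer buy now".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("project meeting notes attached".split(), "ham")]
log_priors, log_cond = train_nb(docs)
print(classify_nb("cheap meds offer".split(), log_priors, log_cond))   # spam
```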

8 Spam Recognition Return-Path: X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" Reply-To: To: Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

9 SpamAssassin Examples:
body: Incorporates a tracing ID number
body: HTML and text parts are different
header: Date: is 3 to 6 hours before Received: date
body: HTML font size is huge
header: Attempt to obfuscate words in Subject:
header: Subject =~ /^urgent(?:[\s\w]*(dollar.{1,40} (?:alert response assistance proposal reply warning noti(?:ce fication greeting matter/i

10 Feature Selection: The χ² Test For a term t, build a 2×2 contingency table of counts k_ci, where c ∈ {0, 1} is the class C and i ∈ {0, 1} is the indicator I_t (whether t occurs in the document). Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0), where
P(C=0) = (k_00 + k_01) / n, P(C=1) = 1 - P(C=0) = (k_10 + k_11) / n
P(I_t=0) = (k_00 + k_10) / n, P(I_t=1) = 1 - P(I_t=0) = (k_01 + k_11) / n

11 Feature Selection: The χ² Test
χ² = n (k_11 k_00 - k_10 k_01)² / [(k_11 + k_10)(k_01 + k_00)(k_11 + k_01)(k_10 + k_00)]
High values of χ² indicate lower belief in independence. In practice, compute χ² for all words and pick the top k among them.
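
A small sketch of this selection step, assuming the per-term contingency counts k_ci are already available; the counts below are invented toy numbers.

```python
# chi-square score for one term from its 2x2 contingency counts,
# using the formula above (first index = class, second = term presence).
def chi_square(k11, k10, k01, k00):
    n = k11 + k10 + k01 + k00
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den if den else 0.0

# (k11, k10, k01, k00) per term: invented counts for illustration.
term_tables = {"grain": (30, 5, 20, 945),     # strongly associated with the class
               "the":   (40, 460, 40, 460)}   # independent of the class
ranked = sorted(term_tables, key=lambda t: chi_square(*term_tables[t]), reverse=True)
print([(t, round(chi_square(*term_tables[t]), 1)) for t in ranked])
```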

12 Feature Selection: Mutual Information No document-length scaling is needed; documents are assumed to be generated according to the multinomial model. Measures the amount of information: if the distribution is the same as the background distribution, then MI = 0. With X = word and Y = class:
MI(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]
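
A sketch of the MI computation for one candidate feature, assuming the joint distribution P(x, y) over term presence and class has already been estimated from counts; the two joint tables are invented extremes (a perfectly informative word vs. an independent one).

```python
import math

def mutual_information(joint):               # joint[(x, y)] = P(x, y)
    px, py = {}, {}
    for (x, y), p in joint.items():          # marginals P(x) and P(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# x = 1 if the word occurs in the document, y = class label.
aligned = {(1, "pos"): 0.5, (0, "neg"): 0.5, (1, "neg"): 0.0, (0, "pos"): 0.0}
indep   = {(1, "pos"): 0.25, (0, "neg"): 0.25, (1, "neg"): 0.25, (0, "pos"): 0.25}
print(mutual_information(aligned), mutual_information(indep))   # ~0.693 and 0.0
```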

13 Well-known Datasets 20 Newsgroups. Reuters (reuters21578/): categories include grain, acquisitions, corn, crude, wheat, trade. WebKB: course, student, faculty, staff, project, dept, other. RCV1: a larger Reuters corpus.

14 Evaluation of Text Classification Macroaveraging: average the per-class measures over classes. Microaveraging: compute the measure from a single pooled contingency table, so large classes dominate.
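
A small sketch of the two averaging schemes for F1, assuming per-class (tp, fp, fn) counts are available from the test set; the counts below are invented.

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

per_class = {"grain": (80, 10, 10),    # (tp, fp, fn) for a frequent class
             "crude": (5, 1, 14)}      # and for a rare class

# Macroaveraging: average the per-class F1 scores.
macro_f1 = sum(f1(*c) for c in per_class.values()) / len(per_class)
# Microaveraging: pool all counts into one contingency table first.
tp, fp, fn = (sum(c[i] for c in per_class.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)
print(round(macro_f1, 3), round(micro_f1, 3))   # micro is dominated by "grain"
```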

15 Vector Space Classification [figure: documents from topic1 and topic2 plotted in a two-dimensional vector space with axes x1 and x2]

16 Decision Surfaces [figure: the same two topics in the (x1, x2) plane with a decision surface separating them]

17 Decision Trees [figure: decision-tree boundaries separating topic1 and topic2 in the (x1, x2) plane]

18 Linear Boundary [figure: a single linear boundary separating topic1 and topic2 in the (x1, x2) plane]

19 Vector Space Classifiers Using centroids: the decision boundary is the line that is equidistant from the two class centroids.
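
A minimal centroid-classifier sketch along these lines: compute one centroid per class and assign a new document to the class whose centroid is nearest; the training vectors and class names are invented.

```python
import math

def centroid(vectors):
    return [sum(coord) / len(vectors) for coord in zip(*vectors)]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

train = {"topic1": [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3]],
         "topic2": [[0.1, 0.9], [0.2, 1.0], [0.3, 0.8]]}
centroids = {c: centroid(vs) for c, vs in train.items()}

def classify(x):                      # nearest centroid = same side of the boundary
    return min(centroids, key=lambda c: distance(x, centroids[c]))

print(classify([0.7, 0.4]))           # topic1
```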

20 Linear Separators In two dimensions, the line w_1 x_1 + w_2 x_2 = b is the linear separator, with w_1 x_1 + w_2 x_2 > b for the positive class. In n-dimensional spaces: w^T x = b.

21 Decision Boundary [figure: a linear decision boundary with weight (normal) vector w separating topic1 and topic2 in the (x1, x2) plane]

22 Example Bias b = 0. The document contains features A, D, E, and H, so its score is 0.6*1 + 0.4*1 + 0.4*1 + (-0.5)*1 = 0.9 > 0.
w_i   x_i      w_i   x_i
0.6   A       -0.7   G
0.5   B       -0.5   H
0.5   C       -0.3   I
0.4   D       -0.2   J
0.4   E       -0.2   K
                     L
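
The same computation in a few lines of code; the weights are the ones in the table above (F and L are omitted because their weights are not given there).

```python
weights = {"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.4, "E": 0.4,
           "G": -0.7, "H": -0.5, "I": -0.3, "J": -0.2, "K": -0.2}
b = 0.0
doc = {"A", "D", "E", "H"}
score = sum(weights[f] for f in doc)
print(round(score, 2), "positive" if score > b else "negative")   # 0.9 positive
```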

23 Perceptron Algorithm
Input: training set S = ((x_1, y_1), ..., (x_N, y_N)) with labels y_i ∈ {-1, +1}, learning rate η ∈ R
Algorithm:
w ← 0
REPEAT
  FOR i = 1 TO N
    IF y_i (w · x_i) ≤ 0 THEN w ← w + η y_i x_i
  END FOR
END REPEAT (until no mistakes are made in a full pass)
Output: w
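
A minimal Python version of this pseudocode, with invented 2-D toy data and labels in {-1, +1}; it stops once a full pass over the data makes no mistakes.

```python
def perceptron(samples, eta=1.0, max_epochs=100):
    w = [0.0] * len(samples[0][0])                  # w <- 0
    for _ in range(max_epochs):                     # REPEAT
        mistakes = 0
        for x, y in samples:                        # FOR i = 1 TO N
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]   # w <- w + eta*y*x
                mistakes += 1
        if mistakes == 0:                           # until no mistakes
            break
    return w

S = [([2.0, 1.0], +1), ([1.5, 2.0], +1), ([-1.0, -0.5], -1), ([-2.0, -1.5], -1)]
w = perceptron(S)
print(w, [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x, _ in S])
```

Only misclassified examples trigger an update, which is what makes the algorithm converge on linearly separable data.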

24 [Example from Chris Bishop]

25 Generative Models: kNN Assign each element to the class of its nearest neighbors. K-nearest-neighbor scoring:
score(c, d_q) = b_c + Σ_{d ∈ kNN(d_q), d ∈ c} s(d_q, d)
Very easy to program. Issues: choosing k and b. Demo:
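
A small kNN sketch in the spirit of this scoring rule, with the class bias b_c set to 0 and cosine similarity used as s(d_q, d); the labeled vectors are invented.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, labeled, k=3):
    # k nearest neighbors of the query, by similarity
    neighbors = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    scores = {}
    for vec, label in neighbors:                  # sum similarities per class
        scores[label] = scores.get(label, 0.0) + cosine(query, vec)
    return max(scores, key=scores.get)

labeled = [([1.0, 0.1], "topic1"), ([0.9, 0.2], "topic1"),
           ([0.1, 1.0], "topic2"), ([0.2, 0.8], "topic2")]
print(knn_classify([0.8, 0.3], labeled, k=3))     # topic1
```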

26 NLP
