Introduction to NLP. Text Classification


Classification
- Assigning documents to predefined categories: topics, languages, users
- Given a set of classes C and an input x, determine the class of x in C
- Hierarchical vs. flat
- Overlapping (soft) vs. non-overlapping (hard)

Classification Ideas
- Manual classification using rules, e.g., Columbia AND University → Education; Columbia AND South Carolina → Geography
- Popular techniques: generative (k-NN, Naïve Bayes) vs. discriminative (SVM, regression)
- Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
- Discriminative: model p(y|x) directly

Representations for Document Classification (and Clustering)
- Typically vector-based
- Words: cat, dog, etc.
- Features: document length, author name, etc.
- Each document is represented as a vector in an n-dimensional space
- Similar documents appear nearby in the vector space (distance measures are needed)

Naïve Bayesian Classifiers
- Assume statistical independence of the features
- Features = words (or phrases), typically
- $P(d_1, d_2, \ldots, d_n \mid C) = \prod_j P(d_j \mid C)$
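A minimal sketch of this factorization in Python (function and dictionary names are illustrative, not from the slides); working in log space anticipates the underflow issue raised on the next slide:

```python
import math

def naive_bayes_log_score(words, cls, prior, cond):
    """log P(C) + sum_j log P(d_j | C), under the independence assumption.
    Assumes `cond` has a (smoothed) probability for every (word, class) pair."""
    score = math.log(prior[cls])
    for w in words:
        score += math.log(cond[(w, cls)])
    return score
```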

Issues with Naïve Bayes
- Where do we get the values $P(C_j)$? Use maximum likelihood estimation: $N_j / N$
- Same for the conditionals: these are based on a multinomial generator, and the MLE estimator is $P(w_i \mid C_j) = T_{ji} / \sum_{i'} T_{ji'}$
- Smoothing is needed (why?). Laplace smoothing: $(T_{ji} + 1) / \sum_{i'} (T_{ji'} + 1)$
- Implementation: how to avoid floating-point underflow? (sum log-probabilities, as in the sketch above)
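A minimal training sketch under these estimators (function and variable names are assumptions, not from the slides); scoring then sums the logs of these probabilities, as in the previous sketch:

```python
from collections import Counter

def train_naive_bayes(docs, labels, vocab):
    """Priors: P(C_j) = N_j / N.
    Conditionals with Laplace smoothing:
    P(w | C_j) = (T_jw + 1) / (sum_w' T_jw' + |V|)."""
    n = len(docs)
    prior, cond = {}, {}
    for c in set(labels):
        in_class = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(in_class) / n
        counts = Counter(w for d in in_class for w in d)
        total = sum(counts.values())
        for w in vocab:
            cond[(w, c)] = (counts[w] + 1) / (total + len(vocab))
    return prior, cond
```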

Spam Recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Subject: Gooday

DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

SpamAssassin
http://spamassassin.apache.org/
http://spamassassin.apache.org/tests_3_3_x.html
Examples:
- body: Incorporates a tracing ID number
- body: HTML and text parts are different
- header: Date: is 3 to 6 hours before Received: date
- body: HTML font size is huge
- header: Attempt to obfuscate words in Subject:
- header: Subject =~ /^urgent(?:[\s\w]*(?:dollar|.{1,40}(?:alert|response|assistance|proposal|reply|warning|noti(?:ce|fication)|greeting|matter)))/i

Feature Selection: The $\chi^2$ Test
For a term $t$, build a 2x2 contingency table of counts $k_{ci}$, where $C$ is the class and $I_t$ indicates the presence of the term:

           I_t = 0   I_t = 1
  C = 0    k_00      k_01
  C = 1    k_10      k_11

Testing for independence: $P(C=0, I_t=0)$ should be equal to $P(C=0) \, P(I_t=0)$
- $P(C=0) = (k_{00} + k_{01})/n$ and $P(C=1) = 1 - P(C=0) = (k_{10} + k_{11})/n$
- $P(I_t=0) = (k_{00} + k_{10})/n$ and $P(I_t=1) = 1 - P(I_t=0) = (k_{01} + k_{11})/n$

Feature Selection: The $\chi^2$ Test
$$\chi^2 = \frac{n \, (k_{11} k_{00} - k_{10} k_{01})^2}{(k_{11} + k_{10})(k_{01} + k_{00})(k_{11} + k_{01})(k_{10} + k_{00})}$$
High values of $\chi^2$ indicate lower belief in independence. In practice, compute $\chi^2$ for all words and pick the top among them.
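A direct transcription of the statistic (the function name is hypothetical):

```python
def chi_square(k00, k01, k10, k11):
    """X^2 for the 2x2 term/class contingency table above;
    high values argue against independence of term and class."""
    n = k00 + k01 + k10 + k11
    numerator = n * (k11 * k00 - k10 * k01) ** 2
    denominator = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return numerator / denominator
```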

Feature Selection: Mutual Information
- No document length scaling is needed
- Documents are assumed to be generated according to the multinomial model
- Measures the amount of information: if a term's distribution is the same as the background distribution, then MI = 0
- With X = word and Y = class:
$$MI(X, Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x) \, P(y)}$$
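A small sketch of the sum, assuming the joint distribution is given as a dictionary of (word, class) probabilities (all names are illustrative):

```python
import math

def mutual_information(joint):
    """MI(X, Y) for a dict mapping (x, y) pairs to joint probabilities."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```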

Well-Known Datasets
- 20 Newsgroups: http://qwone.com/~jason/20newsgroups/
- Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
  Categories: grain, acquisitions, corn, crude, wheat, trade
- WebKB: http://www-2.cs.cmu.edu/~webkb/
  Categories: course, student, faculty, staff, project, dept, other
- RCV1: http://www.daviddlewis.com/resources/testcollections/rcv1/
  A larger Reuters corpus

Evaluation of Text Classification
- Macroaveraging: average the per-class scores over the classes
- Microaveraging: compute one score from the pooled contingency table
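A sketch of the two averages over per-class contingency counts, here using F1 as the per-class score (the choice of F1 and all names are assumptions, not from the slides):

```python
def macro_micro_f1(tables):
    """tables: one (tp, fp, fn) tuple per class.
    Macroaveraging averages per-class F1; microaveraging pools the counts."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
    macro = sum(f1(*t) for t in tables) / len(tables)
    micro = f1(*(sum(col) for col in zip(*tables)))
    return macro, micro
```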

Vector Space Classification
[Figure: documents from topic1 and topic2 plotted in a two-dimensional vector space with axes x1 and x2]

Decision Surfaces
[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]

Decision Trees
[Figure: axis-parallel decision-tree boundaries separating topic1 from topic2]

Linear Boundary
[Figure: a linear boundary separating topic1 from topic2]

Vector Space Classifiers
- Using centroids
- The boundary is the line that is equidistant from the two centroids

Linear Separators
- In two dimensions, the line $w_1 x_1 + w_2 x_2 = b$ is the linear separator, with $w_1 x_1 + w_2 x_2 > b$ for the positive class
- In n-dimensional spaces: $\vec{w}^{\,T} \vec{x} = b$

Decision Boundary
[Figure: the weight vector w, normal to the linear decision boundary between topic1 and topic2]

Example
With bias $b = 0$, a document containing the terms A, D, E, and H gets score $0.6 \cdot 1 + 0.4 \cdot 1 + 0.4 \cdot 1 + (-0.5) \cdot 1 = 0.9 > 0$.

  term  w_i     term  w_i
  A      0.6    G    -0.7
  B      0.5    H    -0.5
  C      0.5    I    -0.3
  D      0.4    J    -0.2
  E      0.4    K    -0.2
  F      0.3    L    -0.2
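The same computation as a sketch (the `linear_score` helper is hypothetical):

```python
weights = {"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.4, "E": 0.4, "F": 0.3,
           "G": -0.7, "H": -0.5, "I": -0.3, "J": -0.2, "K": -0.2, "L": -0.2}

def linear_score(terms, w, b=0.0):
    """w^T x - b for a binary term-presence vector x."""
    return sum(w.get(t, 0.0) for t in terms) - b

print(linear_score({"A", "D", "E", "H"}, weights))  # 0.6 + 0.4 + 0.4 - 0.5 = 0.9 > 0
```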

Perceptron Algorithm
Input: training set $S = ((\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n))$ with $y_i \in \{-1, 1\}$, learning rate $\eta \in \mathbb{R}$
Algorithm:
  $\vec{w} = \vec{0}$
  repeat until no mistakes are made:
    for i = 1 to n:
      if $y_i \, (\vec{w} \cdot \vec{x}_i) \le 0$ then $\vec{w} \leftarrow \vec{w} + \eta \, y_i \vec{x}_i$
Output: $\vec{w}$
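A compact Python version of the algorithm above (the names and the epoch cap are assumptions):

```python
def perceptron(samples, eta=1.0, epochs=100):
    """samples: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    On each mistake (y * (w . x) <= 0), update w <- w + eta * y * x."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged (data are linearly separable)
            break
    return w
```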

[Example from Chris Bishop]

Generative Models: kNN
- k-nearest neighbors: assign each element to the class of its closest training points
- $\mathrm{score}(c, d_q) = b_c + \sum_{d \in \mathrm{kNN}(d_q),\ \mathrm{class}(d) = c} s(d_q, d)$
- Very easy to program
- Issues: choosing k and b?
- Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/knn.html
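A sketch of this scoring rule, assuming `train` is a list of (document, label) pairs and `sim` is a similarity function such as cosine (all names are illustrative):

```python
def knn_score(c, d_q, train, sim, k=5, b_c=0.0):
    """score(c, d_q) = b_c + sum of s(d_q, d) over the k nearest
    neighbors d of d_q that belong to class c."""
    nearest = sorted(train, key=lambda pair: sim(d_q, pair[0]), reverse=True)[:k]
    return b_c + sum(sim(d_q, d) for d, label in nearest if label == c)
```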
