ML4NLP Introduction to Classification


1 ML4NLP Introduction to Classification CS 590NLP Dan Goldwasser Purdue University

2 Statistical Language Modeling Intuition: by looking at large quantities of text we can find statistical regularities. Distinguish between correct and incorrect sentences. Language models define a probability distribution over strings (e.g., sentences) in a language. We can use a language model to score and rank sentences: I don't know {whether, weather} to laugh or cry. P( I don't .. whether to laugh .. ) > P( I don't .. weather to laugh .. )

3 Language Modeling with N-grams
Unigram model: P(w_1) P(w_2) ... P(w_n)
Bigram model: P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})
Trigram model: P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-2}, w_{n-1})
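
To make the n-gram factorization concrete, here is a minimal bigram scoring sketch in Python (not from the lecture; the toy corpus and naive tokenization are my additions). It estimates P(w_i | w_{i-1}) by counting and uses the bigram product above to score the whether/weather pair:

```python
from collections import Counter

# A minimal bigram language model over a made-up toy corpus.
corpus = ("i don t know whether to laugh or cry . "
          "i know the weather is bad .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def score(sentence):
    # P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})
    words = sentence.split()
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(score("know whether to laugh"))  # > 0: "know whether" occurs in the corpus
print(score("know weather to laugh"))  # 0.0: "know weather" never occurs
```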

4 Evaluating Language Models Assuming that we have a language model, how can we tell if it is good? Option 1: try to generate Shakespeare.. This is known as qualitative evaluation. Option 2: Quantitative evaluation. Option 2.1: See how well you do on spelling correction. This is known as extrinsic evaluation. Option 2.2: Find an independent measure for LM quality. This is known as intrinsic evaluation.

5 When are LMs applicable? Finding regularity in language is surprisingly useful! Easy example: weather/whether. But also: Translation (can you produce legal French from source English?); Caption generation (combine output of visual sensors into a grammatical sentence). Deep Visual-Semantic Alignments for Generating Image Descriptions

6 Classification A fundamental machine learning tool, widely applicable in NLP. Supervised learning: Learner is given a collection of labeled documents. Emails: spam/not spam; Reviews: pos/neg. Build a function mapping documents to labels. Key property: Generalization; the function should work well on new data.

7 Sentiment Analysis Dude, I just watched this horror flick! Selling points: nightmare scenes, torture scenes, terrible monsters, that was so bad a##! Don't buy the popcorn, it was terrible; the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares! What should your learning algorithm look at?

8 Deceptive Reviews What should your learning algorithm look at? Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011

9 Power Relations Blah blah Unacceptable blah. Your honor, I agree blah blah blah. What should your learning algorithm look at? Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

10 Power Relations Communicative behaviors are patterned and coordinated, like a dance [Niederhoffer and Pennebaker 2002]. Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

11 Classification We assume we have a labeled dataset. How can we build a classifier? Decide on a representation and a learning algorithm. Essentially: function approximation. Representation: what is the domain of the function? Learning: how to find a good approximation? We will look into several simple examples: Naïve Bayes, Perceptron. Let's start with some definitions..

12 Basic Definitions Given: D, a set of labeled examples {<x_i, y_i>}. Goal: Learn a function f(x) s.t. f(x_i) ≈ y_i. Note: y can be binary or categorical. Typically the input x is represented as a vector of features. Break D into three parts: Training set (used by the learning algorithm), Test set (evaluate the learned model), Development set (tuning the learning algorithm). Evaluation: performance measure over the test set. Accuracy: proportion of correct predictions (test data).
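
As an illustration of these definitions, a minimal sketch of the train/dev/test protocol and the accuracy measure (hypothetical data and a trivial majority-class baseline; not from the slides):

```python
import random

# Hypothetical labeled data: (document, label) pairs.
data = [("doc%d" % i, i % 2) for i in range(100)]
random.seed(0)
random.shuffle(data)

train = data[:70]    # used by the learning algorithm
dev = data[70:85]    # used for tuning the learning algorithm
test = data[85:]     # used only for the final evaluation

def accuracy(predict, examples):
    # Accuracy: proportion of correct predictions.
    return sum(predict(x) == y for x, y in examples) / len(examples)

# Trivial majority-class baseline, "trained" (i.e., counted) on the training set:
train_labels = [y for _, y in train]
majority = max(set(train_labels), key=train_labels.count)
print(accuracy(lambda x: majority, test))
```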

13 Precision and Recall Given a dataset, we train a classifier that gets 99% accuracy. Did we do a good job? Build a classifier for brain tumors: 99.9% of brain scans do not show signs of tumor. Did we do a good job? By simply saying NO to all examples we reduce the error by a factor of 10! Clearly accuracy is not the best way to evaluate the learning system when the data is heavily skewed! Intuition: we need a measure that captures the (rare) class we care about!

14 Precision and Recall The learner can make two kinds of mistakes: False Positive and False Negative.

                 Predicted: 1      Predicted: 0
True label: 1    True Positive     False Negative
True label: 0    False Positive    True Negative

Precision: when we predicted the rare class, how often are we right? Precision = True Pos / (True Pos + False Pos) = True Pos / Predicted Pos.
Recall: out of all the instances of the rare class, how many did we catch? Recall = True Pos / (True Pos + False Neg) = True Pos / Actual Pos.
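
A small sketch computing precision and recall from the confusion counts defined above (the gold/pred label sequences are made up for illustration):

```python
# 1 marks the rare class we care about.
gold = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))  # true positives
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # false positives
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # false negatives

precision = tp / (tp + fp)  # of our positive predictions, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many did we catch
print(precision, recall)    # 2/3 and 2/3 here
```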

15 F-Score Precision and Recall give us two reference points to compare learning performance. Which algorithm is better? We need a single score. Option 1: Average = (P + R) / 2. Option 2: F-Score = 2PR / (P + R). Properties of F-score: ranges between 0-1; prefers precision and recall with similar values.
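
A quick numeric illustration of why the F-score is preferred over the plain average: all three (P, R) pairs below average to 0.50, but F drops sharply as the two values diverge.

```python
def f_score(p, r):
    # Harmonic-style combination: F = 2PR / (P + R)
    return 2 * p * r / (p + r) if p + r else 0.0

for p, r in [(0.5, 0.5), (0.9, 0.1), (0.99, 0.01)]:
    avg = (p + r) / 2
    print("P=%.2f R=%.2f  average=%.2f  F=%.2f" % (p, r, avg, f_score(p, r)))
# P=0.50 R=0.50  average=0.50  F=0.50
# P=0.90 R=0.10  average=0.50  F=0.18
# P=0.99 R=0.01  average=0.50  F=0.02
```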

16 Simple Example: Naïve Bayes Naïve Bayes: simple probabilistic classifier. Given a set of labeled data: documents D, each associated with a label v. Simple feature representation: BoW. Learning: construct a probability distribution P(v | d). Prediction: assign the label with the highest probability. Relies on strong simplifying assumptions.

17 Simple Representation: BoW Basic idea: (sentiment analysis) I loved this movie, it is awesome! I couldn't stop laughing for two hours! Mapping input to label can be done by representing the frequencies of individual words: document → word counts. Simple, yet surprisingly powerful representation!
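
A minimal BoW sketch (naive whitespace tokenization, purely illustrative):

```python
from collections import Counter

def bow(document):
    # Map a document to its word counts, discarding word order.
    return Counter(document.lower().split())

print(bow("I loved this movie it is awesome! I couldn't stop laughing!"))
# Counter({'i': 2, 'loved': 1, 'this': 1, 'movie': 1, ...})
```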

18 Bayes Rule Naïve Bayes is a simple probabilistic classification method, based on Bayes rule: P(v | d) = P(d | v) P(v) / P(d)

19 Bascs of Naïve Bayes P(v) - the pror probablty of a label v Reflects background knowledge; before data s observed. If no nformaton - unform dstrbuton. P(D) - The probablty that ths sample of the Data s observed. (No knowledge of the label) P(D v): The probablty of observng the sample D, gven that the label v s the target (Lkelhood) P(v D): The posteror probablty of v. The probablty that v s the target, gven that D has been observed. 9

20 Bayes Rule Naïve Bayes is a simple classification method, based on Bayes rule: P(v | d) = P(d | v) P(v) / P(d). Check your intuition: P(v | d) increases with P(v) and with P(d | v); P(v | d) decreases with P(d).
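
A toy numeric check of this intuition, with made-up spam-filtering numbers:

```python
# Bayes rule P(v|d) = P(d|v) P(v) / P(d); v = "spam", d = "contains 'free'".
p_v = 0.2             # prior: P(spam)
p_d_given_v = 0.6     # likelihood: P('free' | spam)
p_d_given_not = 0.05  # P('free' | not spam)

# P(d) by total probability over the two labels:
p_d = p_d_given_v * p_v + p_d_given_not * (1 - p_v)

posterior = p_d_given_v * p_v / p_d
print(posterior)  # 0.75: raising P(v) or P(d|v) raises it; raising P(d) lowers it
```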

21 Naïve Bayes P(v D) P(D v) P(v)/P(D) The learner consders a set of canddate labels, and attempts to fnd the most probable one v V, gven the observed data. Such maxmally probable assgnment s called maxmum a posteror assgnment (MAP); Bayes theorem s used to compute t: v MAP argmax v V P(v D) argmax v v P(D v) P(v)/P(D) argmax v V P(D v) P(v) Snce P(D) s the same for all v V

22 Naïve Bayes How can we compute P(v D)? Basc dea: represent document as a set of features, such as BoW features v MAP argmax v VP(v x) argmax v V P(v x,x 2,...,x n ) P(x v MAP argmax,x 2,...,x n v )P(v ) v V P(x,x 2,...,x n ) argmax v VP(x,x 2,...,x n v )P(v )

23 NB: Parameter Estimation v_MAP = argmax_v P(x_1, x_2, ..., x_n | v) P(v). Given training data we can estimate the two terms. Estimating P(v) is easy: for each value v, count how many times it appears in the training data. Question: assume binary x_i's. How many parameters does the model require? However, it is not feasible to estimate P(x_1, ..., x_n | v): in this case we have to estimate, for each target value, the probability of each instance (most of which will not occur). In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

24 NB: Independence Assumption Bag of words representation: word position can be ignored. Conditional independence: assume feature probabilities are independent given the label, P(x_i | x_1, ..., x_{i-1}; v) = P(x_i | v). Both assumptions are not true, but they help simplify the model, and simple models work well.

25 Naive Bayes v_MAP = argmax_v P(x_1, x_2, ..., x_n | v) P(v)
P(x_1, x_2, ..., x_n | v) = P(x_1 | x_2, ..., x_n, v) P(x_2, ..., x_n | v)
= P(x_1 | x_2, ..., x_n, v) P(x_2 | x_3, ..., x_n, v) P(x_3, ..., x_n | v)
= ...
= P(x_1 | x_2, ..., x_n, v) P(x_2 | x_3, ..., x_n, v) P(x_3 | x_4, ..., x_n, v) ... P(x_n | v)
Assumption: feature values are independent given the target value: P(x_1, ..., x_n | v) = ∏_{i=1}^n P(x_i | v)

26 Estimating Probabilities (MLE) Assume a document classification problem, using word features:
v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(word_i | v)
How do we estimate P(word_k | v)? MLE: P(word_k | v) = n_k / n, where n_k = #(occurrences of word_k in training documents labeled v) and n = #(word occurrences in documents labeled v).
Sparsity of data is a problem: if n is small, the estimate is not accurate; if n_k is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data.
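
A sketch of the MLE estimate and the zero-count problem it creates (toy documents, illustrative only):

```python
from collections import Counter

# Training documents labeled "like" (made up for illustration).
like_docs = ["great movie loved it", "loved the acting great fun"]
tokens = " ".join(like_docs).split()
counts = Counter(tokens)
n = len(tokens)

def p_mle(word):
    # P(word | like) = n_k / n
    return counts[word] / n

print(p_mle("loved"))    # 2/9: seen, a reasonable estimate
print(p_mle("awesome"))  # 0.0: never seen with this label, so any test document
                         # containing "awesome" gets probability 0 for "like"
```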

27 Robust Estimation of Probabilities v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(x_i | v)
This process is called smoothing. There are many ways to do it, some better justified than others; an empirical issue. Here:
P(x_k | v) = (n_k + m p) / (n + m)
n_k is #(occurrences of the word in the presence of v), n is #(occurrences of the label v), p is a prior estimate of P(x_k | v) (e.g., uniform), m is the equivalent sample size (# of labels). Laplace rule: for the Boolean case, p = 1/2, m = 2:
P(x_k | v) = (n_k + 1) / (n + 2)
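
The same toy estimate with smoothing applied; the general (n_k + mp)/(n + m) form reduces to the Laplace rule for p = 1/2, m = 2:

```python
from collections import Counter

tokens = "great movie loved it loved the acting great fun".split()
counts = Counter(tokens)
n = len(tokens)

def p_smoothed(word, m=2, p=0.5):
    # m: equivalent sample size; p: prior estimate. Defaults give the Laplace rule.
    return (counts[word] + m * p) / (n + m)

print(p_smoothed("loved"))    # (2 + 1) / (9 + 2)
print(p_smoothed("awesome"))  # (0 + 1) / (9 + 2): unseen words no longer zero out
```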

28 Naïve Bayes Very easy to mplement Converges very quckly Learnng s ust countng Performs well n practce Appled to many document classfcaton tasks If data set s small, NB can perform better than sophstcated algorthms Strong ndependence assumptons If assumptons hold: NB s the optmal classfer Even f not, can perform well Next: from NB to learnng lnear threshold functons

29 Naïve Bayes: Two Classes Notce that the naïve Bayes method gves a method for predctng rather than an explct classfer In the case of two classes, v {0,} we predct that v ff: P(v P(v ) 0) n P(x n P(x v v ) 0) > 29

30 Naïve Bayes: Two Classes Notce that the naïve Bayes method gves a method for predctng rather than an explct classfer. In the case of two classes, v {0,} we predct that v ff: P(v P(v ) 0) n P(x n P(x v v ) 0) > Denote: p P(x v ), q P(x v 0) P(v ) P(v 0) n n p x (- p ) -x q x (- q ) -x > 30

31 Naïve Bayes: Two Classes In the case of two classes, v {0,} we predct that v ff: P(v P(v ) 0) n n p q x x (- q (- p (- p ) ) -x -x P(v P(v ) 0) n n (- q p )( - p q )( - q ) ) x x > 3

32 Naïve Bayes: Two Classes In the case of two classes, v {0,} we predct that v ff: P(v P(v ) 0) n n p q x x (- q (- p (- p ) ) -x -x P(v P(v ) 0) n n (- q p )( - p q )( - q ) ) x x > Take logarthm; we predct v ff : log P(v P(v ) 0) + log - p - q + p (log - p log q - q )x > 0 32

33 Naïve Bayes: Two Classes In the case of two classes, v {0,} we predct that v ff: P(v P(v ) 0) n n p q x x (- q (- p (- p ) ) -x -x P(v P(v ) 0) n n (- q p )( - p q )( - q ) ) x x > Take logarthm; we predct v ff : log P(v P(v ) 0) + log - p - q + p (log - p log q - q )x > 0 We get that nave Bayes s a lnear separator wth : w log p log q log p - q - p - q q - p f p q then w 0 and the feature s rrelevant Introducton to Machne Learnng. Fall

34 Linear Classifiers Linear threshold functions: associate a weight (w_i) with each feature (x_i). Prediction: sign(b + w^T x) = sign(b + Σ_i w_i x_i). If b + w^T x ≥ 0, predict y = 1; otherwise, predict y = -1. NB is a linear threshold function: its weight vector (w) is assigned by computing conditional probabilities. In fact, linear threshold functions are a very popular representation!

35 Linear Classifiers sign(b + w^T x). Each point in this space is a document; the coordinates (e.g., x_1, x_2) are determined by feature activations.

36 Expressivity Linear functions are quite expressive: there exists a linear function that is consistent with the data. A famous negative example (XOR):

37 Expressivity By transforming the feature space these functions can be made linear. Represent each point in 2D as (x, x²).

38 Expressivity sign(b + w^T x). More realistic scenario: the data is almost linearly separable, except for some noise.

39 Features So far we have discussed the BoW representation. In fact, you can use a very rich representation. Broader definition: functions mapping attributes of the input to a Boolean/categorical/numeric value.
φ_i(x) = 1 if x is capitalized, 0 otherwise
φ_k(x) = 1 if x contains ''good'' more than twice, 0 otherwise
Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
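
A sketch of such feature functions, including one possible answer to the lexicon question: count matches against (hypothetical) positive and negative word lists instead of keeping every word as its own feature.

```python
# Hypothetical sentiment lexicons (assumed word lists, not from the lecture).
POSITIVE = {"good", "great", "awesome", "loved"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}

def features(doc):
    words = doc.lower().split()
    return {
        "is_capitalized": int(doc[:1].isupper()),
        "good_more_than_twice": int(words.count("good") > 2),
        # Lexicon counts collapse many rare words into two robust features:
        "num_positive_words": sum(w in POSITIVE for w in words),
        "num_negative_words": sum(w in NEGATIVE for w in words),
    }

print(features("Good acting but a terrible boring plot"))
# {'is_capitalized': 1, 'good_more_than_twice': 0,
#  'num_positive_words': 1, 'num_negative_words': 2}
```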

40 Perceptron One of the earliest learning algorithms. Introduced by Rosenblatt in 1958 to model neural learning. Goal: directly search for a separating hyperplane. If one exists, perceptron will find it; if not, it will never converge. Online algorithm: considers one example at a time (NB looks at the entire data set). Error driven algorithm: updates the weights only when a mistake is made.

41 Perceptron Intuition

42 Perceptron We learn f: X → {-1, +1} represented as f = sgn(w·x), where X = {0,1}^n or X = R^n and w ∈ R^n. Given labeled examples: {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}:
1. Initialize w = 0 ∈ R^n
2. Cycle through all examples:
a. Predict the label of instance x to be y' = sgn(w·x)
b. If y' ≠ y, update the weight vector: w = w + r y x (r - a constant, learning rate). Otherwise, if y' = y, leave the weights unchanged.
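
A direct transcription of this algorithm into Python (a sketch; the AND-style toy data and the fixed epoch count are my additions):

```python
def perceptron(examples, rate=1.0, epochs=10):
    n = len(examples[0][0])
    w = [0.0] * n                       # 1. initialize w = 0
    for _ in range(epochs):             # 2. cycle through all examples
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if pred != y:               # update only on a mistake
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
    return w

# AND-like separable data over {0,1}^3; the last coordinate is a constant
# bias feature, so a separator through the origin suffices.
data = [((0, 0, 1), -1), ((0, 1, 1), -1), ((1, 0, 1), -1), ((1, 1, 1), 1)]
w = perceptron(data)
print(w)  # a weight vector that classifies all four examples correctly
```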

43 Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

44 Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a data set (γ) is the maximum margin possible for that dataset using any weight vector.

45 Mistake Bound for Perceptron Let D = {(x_i, y_i)} be a labeled dataset that is separable. Let ||x_i|| ≤ R for all examples. Let γ be the margin of the dataset D. Then, the perceptron algorithm will make at most R² / γ² mistakes on the data.
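
An empirical sketch of the bound on synthetic separable data. Since the margin measured against any particular unit-norm separator u lower-bounds the true margin γ, R²/margin_u² is a valid (looser) upper bound on the mistake count:

```python
import random

random.seed(1)
# Synthetic separable data: the label is the sign of the first coordinate,
# and points within 0.2 of the boundary are rejected to enforce a margin.
data = []
while len(data) < 200:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    if abs(x[0]) >= 0.2:
        data.append((x, 1 if x[0] > 0 else -1))

R = max((x[0] ** 2 + x[1] ** 2) ** 0.5 for x, _ in data)
gamma_u = min(y * x[0] for x, y in data)   # margin w.r.t. u = (1, 0); <= true gamma

mistakes, w = 0, [0.0, 0.0]
for _ in range(50):                        # enough passes to converge here
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1]) <= 0:   # mistake (or on the boundary)
            w = [w[0] + y * x[0], w[1] + y * x[1]]
            mistakes += 1

print(mistakes, "<=", (R / gamma_u) ** 2)  # the bound holds (usually loosely)
```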

46 Practical Example Task: context sensitive spelling: {principle, principal}, {weather, whether}. Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA.

47 Deceptive Reviews What should your learning algorithm look at? Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011

48 Deception Classification

49 Summary Classification is a basic tool for NLP. E.g., what is the topic of a document? Classifier: mapping from input to label. Label: binary or categorical. We saw two simple learning algorithms for finding the parameters of linear classification functions: Naïve Bayes and Perceptron. Next: more sophisticated algorithms, and applications (or how to get it to work!).

50 Questions?
