Naïve Bayes, Maxent and Neural Models

1 Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP

2 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

3 Probabilistic Classification Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

4 Posterior Decoding: Probabilistic Text Classification Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis class class-based likelihood (language model) prior probability of class observed data observation likelihood (averaged over all classes)

5 Noisy Channel Model Decode, Rerank: what I want to tell you (sports) → what you actually see (The O's lost again) → hypothesized intent (sad stories? sports?) → reweight according to what's likely → sports

6 Noisy Channel Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning possible (clean) output translation/ decode model (clean) language model observed (noisy) text observation (noisy) likelihood

7 Use Logarithms

8 Accuracy, Precision, and Recall Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected Confusion matrix: selected/guessed & actually correct = True Positive (TP); selected/guessed & actually incorrect = False Positive (FP); not selected/not guessed & actually correct = False Negative (FN); not selected/not guessed & actually incorrect = True Negative (TN)

9 A combined measure: F Weighted (harmonic) average of Precision & Recall: F_β = (1 + β²) · P · R / (β² · P + R) Balanced F1 measure: β = 1
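
As a concrete check of these definitions, here is a minimal Python sketch (the counts tp, fp, fn and the beta argument are illustrative names, not from the slides):

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f(8, 2, 4))  # (0.8, 0.666..., 0.727...) with beta = 1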

10 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

11 The Bag of Words Representation

12 The Bag of Words Representation

13 The Bag of Words Representation

14 Bag of Words Representation [figure: a review document reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1; the classifier γ(document) = c maps those counts to a class]
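
To make the representation concrete, a minimal Python sketch (the example sentence is mine, not from the slide):

from collections import Counter

def bag_of_words(text):
    """Map a document to unordered word counts; position is discarded."""
    return Counter(text.lower().split())

doc = "I loved it I recommend it"
print(bag_of_words(doc))  # Counter({'i': 2, 'it': 2, 'loved': 1, 'recommend': 1})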

15 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training?

16 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training? A: generative training

17 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation

18 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Assume position doesn't matter

19 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Assume position doesn't matter Assume the feature probabilities are independent given the class: p(w_1, …, w_n | c) = ∏_i p(w_i | c)

20 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary

21 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j

22 Brill and Banko (2001) With enough data, the classifier may not matter

23 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j Calculate P(w_k | c_j) terms: Text_j = single doc containing all docs_j; for each word w_k in Vocabulary, n_k = # of occurrences of w_k in Text_j; P(w_k | c_j) = n_k / (total # of tokens in Text_j), i.e. the class unigram LM
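
A minimal Python sketch of this training loop, using plain maximum-likelihood counts as on the slide (no smoothing; the function and variable names are mine):

from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of classes c_j."""
    classes = set(labels)
    vocab = {w for doc in docs for w in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}  # P(c_j)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)          # Text_j = all docs with class c_j
    likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # class unigram LM: P(w_k | c_j) = n_k / total tokens in Text_j
        likelihood[c] = {w: word_counts[c][w] / total for w in vocab}
    return prior, likelihood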

24 Naïve Bayes and Language Modeling Naïve Bayes classifiers can use any sort of feature But if, as in the previous slides, we use only word features and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling

25 Naïve Bayes as a Language Model Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1; Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

26 Naïve Bayes as a Language Model Which class assigns the higher probability to s? Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1; Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1; sentence s: I love this fun film

27 Naïve Bayes as a Language Model Which class assigns the higher probability to s? Comparing word by word for s = I love this fun film: I 0.1 vs 0.2, love 0.1 vs 0.001, this 0.01 vs 0.01, fun 0.05 vs 0.005, film 0.1 vs 0.1

28 Naïve Bayes as a Language Model Which class assigns the higher probability to s = I love this fun film? P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 5e-7; P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 1e-9; P(s | pos) > P(s | neg)

29 Summary: Naïve Bayes is Not So Naïve Very Fast, low storage requirements Robust to Irrelevant Features Very good in domains with many equally important features Optimal if the independence assumptions hold Dependable baseline for text classification (but often not the best)

30 But: Naïve Bayes Isn't Without Issue Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there better ways of handling missing/noisy data? (automated, more principled)

31 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

32 Connections to Other Techniques Log-Linear Models go by many names depending on the view (as statistical regression, based in information theory, to be cool today :)): (Multinomial) logistic regression, Softmax regression, Maximum Entropy models (MaxEnt), Generalized Linear Models, Discriminative Naïve Bayes, Very shallow (sigmoidal) neural nets

33 Maxent Models for Classification: Discriminatively or Generatively Trained Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier

34 Maximum Entropy (Log-linear) Models discriminatively trained: classify in one go

35 Maximum Entropy (Log-linear) Models generatively trained: learn to model language

36 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. Label: ATTACK; we want p(ATTACK | doc)

37 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK # killed: Type: Perp:

38 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK

43 We need to score the different combinations. Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK

44 Score and Combine Our Possibilities score_1(fatally shot, ATTACK) score_2(seriously wounded, ATTACK) score_3(Shining Path, ATTACK) … score_k(department, ATTACK) are all of these uncorrelated? COMBINE → posterior probability of ATTACK

45 Score and Combine Our Possibilities score_1(fatally shot, ATTACK) score_2(seriously wounded, ATTACK) score_3(Shining Path, ATTACK) COMBINE → posterior probability of ATTACK Q: What are the score and combine functions for Naïve Bayes?

46 Scoring Our Possibilities Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. score(doc, ATTACK) = score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …

47 Lesson 1

48 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as SNAP(score(doc, ATTACK))

49 What function operates on any real number? is never less than 0?

50 What function operates on any real number? is never less than 0? f(x) = exp(x)

51 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score(doc, ATTACK))

52 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …)

53 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …) Learn the scores (but we'll declare what combinations should be looked at)

54 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + …)

55 Maxent Modeling: Feature Functions Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + …) Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary)

56 More on Feature Functions Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary) occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type) (templated, real-valued) occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK) (non-templated, real-valued) ??? (non-templated, count-valued)

57 More on Feature Functions Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary) occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type) (templated, real-valued) occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK) (non-templated, real-valued) occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK) (non-templated, count-valued)
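
A small Python sketch of a templated binary feature function like occurs_{target,type} above; representing the document as a set of phrases is my simplification:

def make_occurs_feature(target, event_type):
    """Template: returns a binary feature that fires when the target phrase
    appears in the document and the candidate label matches the event type."""
    def occurs(doc_phrases, label):
        return 1.0 if (target in doc_phrases and label == event_type) else 0.0
    return occurs

f1 = make_occurs_feature("fatally shot", "ATTACK")
print(f1({"fatally shot", "seriously wounded"}, "ATTACK"))  # 1.0
print(f1({"fatally shot"}, "KIDNAPPING"))                   # 0.0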

58 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) = (1/Z) exp(weight_1 · applies_1(fatally shot, ATTACK) + weight_2 · applies_2(seriously wounded, ATTACK) + weight_3 · applies_3(Shining Path, ATTACK) + …) Q: How do we define Z?

59 Normalization for Classification Z = Σ over labels x' of exp(weight_1 · occurs_1(fatally shot, x') + weight_2 · occurs_2(seriously wounded, x') + weight_3 · occurs_3(Shining Path, x') + …), so that p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
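
Putting the weights, feature functions, and normalizer together, a minimal Python sketch in the slide's notation (x = label, y = document; the names are mine):

import math

def maxent_posterior(weights, features, doc, labels):
    """p(x | y) = exp(theta . f(x, y)) / Z, with Z summing over all labels x'."""
    scores = {x: math.exp(sum(w * f(doc, x) for w, f in zip(weights, features)))
              for x in labels}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}

# the label with the highest posterior classifies the document in one go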

60 Normalization for Language Model general class-based (X) language model of doc y

61 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case

62 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case Simplifying assumption: maxent n-grams!

63 Understanding Conditioning Is this a good language model?

64 Understanding Conditioning Is this a good language model?

65 Understanding Conditioning Is this a good language model? (no)

66 Understanding Conditioning Is this a good posterior classifier? (no)

67 Lesson 11

68 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

69 p_θ(x | y): probabilistic model; objective (given observations)

70 Objective = Full Likelihood? These values can have very small magnitude (underflow) Differentiating this product could be a pain

71 Logarithms map (0, 1] to (-∞, 0] Products become Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b) Inverse of exp: log(exp(x)) = x

72 Log-Likelihood Wide range of (negative) numbers Sums are more stable Products become Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b) Differentiating this becomes nicer (even though Z depends on θ)

73 Log-Likelihood Wide range of (negative) numbers Sums are more stable Inverse of exp: log(exp(x)) = x Differentiating this becomes nicer (even though Z depends on θ)

74 Log-Likelihood Wide range of (negative) numbers Sums are more stable; call this objective F(θ)
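
Since the slide's formula is an image, here is a standard reconstruction of the log-likelihood objective F(θ) for the model p_θ(x | y) ∝ exp(θ · f(x, y)) defined above:

F(\theta) = \sum_i \log p_\theta(x_i \mid y_i)
          = \sum_i \left[ \theta \cdot f(x_i, y_i) - \log \sum_{x'} \exp\big(\theta \cdot f(x', y_i)\big) \right]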

75 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

76 How will we optimize F(θ)? Calculus

77 [plot: the objective F(θ) as a function of θ]

78 [plot: F(θ) as a function of θ, with its maximizer θ*]

79 [plot: F(θ), its maximizer θ*, and the derivative F′(θ) (derivative of F wrt θ)]

80 Example F(x) = -(x-2)², differentiate: F′(x) = -2x + 4, solve F′(x) = 0: x = 2

81 Common Derivative Rules

82 What if you can't find the roots? Follow the derivative [plot: F(θ), its derivative F′(θ), and the maximizer θ*]

83 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) [plot: starting point (θ_0, y_0)]

84 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) [plot: derivative g_0 at θ_0]

85 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: step from θ_0 to θ_1]

86 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2]

87 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2 → θ_3]

88 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2 → θ_3, approaching θ*]
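
The loop above transcribes directly into code; a minimal Python sketch (a fixed step size rho stands in for the scaling factor ρ_t, and the convergence test is mine):

def gradient_ascent(F, dF, theta0, rho=0.1, tol=1e-6, max_iters=1000):
    """Follow the derivative uphill until the value stops improving."""
    theta, prev = theta0, float("-inf")
    for t in range(max_iters):
        y = F(theta)                # 1. get value
        if abs(y - prev) < tol:     # converged
            break
        g = dF(theta)               # 2. get derivative
        theta = theta + rho * g     # 3-4. scale the derivative and step
        prev = y
    return theta

# Example from the earlier slide: F(x) = -(x - 2)^2 has its maximum at x = 2
print(gradient_ascent(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0))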

89 Gradient = Multi-variable derivative K-dimensional input K-dimensional output

90-95 Gradient Ascent [sequence of plots: successive gradient-ascent steps climbing a multivariate objective]

96 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

97 Expectations: number of pieces of candy, outcomes 1 through 6 each with probability 1/6: E = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

98 Expectations: number of pieces of candy, outcomes 1 through 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10: E = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5
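
The same computation in a short Python sketch, reusing the values and probabilities from the candy example:

def expectation(values, probs):
    """E[X] = sum over outcomes of value * probability."""
    return sum(v * p for v, p in zip(values, probs))

pieces = [1, 2, 3, 4, 5, 6]
print(expectation(pieces, [1/6] * 6))                            # 3.5 (uniform)
print(expectation(pieces, [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]))  # 2.5 (skewed)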

101 Log-Likelihood Wide range of (negative) numbers Sums are more stable = F(θ) Differentiating this becomes nicer (even though Z depends on θ)

102 Log-Likelihood Gradient Each component k is the difference between:

103 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data

104 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data and the total value the current model p_θ thinks it computes for feature f_k (its expected value, summing over all labels x' for each example y_i)
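
A minimal Python sketch of that gradient (observed feature totals minus model-expected totals); it assumes the maxent_posterior function sketched earlier and training data given as (label x_i, doc y_i) pairs:

def loglik_gradient(weights, features, data, labels):
    """grad_k = sum_i f_k(x_i, y_i) - sum_i sum_x' p_theta(x' | y_i) f_k(x', y_i)."""
    grad = [0.0] * len(features)
    for x_i, y_i in data:
        posterior = maxent_posterior(weights, features, y_i, labels)
        for k, f in enumerate(features):
            observed = f(y_i, x_i)                                    # value in the data
            expected = sum(posterior[x] * f(y_i, x) for x in labels)  # value the model expects
            grad[k] += observed - expected
    return grad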

105 Lesson 6

106 Log-Likelihood Gradient Derivation

107 Log-Likelihood Gradient Derivation

108 Log-Likelihood Gradient Derivation use the (calculus) chain rule to differentiate log g(h(θ)), where g(h(θ)) is the scalar p(x | y_i) and h(θ) is a vector of functions

109 Log-Likelihood Gradient Derivation Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

110 Gradient Optimization Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 with ∂F/∂θ_k = Σ_i f_k(x_i, y_i) - Σ_i Σ_{x'} f_k(x', y_i) · p_θ(x' | y_i)

111 Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

112 Preventing Extreme Values Naïve Bayes Extreme values are 0 probabilities

113 Preventing Extreme Values Naïve Bayes Log-linear models Extreme values are 0 probabilities Extreme values are large θ values

114 Preventing Extreme Values Naïve Bayes Log-linear models Extreme values are 0 probabilities Extreme values are large θ values regularization

115 (Squared) L2 Regularization
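
The slide's formula is an image; a standard form of the squared-L2-regularized objective, which is presumably what it shows, subtracts a penalty on large weights:

F_{\mathrm{reg}}(\theta) = F(\theta) - \frac{\lambda}{2} \lVert \theta \rVert_2^2,
\qquad
\frac{\partial F_{\mathrm{reg}}}{\partial \theta_k} = \frac{\partial F}{\partial \theta_k} - \lambda \theta_k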

116 Lesson 8

117 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

118 Revisiting the SNAP Function softmax

119 Revisiting the SNAP Function softmax
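
Written out, the SNAP-then-normalize operation the earlier slides built up is the softmax over scores z_1, …, z_K:

\operatorname{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}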

120 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} predict the next word w_i

121 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i) predict the next word w_i

122 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i) predict the next word w_i
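
A minimal Python sketch of the count-based estimate above (maximum likelihood, no smoothing; the names are mine):

from collections import Counter

def train_ngram(tokens, n=4):
    """Count n-grams and their (n-1)-gram histories, e.g. n=4 for w_{i-3}..w_i."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # count only histories that have a continuation, so the estimates normalize
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return ngrams, histories

def ngram_prob(ngrams, histories, history, word):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = count(history, w_i) / count(history)."""
    h = tuple(history)
    return ngrams[h + (word,)] / histories[h] if histories[h] else 0.0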

123 Maxent Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)) predict the next word w_i

124 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} can we learn the feature function(s)? compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)) predict the next word w_i

125 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)? compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

126 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

127 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

128 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] learn the output weights θ_{w_i} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

129 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] learn the output weights θ_{w_i} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i
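
A minimal numpy sketch of this kind of feedforward neural language model (embedding lookup, concatenation, one hidden combination, softmax over the vocabulary); the dimensions and names are illustrative and training is not shown:

import numpy as np

rng = np.random.default_rng(0)
V, d, h, context = 10000, 50, 100, 3               # vocab size, embed dim, hidden dim, n-1

E = rng.normal(scale=0.1, size=(V, d))              # word embeddings e_w
C = rng.normal(scale=0.1, size=(h, context * d))    # combines the context embeddings
theta = rng.normal(scale=0.1, size=(V, h))          # per-word output weights theta_{w_i}

def softmax(z):
    z = z - z.max()            # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(context_ids):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(theta . f(context))."""
    f = np.tanh(C @ np.concatenate([E[i] for i in context_ids]))  # f(w_{i-3..i-1})
    return softmax(theta @ f)

p = next_word_distribution([17, 42, 7])              # arbitrary context word ids
print(p.shape, p.sum())                              # (10000,) 1.0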

130 A Neural Probabilistic Language Model, Bengio et al. (2003) Baselines [table: LM name (interpolation; Kneser-Ney backoff; class-based backoff with varying numbers of classes, e.g. 500), n-gram order, number of parameters, test perplexity]

131 A Neural Probabilistic Language Model, Bengio et al. (2003) Baselines [table as above] NPLM results [table: n-gram order, word vector dimension, hidden dimension, whether mixed with a non-neural LM, test perplexity; one configuration mixed with a non-neural LM reaches perplexity 252]

132 A Neural Probabilistic Language Model, Bengio et al. (2003) [tables as above] "we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)
