Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
Probabilistic Classification Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding
Posterior Decoding: Probabilistic Text Classification Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis P(class | observed data) ∝ class-based likelihood (language model) × prior probability of class, normalized by the observation likelihood (averaged over all classes)
Noisy Channel Model: Decode, Rerank. What I want to tell you (sports) vs. what you actually see (The Os lost again): decode the hypothesized intent (sad stories? sports?), reweight according to what's likely, and recover sports
Noisy Channel: machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, image captioning. Components: possible (clean) output, translation/decode model, (clean) language model, observed (noisy) text, observation (noisy) likelihood
Use Logarithms
Accuracy, Precision, and Recall Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected Confusion matrix: selected/guessed & actually correct = True Positive (TP); selected/guessed & actually incorrect = False Positive (FP); not selected & actually correct = False Negative (FN); not selected & actually incorrect = True Negative (TN)
A combined measure: F Weighted (harmonic) average of Precision & Recall: F_β = (1 + β²) P R / (β² P + R) Balanced F1 measure: β = 1, so F1 = 2 P R / (P + R)
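A minimal sketch of these metrics in Python, assuming we already have confusion-matrix counts tp, fp, and fn (the values in the example call are hypothetical):

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# e.g., 8 correct selections, 2 wrong selections, 4 missed items
print(precision_recall_f(tp=8, fp=2, fn=4))  # (0.8, ~0.667, ~0.727) with beta=1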
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
The Bag of Words Representation
Bag of Words Representation: the classifier γ(document) = c sees only word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...
Naïve Bayes Classifier (variables: label, text) Start with Bayes Rule: P(label | text) ∝ P(text | label) P(label) Q: Are we doing discriminative training or generative training? A: generative training
Naïve Bayes Classifier: label each word (token). Adopt the naïve bag-of-words representation: assume position doesn't matter, and assume the feature probabilities are independent given the class, so P(x_1, ..., x_n | c) = ∏_i P(x_i | c)
Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary
Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j
Brill and Banko (2001) With enough data, the classifier may not matter
Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j, and P(c_j) = |docs_j| / (# of documents) Calculate P(w_k | c_j) terms: Text_j = single doc containing all docs_j; for each word w_k in Vocabulary, n_k = # of occurrences of w_k in Text_j, and P(w_k | c_j) = n_k / (total # of tokens in Text_j), i.e. a class unigram LM
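A minimal sketch of this training procedure and the matching classifier in Python; the docs/labels inputs are hypothetical, and the add-one smoothing of the class unigram LM is one common choice rather than something the slide prescribes:

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels (parallel to docs)."""
    vocab = {w for doc in docs for w in doc}
    class_docs = defaultdict(list)
    for doc, c in zip(docs, labels):
        class_docs[c].append(doc)
    log_prior, log_lik = {}, {}
    for c, c_docs in class_docs.items():
        log_prior[c] = math.log(len(c_docs) / len(docs))          # P(c_j)
        counts = Counter(w for doc in c_docs for w in doc)        # n_k for each w_k
        total = sum(counts.values())
        # add-one smoothed class unigram LM: (n_k + 1) / (n + |V|)
        log_lik[c] = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
    return log_prior, log_lik

def classify_nb(doc, log_prior, log_lik):
    """Pick the class maximizing log P(c) + sum_i log P(w_i | c)."""
    scores = {c: log_prior[c] + sum(log_lik[c].get(w, 0.0) for w in doc) for c in log_prior}
    return max(scores, key=scores.get)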
Naïve Bayes and Language Modeling Naïve Bayes classifiers can use any sort of feature But if, as in the previous slides We use only word features we use all of the words in the text (not a subset) Then Naïve Bayes has an important similarity to language modeling
Sec. 13.2.1 Naïve Bayes as a Language Model Which class assigns the higher probability to s = "I love this fun film"? Positive Model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1 Negative Model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1 P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5e-7 P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1e-9 so P(s | pos) > P(s | neg)
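The arithmetic above can be checked in a few lines of Python (word probabilities copied from the two class models):

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
s = "I love this fun film".split()
p_pos, p_neg = 1.0, 1.0
for w in s:
    p_pos *= pos[w]   # P(s | pos)
    p_neg *= neg[w]   # P(s | neg)
print(p_pos, p_neg)   # ~5e-07 vs ~1e-09, so P(s | pos) > P(s | neg)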
Summary: Naïve Bayes is Not So Naïve Very Fast, low storage requirements Robust to Irrelevant Features Very good in domains with many equally important features Optimal if the independence assumptions hold Dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn't Without Issue Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there better ways of handling missing/noisy data? (automated, more principled)
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
Connections to Other Techniques Log-Linear Models are known under many names, depending on the view (as statistical regression, based in information theory, a form of, viewed as, to be cool today :)): (Multinomial) logistic regression, Softmax regression, Maximum Entropy models (MaxEnt), Generalized Linear Models, Discriminative Naïve Bayes, Very shallow (sigmoidal) neural nets
Maxent Models for Classification: Discriminatively or Generatively Trained Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier
Maximum Entropy (Log-linear) Models discriminatively trained: classify in one go
Maximum Entropy (Log-linear) Models generatively trained: learn to model language
Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ) ATTACK
Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK # killed: Type: Perp:
Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK
We need to score the different combinations. Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK
Score and Combine Our Possibilities score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK) score k (department, ATTACK) are all of these uncorrelated? COMBINE posterior probability of ATTACK
Score and Combine Our Possibilities score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK) COMBINE posterior probability of ATTACK Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. score(, ) = ATTACK score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 1
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ) ATTACK Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. SNAP(score(, )) ATTACK
What function operates on any real number? is never less than 0?
What function operates on any real number? is never less than 0? f(x) = exp(x)
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ) ATTACK Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. exp(score(, )) ATTACK
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ATTACK | doc ) exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) )
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ATTACK | doc ) exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) ) Learn the scores (but we'll declare what combinations should be looked at)
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ATTACK | doc ) exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) )
Maxent Modeling: Feature Functions p( ATTACK | doc ) Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. exp( weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) ) Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary)
More on Feature Functions Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise: templated, binary occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(attack | type): templated, real-valued occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK): non-templated, real-valued occurs(fatally shot, ATTACK) = count(fatally shot | ATTACK): non-templated, count-valued
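A small Python sketch of the two simplest styles above; the phrase/label arguments and the counts table are illustrative placeholders, not part of any particular library:

def make_binary_feature(target_phrase, target_label):
    """Templated, binary-valued: fires (1.0) only for one specific (phrase, label) pair."""
    def f(phrase, label):
        return 1.0 if phrase == target_phrase and label == target_label else 0.0
    return f

def make_count_feature(counts):
    """Non-templated, count-valued: how often (phrase, label) co-occurred in training data."""
    def f(phrase, label):
        return float(counts.get((phrase, label), 0))
    return f

f1 = make_binary_feature("fatally shot", "ATTACK")
print(f1("fatally shot", "ATTACK"), f1("fatally shot", "KIDNAP"))  # 1.0 0.0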
Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ATTACK | doc ) = (1/Z) exp( weight_1 * applies_1(fatally shot, ATTACK) + weight_2 * applies_2(seriously wounded, ATTACK) + weight_3 * applies_3(Shining Path, ATTACK) ) Q: How do we define Z?
Normalization for Classification Z = Σ_{label x'} exp( weight_1 * occurs_1(fatally shot, x') + weight_2 * occurs_2(seriously wounded, x') + weight_3 * occurs_3(Shining Path, x') ) p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
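A minimal sketch of the resulting classifier in Python; the weights dict and the feature_fn(doc, label) extractor are hypothetical names standing in for whatever feature functions you declared:

import math

def maxent_probs(doc, labels, weights, feature_fn):
    """p(x | y) = exp(theta . f(x, y)) / Z, with Z summed over all candidate labels x'."""
    scores = {x: sum(weights.get(name, 0.0) * value
                     for name, value in feature_fn(doc, x).items())
              for x in labels}
    m = max(scores.values())                      # subtract the max for numerical stability
    exp_scores = {x: math.exp(s - m) for x, s in scores.items()}
    Z = sum(exp_scores.values())
    return {x: v / Z for x, v in exp_scores.items()}

def classify(doc, labels, weights, feature_fn):
    probs = maxent_probs(doc, labels, weights, feature_fn)
    return max(probs, key=probs.get)              # classify doc y with label x in one go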
Normalization for Language Model general class-based (X) language model of doc y
Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case
Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case Simplifying assumption: maxent n-grams!
Understanding Conditioning Is this a good language model? (no)
Understanding Conditioning Is this a good posterior classifier? (no)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 11
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
p_θ(x | y) is the probabilistic model; next we need an objective (given observations)
Objective = Full Likelihood? These values can have very small magnitude: underflow. Differentiating this product could be a pain
Logarithms (0, 1] → (-∞, 0] Products → Sums log(ab) = log(a) + log(b) log(a/b) = log(a) - log(b) Inverse of exp log(exp(x)) = x
Log-Likelihood Wide range of (negative) numbers Sums are more stable Products → Sums log(ab) = log(a) + log(b) log(a/b) = log(a) - log(b) Differentiating this becomes nicer (even though Z depends on θ)
Log-Likelihood Wide range of (negative) numbers Sums are more stable Inverse of exp log(exp(x)) = x Differentiating this becomes nicer (even though Z depends on θ)
Log-Likelihood Wide range of (negative) numbers Sums are more stable; call this objective F(θ)
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
How will we optimize F(θ)? Calculus
[Plot: the objective F(θ) as a function of θ, with its maximum at θ*; F′(θ) denotes the derivative of F with respect to θ]
Example F(x) = -(x-2)² Differentiate: F′(x) = -2x + 4 Solve F′(x) = 0: x = 2
Common Derivative Rules
What if you can't find the roots? Follow the derivative. Set t = 0. Pick a starting value θ_t. Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t * g_t 5. Set t += 1 [Plot: iterates θ_0, θ_1, θ_2, θ_3 climbing F(θ) toward θ*, with values y_t and derivatives g_t at each step]
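The loop above, written as Python for the one-dimensional example F(x) = -(x-2)²; using a fixed step size ρ is a simplification, since the slide leaves ρ_t unspecified:

def gradient_ascent(F, F_prime, theta=0.0, rho=0.1, max_iters=1000, tol=1e-8):
    for t in range(max_iters):
        y = F(theta)                        # 1. value
        g = F_prime(theta)                  # 2. derivative
        theta_next = theta + rho * g        # 3.-4. step uphill
        if abs(theta_next - theta) < tol:   # converged
            return theta_next
        theta = theta_next
    return theta

F = lambda x: -(x - 2) ** 2
F_prime = lambda x: -2 * x + 4
print(gradient_ascent(F, F_prime))  # converges to ~2.0, matching the closed-form solution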
Gradient = Multi-variable derivative K-dimensional input K-dimensional output
Gradient Ascent
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
Expectations number of pieces of candy 1 2 3 4 5 6 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Expectations number of pieces of candy 1 2 3 4 5 6 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
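The same expectations in a couple of lines of Python (fair die vs. the loaded die from the slide); this is exactly the kind of weighted average that shows up below as the expected feature value under the model:

values = [1, 2, 3, 4, 5, 6]
p_fair = [1/6] * 6
p_loaded = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]
expectation = lambda p: sum(pi * v for pi, v in zip(p, values))
print(expectation(p_fair), expectation(p_loaded))  # 3.5 2.5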
Log-Likelihood Wide range of (negative) numbers Sums are more stable; call this objective F(θ) Differentiating this becomes nicer (even though Z depends on θ)
Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ thinks it computes for feature f_k, i.e. Σ_i Σ_{x'} f_k(x', y_i) p_θ(x' | y_i)
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 6
Log-Likelihood Gradient Derivation Use the (calculus) chain rule: ∇_θ log g(θ) = ∇_θ g(θ) / g(θ), where g(θ) = p_θ(x | y_i) is a scalar and ∇_θ g(θ) is a vector of functions
Log-Likelihood Gradient Derivation Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Gradient Optimization Set t = 0 Pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t * g_t 5. Set t += 1 Here the gradient components are ∂F/∂θ_k = Σ_i f_k(x_i, y_i) - Σ_i Σ_{x'} f_k(x', y_i) p_θ(x' | y_i)
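A sketch of that gradient in Python (observed feature totals minus model-expected feature totals), reusing the hypothetical maxent_probs and feature_fn from the earlier classification sketch:

def loglik_gradient(data, labels, weights, feature_fn):
    """data: list of (doc y_i, gold label x_i) pairs. Returns {feature name: dF/dtheta_k}."""
    grad = {}
    for y_i, x_i in data:
        # observed: add f_k(x_i, y_i) for the gold label
        for name, value in feature_fn(y_i, x_i).items():
            grad[name] = grad.get(name, 0.0) + value
        # expected: subtract sum_x' f_k(x', y_i) * p_theta(x' | y_i)
        probs = maxent_probs(y_i, labels, weights, feature_fn)
        for x_prime, p in probs.items():
            for name, value in feature_fn(y_i, x_prime).items():
                grad[name] = grad.get(name, 0.0) - p * value
    return grad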
Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
Preventing Extreme Values Naïve Bayes: extreme values are 0 probabilities Log-linear models: extreme values are large θ values, prevented with regularization
(Squared) L2 Regularization
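One common form (an assumption here, since the slide only names the technique) subtracts (λ/2)·‖θ‖² from the log-likelihood, so each gradient component gets an extra -λ·θ_k term; a one-line adjustment to the gradient sketch above, with λ a hyperparameter you choose:

def regularized_gradient(grad, weights, lam=0.1):
    """Gradient of F(theta) - (lam/2) * ||theta||^2, given the unregularized gradient."""
    return {k: g - lam * weights.get(k, 0.0) for k, g in grad.items()}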
https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 8
Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models
Revisiting the SNAP Function softmax
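The softmax "snap" as Python, with the usual subtract-the-max trick for numerical stability:

import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # [~0.09, ~0.24, ~0.67]: all positive, sums to 1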
N-gram Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i
Maxent Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i
Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s)? Compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i
Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s) for just the context? Can we learn word-specific weights (by type)? Compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i
Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, create/use distributed representations e_w (e_{i-3}, e_{i-2}, e_{i-1}); combine these representations with a matrix-vector product (C) into f; compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i
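A minimal numpy sketch of this forward pass in the style of Bengio et al. (2003); the dimensions, random weights, and tanh nonlinearity are illustrative assumptions, and biases plus the paper's direct word-to-output connections are omitted. E holds the embeddings e_w, W combines the concatenated context into f, and U holds the per-word output weights θ_{w_i}:

import numpy as np

V, d, h, n_ctx = 10000, 60, 50, 3        # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))              # word embeddings e_w
W = rng.normal(size=(h, n_ctx * d))      # combines the context representations
U = rng.normal(size=(V, h))              # word-specific output weights theta_{w_i}

def next_word_probs(context_ids):
    x = np.concatenate([E[i] for i in context_ids])  # e_{i-3}, e_{i-2}, e_{i-1}
    f = np.tanh(W @ x)                               # f(w_{i-3}, w_{i-2}, w_{i-1})
    scores = U @ f                                   # theta_{w_i} . f(...), one score per word
    scores -= scores.max()                           # stability trick before exponentiating
    p = np.exp(scores)
    return p / p.sum()                               # softmax over the next word

p = next_word_probs([17, 42, 7])
print(p.shape, p.sum())   # (10000,) 1.0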
A Neural Probabilistic Language Model, Bengio et al. (2003)
Baselines:
LM Name | N-gram | Params. | Test Ppl.
Interpolation | 3 | --- | 336
Kneser-Ney backoff | 3 | --- | 323
Kneser-Ney backoff | 5 | --- | 321
Class-based backoff | 3 | 500 classes | 312
Class-based backoff | 5 | 500 classes | 312
NPLM:
N-gram | Word Vector Dim. | Hidden Dim. | Mix with non-neural LM | Ppl.
5 | 60 | 50 | No | 268
5 | 60 | 50 | Yes | 257
5 | 30 | 100 | No | 276
5 | 30 | 100 | Yes | 252
"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)