Naïve Bayes, Maxent and Neural Models

Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

Probabilistic Classification. Directly model the posterior: discriminatively trained classifier. Model the posterior with Bayes rule: generatively trained classifier. Posterior classification/decoding: maximum a posteriori; noisy channel model decoding.

Posterior Decoding: Probabilistic Text Classification. Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. Decode with Bayes rule: p(class | observed data) = p(observed data | class) p(class) / p(observed data), i.e. the class-based likelihood (a language model) times the prior probability of the class, divided by the observation likelihood (averaged over all classes).

Noisy Channel Model: Decode and Rerank. What I want to tell you (hypothesized intent, e.g. sports, sad stories) vs. what you actually see ("The Os lost again"); decode by reweighting the hypotheses according to what's likely (here, sports).

Noisy Channel: machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, image captioning. Recover the possible (clean) output from the observed (noisy) text: the translation/decode model combines a (clean) language model with an observation (noisy) likelihood.

Use Logarithms
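To make the decoding rule (and the reason for logarithms) concrete, here is a minimal Python sketch, entirely my own illustration with made-up candidates and probabilities, of noisy-channel spelling correction: it picks the clean word maximizing log p(noisy | clean) + log p(clean), summing logs rather than multiplying small probabilities.

```python
import math

# Toy, made-up models for spelling correction of the observed (noisy) string "teh"
language_model = {"the": 0.05, "they": 0.01, "thee": 0.0001}      # p(clean)
channel_model = {("teh", "the"): 0.3,                             # p(noisy | clean)
                 ("teh", "they"): 0.05,
                 ("teh", "thee"): 0.1}

def decode(noisy, candidates):
    """Noisy-channel decoding: argmax over clean outputs of log p(noisy|clean) + log p(clean)."""
    return max(candidates,
               key=lambda clean: math.log(channel_model[(noisy, clean)])
                                 + math.log(language_model[clean]))

print(decode("teh", ["the", "they", "thee"]))  # -> 'the'
```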

Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected

                          Actually Correct      Actually Incorrect
Selected/guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

A combined measure: F. The weighted harmonic mean of precision (P) and recall (R): F_β = (1 + β²) · P · R / (β² · P + R). The balanced F1 measure sets β = 1, giving F1 = 2PR / (P + R).
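As a worked companion to these definitions, here is a minimal Python sketch (my own; the function and variable names are illustrative) computing accuracy, precision, recall, and F_β from the confusion-matrix counts above.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute accuracy, precision, recall, and F_beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Weighted harmonic mean of precision and recall; beta = 1 gives the balanced F1.
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_beta

# Example: 8 true positives, 2 false positives, 4 false negatives, 86 true negatives
print(classification_metrics(8, 2, 4, 86))  # accuracy 0.94, precision 0.8, recall ~0.67, F1 ~0.73
```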

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

The Bag of Words Representation

Bag of Words Representation: the classifier γ(d) = c sees only word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...
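To make the bag-of-words idea concrete, here is a minimal Python sketch (my own illustration, not course code): the representation keeps only word counts and throws away word order.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to its bag-of-words representation: word -> count."""
    return Counter(text.lower().split())

doc = "I love this sweet film I would recommend this film"
print(bag_of_words(doc))
# e.g. Counter({'i': 2, 'this': 2, 'film': 2, 'love': 1, 'sweet': 1, 'would': 1, 'recommend': 1})
```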

Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training?

Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training? A: generative training

Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Y i

Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Y i Assume position doesn t matter

Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Y i Assume position doesn t matter Assume the feature probabilities are independent given the class X

Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary

Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c j ) terms For each c j in C do docs j = all docs with class =c j

Brill and Banko (2001) With enough data, the classifier may not matter

Multinomial Naïve Bayes: Learning. From the training corpus, extract the Vocabulary. Calculate the P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j. Calculate the P(w_k | c_j) terms: Text_j = a single doc containing all of docs_j; for each word w_k in the Vocabulary, n_k = # of occurrences of w_k in Text_j, and P(w_k | c_j) is estimated from these counts as a class unigram LM.
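A minimal Python sketch of this training recipe (my own illustration; note that I add add-one smoothing when estimating P(w_k | c_j), which the transcribed slide does not show, and classification is done in log space as the "Use Logarithms" slide suggests):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns log priors, per-class word counts, vocab."""
    vocab = set()
    class_docs = Counter()              # number of docs per class
    word_counts = defaultdict(Counter)  # class -> word -> count
    for words, label in docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    return log_prior, word_counts, vocab

def classify_nb(words, log_prior, word_counts, vocab):
    """Return the label maximizing log P(c) + sum_k log P(w_k | c), with add-one smoothing."""
    best_label, best_score = None, float("-inf")
    for c in log_prior:
        total = sum(word_counts[c].values())
        score = log_prior[c]
        for w in words:
            if w in vocab:              # ignore out-of-vocabulary words
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

train = [("I love this fun film".split(), "pos"),
         ("I hate this sad film".split(), "neg")]
print(classify_nb("love this film".split(), *train_nb(train)))  # -> 'pos'
```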

Naïve Bayes and Language Modeling. Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.

Sec. 13.2.1 Naïve Bayes as a Language Model
Which class assigns the higher probability to s = "I love this fun film"?

        Positive Model    Negative Model
I       0.1               0.2
love    0.1               0.001
this    0.01              0.01
fun     0.05              0.005
film    0.1               0.1

P(s | pos) = 0.1 * 0.1 * 0.01 * 0.05 * 0.1 = 5e-7
P(s | neg) = 0.2 * 0.001 * 0.01 * 0.005 * 0.1 = 1e-9
P(s | pos) > P(s | neg)

Summary: Naïve Bayes is Not So Naïve. Very fast, low storage requirements. Robust to irrelevant features. Very good in domains with many equally important features. Optimal if the independence assumptions hold. A dependable baseline for text classification (but often not the best).

But: Naïve Bayes Isn't Without Issue. Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there better ways of handling missing/noisy data (automated, more principled)?

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

Connections to Other Techniques. Log-linear models are known under many names: (multinomial) logistic regression and softmax regression (as statistical regression), Maximum Entropy models (MaxEnt, based in information theory), Generalized Linear Models, a form of discriminative Naïve Bayes, and (to be cool today :)) very shallow (sigmoidal) neural nets.

Maxent Models for Classification: Discriminatively or Generatively Trained. Directly model the posterior: discriminatively trained classifier. Model the posterior with Bayes rule: generatively trained classifier.

Maximum Entropy (Log-linear) Models discriminatively trained: classify in one go

Maximum Entropy (Log-linear) Models generatively trained: learn to model language

Document Classification. "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region." Label: ATTACK. Goal: model p(ATTACK | document).

Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK # killed: Type: Perp:

We need to score the different combinations. Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK

Score and Combine Our Possibilities score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK) score k (department, ATTACK) are all of these uncorrelated? COMBINE posterior probability of ATTACK

Score and Combine Our Possibilities score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK) COMBINE posterior probability of ATTACK Q: What are the score and combine functions for Naïve Bayes?

Scoring Our Possibilities. For the Shining Path passage, score(doc, ATTACK) combines score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), ...

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 1

Maxent Modeling. For the Shining Path passage, model p(ATTACK | doc) as SNAP(score(doc, ATTACK)).

What function operates on any real number and is never less than 0? f(x) = exp(x)

Maxent Modeling. Model p(ATTACK | doc) with exp(score(doc, ATTACK)).

Maxent Modeling. p(ATTACK | doc) ∝ exp( score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + ... ). Learn the scores (but we'll declare what combinations should be looked at).

Maxent Modeling. p(ATTACK | doc) ∝ exp( weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + ... ).

Maxent Modeling: Feature Functions
p(ATTACK | doc) ∝ exp( weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + ... )
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
Example (templated, binary): occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise.

More on Feature Functions
Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.
Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)
Non-templated, real-valued: occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)
Non-templated, count-valued: occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)

Maxent Modeling. p(ATTACK | doc) = (1/Z) exp( weight_1 · applies_1(fatally shot, ATTACK) + weight_2 · applies_2(seriously wounded, ATTACK) + weight_3 · applies_3(Shining Path, ATTACK) + ... ). Q: How do we define Z?

Normalization for Classification. Z = Σ over labels x' of exp( weight_1 · occurs_1(fatally shot, x') + weight_2 · occurs_2(seriously wounded, x') + weight_3 · occurs_3(Shining Path, x') + ... ), so that p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go.
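Here is a minimal Python sketch of this score-then-normalize recipe (my own illustration; the label set, phrases, and weights are hypothetical): each label's unnormalized score is exp(θ · f(x, y)), and Z sums those scores over all candidate labels.

```python
import math

LABELS = ["ATTACK", "NO_EVENT"]  # hypothetical label set for illustration

def features(doc, label):
    """Binary, templated feature functions: does this phrase occur, paired with this label?"""
    phrases = ["fatally shot", "seriously wounded", "Shining Path"]
    return {(p, label): 1.0 if p in doc else 0.0 for p in phrases}

def posterior(doc, weights):
    """p(x | y) = exp(theta . f(x, y)) / Z, with Z summing over all candidate labels x'."""
    scores = {}
    for label in LABELS:
        scores[label] = math.exp(sum(weights.get(k, 0.0) * v
                                     for k, v in features(doc, label).items()))
    Z = sum(scores.values())
    return {label: s / Z for label, s in scores.items()}

doc = "Three people have been fatally shot ... a Shining Path attack ..."
weights = {("fatally shot", "ATTACK"): 1.5, ("Shining Path", "ATTACK"): 2.0}
print(posterior(doc, weights))  # ATTACK gets most of the probability mass
```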

Normalization for Language Model: a general class-based (x) language model of doc y. Can be significantly harder in the general case. Simplifying assumption: maxent n-grams!

Understanding Conditioning
Is this a good language model? (no)
Is this a good posterior classifier? (no)

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 11

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

The probabilistic model p_θ(x | y); the objective (given observations).

Objective = Full Likelihood? These values can have very small magnitude, leading to underflow. Differentiating this product could be a pain.

Logarithms map (0, 1] to (-∞, 0]. Products become sums: log(ab) = log(a) + log(b) and log(a/b) = log(a) - log(b). Log is the inverse of exp: log(exp(x)) = x.

Log-Likelihood. Taking logs gives a wide range of (negative) numbers, but sums are more stable than products: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b), and log(exp(x)) = x. Call the resulting objective F(θ). Differentiating it becomes nicer (even though Z depends on θ).
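Written out, under the model defined above (p_θ(x | y) ∝ exp(θ · f(x, y))), the log-likelihood objective is presumably the standard maxent form; this is my reconstruction, since the rendered formula did not survive transcription:

```latex
F(\theta) \;=\; \sum_i \log p_\theta(x_i \mid y_i)
         \;=\; \sum_i \Big( \theta \cdot f(x_i, y_i) \;-\; \log Z_\theta(y_i) \Big),
\qquad
Z_\theta(y) \;=\; \sum_{x'} \exp\big(\theta \cdot f(x', y)\big).
```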

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

How will we optimize F(θ)? Calculus

[Plot: the objective F(θ) as a function of θ, with its maximum at θ*; F'(θ) denotes the derivative of F with respect to θ.]

Example: F(x) = -(x - 2)². Differentiate: F'(x) = -2x + 4. Solve F'(x) = 0: x = 2.

Common Derivative Rules

What if you can't find the roots? Follow the derivative.
Set t = 0 and pick a starting value θ_t. Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
[Plot: successive iterates θ_0, θ_1, θ_2, θ_3 climbing F(θ) toward the maximizer θ*, with the value y_t and derivative g_t shown at each step.]
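As a sanity check, here is a short Python sketch (my own) that follows exactly these steps on the earlier example F(x) = -(x - 2)², with a constant scaling factor ρ_t = 0.1:

```python
def F(theta):
    return -(theta - 2) ** 2

def F_prime(theta):
    return -2 * theta + 4

theta, rho = 0.0, 0.1            # starting value and constant scaling factor
for t in range(100):             # "until converged", capped for the sketch
    y = F(theta)                 # 1. get value
    g = F_prime(theta)           # 2. get derivative
    new_theta = theta + rho * g  # 3-4. scale the derivative and step uphill
    if abs(new_theta - theta) < 1e-8:
        break
    theta = new_theta
print(theta)                     # converges toward the maximizer theta* = 2
```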

Gradient = Multi-variable derivative K-dimensional input K-dimensional output

Gradient Ascent

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

Expectations (number of pieces of candy, outcomes 1-6):
Uniform probabilities: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
Skewed probabilities: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5

Log-Likelihood (recap). Wide range of (negative) numbers; sums are more stable. This objective is F(θ), and differentiating it becomes nicer (even though Z depends on θ).

Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ expects for feature f_k (summing over all labels x' for each training instance y_i).
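In symbols, this is my reconstruction of the gradient (the rendered formula did not survive transcription), using the convention above where x ranges over labels and y_i is the i-th training document:

```latex
\frac{\partial F(\theta)}{\partial \theta_k}
  \;=\; \sum_i f_k(x_i, y_i)
  \;-\; \sum_i \sum_{x'} p_\theta(x' \mid y_i)\, f_k(x', y_i)
```

i.e., the empirical total of feature f_k minus its total expected value under the current model; the gradient is zero where the two match.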

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 6

Log-Likelihood Gradient Derivation. For each training example y_i, use the (calculus) chain rule: ∇_θ log g(h(θ)) = (g'(h(θ)) / g(h(θ))) · ∇_θ h(θ), where the first factor is a scalar (it works out to involve p(x | y_i)) and ∇_θ h(θ) is a vector of functions.

Log-Likelihood Gradient Derivation Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

Gradient Optimization. Set t = 0; pick a starting value θ_t. Until converged: 1. Get value y_t = F(θ_t). 2. Get gradient g_t = F'(θ_t), with components ∂F/∂θ_k = Σ_i f_k(x_i, y_i) - Σ_i Σ_{x'} f_k(x', y_i) p_θ(x' | y_i). 3. Get scaling factor ρ_t. 4. Set θ_{t+1} = θ_t + ρ_t · g_t. 5. Set t += 1.
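Continuing the hypothetical scoring sketch from the normalization section (this reuses the features, posterior, LABELS, doc, and weights defined there, so it is runnable when appended to that sketch), here is a minimal Python sketch of step 2 for the maxent objective: each gradient component is the observed feature total minus its expected total under the current model p_θ.

```python
def maxent_gradient(data, weights):
    """data: list of (doc, gold_label). Gradient of the log-likelihood w.r.t. each weight:
    observed feature totals minus feature totals expected under p_theta(x' | doc)."""
    grad = {}
    for doc_text, gold in data:
        # empirical (observed) feature values
        for k, v in features(doc_text, gold).items():
            grad[k] = grad.get(k, 0.0) + v
        # expected feature values under the current model
        p = posterior(doc_text, weights)
        for label in LABELS:
            for k, v in features(doc_text, label).items():
                grad[k] = grad.get(k, 0.0) - p[label] * v
    return grad

# One ascent step with scaling factor rho, on a one-document training set:
rho = 0.5
data = [(doc, "ATTACK")]
g = maxent_gradient(data, weights)
weights = {k: weights.get(k, 0.0) + rho * gk for k, gk in g.items()}
```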


Preventing Extreme Values. In Naïve Bayes, extreme values are 0 probabilities; in log-linear models, extreme values are large θ values, which motivates regularization.

(Squared) L2 Regularization
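The slide's formula is not in the transcription; the standard form of a (squared) L2-regularized maxent objective, which is presumably what was shown, subtracts a penalty on large weights:

```latex
F_{\text{reg}}(\theta) \;=\; \sum_i \Big( \theta \cdot f(x_i, y_i) - \log Z_\theta(y_i) \Big)
\;-\; \frac{\lambda}{2}\, \lVert \theta \rVert_2^2
```

where λ ≥ 0 controls how strongly large θ values are penalized.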

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/ https://goo.gl/bqcdh9 Lesson 8

Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

Revisiting the SNAP Function: softmax
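Concretely (standard definition, not transcribed from the slide), the softmax turns an arbitrary vector of real-valued scores into a probability distribution, generalizing the exp-then-normalize ("SNAP") recipe used above:

```latex
\mathrm{softmax}(z)_k \;=\; \frac{\exp(z_k)}{\sum_{j} \exp(z_j)}
```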

N-gram Language Models. Given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.
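A minimal count-based Python sketch of this estimate (my own illustration; the toy corpus is made up):

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=4):
    """Count n-grams; the first n-1 words are the context, the last is the predicted word."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][word] += 1
    return counts

def p_next(counts, context):
    """Relative-frequency estimate of p(w_i | context)."""
    c = counts[tuple(context)]
    total = sum(c.values())
    return {w: k / total for w, k in c.items()} if total else {}

tokens = "the Os lost again and the Os lost badly".split()
counts = train_ngram(tokens, n=4)
print(p_next(counts, ["the", "Os", "lost"]))  # {'again': 0.5, 'badly': 0.5}
```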

Maxent Language Models. Given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.

Neural Language Models
Given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely and predict the next word w_i.
Can we learn the feature function(s)? p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
Can we learn the feature function(s) for just the context, with word-specific weights (by type)? p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
Create/use distributed representations e_w, giving e_{i-3}, e_{i-2}, e_{i-1}.
Combine these representations via a matrix-vector product (C), giving the learned feature function f.
Then p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predicts the next word w_i.
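A minimal NumPy sketch of this pipeline (my own, loosely in the spirit of the Bengio et al. feed-forward model discussed next; all dimensions, names, and parameter values are illustrative and untrained): look up distributed representations, combine them with a learned transformation, and apply a softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 30, 100                 # vocab size, embedding dim, hidden dim (illustrative)

E = rng.normal(size=(V, d))             # word embeddings e_w (one row per word type)
W = rng.normal(size=(h, 3 * d))         # combines the three context embeddings
theta = rng.normal(size=(V, h))         # word-specific output weights theta_{w_i}

def softmax(z):
    z = z - z.max()                     # subtract the max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def next_word_probs(context_ids):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(theta . f(context))."""
    x = np.concatenate([E[i] for i in context_ids])  # create/use distributed representations
    f = np.tanh(W @ x)                               # combine them (learned feature function)
    return softmax(theta @ f)                        # beliefs over the whole vocabulary

probs = next_word_probs([3, 14, 159])   # three context word ids
print(probs.shape, probs.sum())         # (1000,) ~1.0
```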

A Neural Probabilistic Language Model, Bengio et al. (2003)

Baselines:
LM Name              N-gram   Params.       Test Ppl.
Interpolation        3        ---           336
Kneser-Ney backoff   3        ---           323
Kneser-Ney backoff   5        ---           321
Class-based backoff  3        500 classes   312
Class-based backoff  5        500 classes   312

NPLM:
N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)