Naïve Bayes, Maxent and Neural Models

1 Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP

2 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

3 Probabilistic Classification Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

4 Posterior Decoding: Probabilistic Text Classification Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis class class-based likelihood (language model) prior probability of class observed data observation likelihood (averaged over all classes)

5 Noisy Channel Model Decode, Rerank: what I want to tell you (sports) → what you actually see (The O's lost again) → hypothesized intent (sad stories? sports?) → reweight according to what's likely → sports

6 Noisy Channel Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning possible (clean) output translation/ decode model (clean) language model observed (noisy) text observation (noisy) likelihood

7 Use Logarithms

8 Accuracy, Precision, and Recall Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected Confusion matrix: selected/guessed & actually correct = True Positive (TP); selected/guessed & actually incorrect = False Positive (FP); not selected/not guessed & actually correct = False Negative (FN); not selected/not guessed & actually incorrect = True Negative (TN)

9 A combined measure: F Weighted (harmonic) average of Precision & Recall: F_β = (1 + β²) · P · R / (β² · P + R) Balanced F1 measure: β = 1
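
As a concrete check of these definitions, here is a minimal Python sketch (the counts tp, fp, fn and the beta argument are illustrative names, not from the slides):

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f(8, 2, 4))  # (0.8, 0.666..., 0.727...) with beta = 1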

10 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

11 The Bag of Words Representation

12 The Bag of Words Representation

13 The Bag of Words Representation

14 Bag of Words Representation [figure: a review document reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1; the classifier γ(document) = c maps those counts to a class]
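
To make the representation concrete, a minimal Python sketch (the example sentence is mine, not from the slide):

from collections import Counter

def bag_of_words(text):
    """Map a document to unordered word counts; position is discarded."""
    return Counter(text.lower().split())

doc = "I loved it I recommend it"
print(bag_of_words(doc))  # Counter({'i': 2, 'it': 2, 'loved': 1, 'recommend': 1})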

15 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training?

16 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training? A: generative training

17 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation

18 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Assume position doesn't matter

19 Naïve Bayes Classifier label each word (token) Adopt naïve bag of words representation Assume position doesn't matter Assume the feature probabilities are independent given the class: p(w_1, …, w_n | c) = ∏_i p(w_i | c)

20 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary

21 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j

22 Brill and Banko (2001) With enough data, the classifier may not matter

23 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j Calculate P(w_k | c_j) terms: Text_j = single doc containing all docs_j; for each word w_k in Vocabulary, n_k = # of occurrences of w_k in Text_j; P(w_k | c_j) = n_k / (total # of tokens in Text_j), i.e. the class unigram LM
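
A minimal Python sketch of this training loop, using plain maximum-likelihood counts as on the slide (no smoothing; the function and variable names are mine):

from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of classes c_j."""
    classes = set(labels)
    vocab = {w for doc in docs for w in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}  # P(c_j)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)          # Text_j = all docs with class c_j
    likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # class unigram LM: P(w_k | c_j) = n_k / total tokens in Text_j
        likelihood[c] = {w: word_counts[c][w] / total for w in vocab}
    return prior, likelihood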

24 Naïve Bayes and Language Modeling Naïve Bayes classifiers can use any sort of feature But if, as in the previous slides, we use only word features and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling

25 Naïve Bayes as a Language Model Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1; Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

26 Naïve Bayes as a Language Model Which class assigns the higher probability to s? Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1; Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1; sentence s: I love this fun film

27 Naïve Bayes as a Language Model Which class assigns the higher probability to s? Comparing word by word for s = I love this fun film: I 0.1 vs 0.2, love 0.1 vs 0.001, this 0.01 vs 0.01, fun 0.05 vs 0.005, film 0.1 vs 0.1

28 Naïve Bayes as a Language Model Which class assigns the higher probability to s = I love this fun film? P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 5e-7; P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 1e-9; P(s | pos) > P(s | neg)

29 Summary: Naïve Bayes is Not So Naïve Very Fast, low storage requirements Robust to Irrelevant Features Very good in domains with many equally important features Optimal if the independence assumptions hold Dependable baseline for text classification (but often not the best)

30 But: Naïve Bayes Isn't Without Issue Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there better ways of handling missing/noisy data? (automated, more principled)

31 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

32 Connections to Other Techniques Log-Linear Models go by many names depending on the view (as statistical regression, based in information theory, to be cool today :)): (Multinomial) logistic regression, Softmax regression, Maximum Entropy models (MaxEnt), Generalized Linear Models, Discriminative Naïve Bayes, Very shallow (sigmoidal) neural nets

33 Maxent Models for Classification: Discriminatively or Generatively Trained Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier

34 Maximum Entropy (Log-linear) Models discriminatively trained: classify in one go

35 Maximum Entropy (Log-linear) Models generatively trained: learn to model language

36 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. Label: ATTACK; we want p(ATTACK | doc)

37 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK # killed: Type: Perp:

38 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK

43 We need to score the different combinations. Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK

44 Score and Combine Our Possibilities score_1(fatally shot, ATTACK) score_2(seriously wounded, ATTACK) score_3(Shining Path, ATTACK) … score_k(department, ATTACK) are all of these uncorrelated? COMBINE → posterior probability of ATTACK

45 Score and Combine Our Possibilities score_1(fatally shot, ATTACK) score_2(seriously wounded, ATTACK) score_3(Shining Path, ATTACK) COMBINE → posterior probability of ATTACK Q: What are the score and combine functions for Naïve Bayes?

46 Scoring Our Possibilities Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. score(doc, ATTACK) = score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …

47 Lesson 1

48 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as SNAP(score(doc, ATTACK))

49 What function operates on any real number? is never less than 0?

50 What function operates on any real number? is never less than 0? f(x) = exp(x)

51 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score(doc, ATTACK))

52 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …)

53 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + …) Learn the scores (but we'll declare what combinations should be looked at)

54 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + …)

55 Maxent Modeling: Feature Functions Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) modeled as exp(weight_1 · occurs_1(fatally shot, ATTACK) + weight_2 · occurs_2(seriously wounded, ATTACK) + weight_3 · occurs_3(Shining Path, ATTACK) + …) Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary)

56 More on Feature Functions Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary) occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type) (templated, real-valued) occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK) (non-templated, real-valued) ??? (non-templated, count-valued)

57 More on Feature Functions Feature functions help extract useful features (characteristics) of the data Generally templated Often binary-valued (0 or 1), but can be real-valued occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise (binary) occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type) (templated, real-valued) occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK) (non-templated, real-valued) occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK) (non-templated, count-valued)
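
A small Python sketch of a templated binary feature function like occurs_{target,type} above; representing the document as a set of phrases is my simplification:

def make_occurs_feature(target, event_type):
    """Template: returns a binary feature that fires when the target phrase
    appears in the document and the candidate label matches the event type."""
    def occurs(doc_phrases, label):
        return 1.0 if (target in doc_phrases and label == event_type) else 0.0
    return occurs

f1 = make_occurs_feature("fatally shot", "ATTACK")
print(f1({"fatally shot", "seriously wounded"}, "ATTACK"))  # 1.0
print(f1({"fatally shot"}, "KIDNAPPING"))                   # 0.0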

58 Maxent Modeling Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p(ATTACK | doc) = (1/Z) exp(weight_1 · applies_1(fatally shot, ATTACK) + weight_2 · applies_2(seriously wounded, ATTACK) + weight_3 · applies_3(Shining Path, ATTACK) + …) Q: How do we define Z?

59 Normalization for Classification Z = Σ over labels x' of exp(weight_1 · occurs_1(fatally shot, x') + weight_2 · occurs_2(seriously wounded, x') + weight_3 · occurs_3(Shining Path, x') + …), so that p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
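
Putting the weights, feature functions, and normalizer together, a minimal Python sketch in the slide's notation (x = label, y = document; the names are mine):

import math

def maxent_posterior(weights, features, doc, labels):
    """p(x | y) = exp(theta . f(x, y)) / Z, with Z summing over all labels x'."""
    scores = {x: math.exp(sum(w * f(doc, x) for w, f in zip(weights, features)))
              for x in labels}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}

# the label with the highest posterior classifies the document in one go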

60 Normalization for Language Model general class-based (X) language model of doc y

61 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case

62 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case Simplifying assumption: maxent n-grams!

63 Understanding Conditioning Is this a good language model?

64 Understanding Conditioning Is this a good language model?

65 Understanding Conditioning Is this a good language model? (no)

66 Understanding Conditioning Is this a good posterior classifier? (no)

67 Lesson 11

68 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

69 p_θ(x | y): probabilistic model; objective (given observations)

70 Objective = Full Likelihood? These values can have very small magnitude (underflow) Differentiating this product could be a pain

71 Logarithms map (0, 1] to (-∞, 0] Products become Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b) Inverse of exp: log(exp(x)) = x

72 Log-Likelihood Wide range of (negative) numbers Sums are more stable Products become Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b) Differentiating this becomes nicer (even though Z depends on θ)

73 Log-Likelihood Wide range of (negative) numbers Sums are more stable Inverse of exp: log(exp(x)) = x Differentiating this becomes nicer (even though Z depends on θ)

74 Log-Likelihood Wide range of (negative) numbers Sums are more stable; call this objective F(θ)
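
Since the slide's formula is an image, here is a standard reconstruction of the log-likelihood objective F(θ) for the model p_θ(x | y) ∝ exp(θ · f(x, y)) defined above:

F(\theta) = \sum_i \log p_\theta(x_i \mid y_i)
          = \sum_i \left[ \theta \cdot f(x_i, y_i) - \log \sum_{x'} \exp\big(\theta \cdot f(x', y_i)\big) \right]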

75 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

76 How will we optimize F(θ)? Calculus

77 [plot: the objective F(θ) as a function of θ]

78 [plot: F(θ) as a function of θ, with its maximizer θ*]

79 [plot: F(θ), its maximizer θ*, and the derivative F′(θ) (derivative of F wrt θ)]

80 Example F(x) = -(x-2)², differentiate: F′(x) = -2x + 4, solve F′(x) = 0: x = 2

81 Common Derivative Rules

82 What if you can't find the roots? Follow the derivative [plot: F(θ), its derivative F′(θ), and the maximizer θ*]

83 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) [plot: starting point (θ_0, y_0)]

84 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) [plot: derivative g_0 at θ_0]

85 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: step from θ_0 to θ_1]

86 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2]

87 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2 → θ_3]

88 What if you can't find the roots? Follow the derivative Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 [plot: steps θ_0 → θ_1 → θ_2 → θ_3, approaching θ*]
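
The loop above transcribes directly into code; a minimal Python sketch (a fixed step size rho stands in for the scaling factor ρ_t, and the convergence test is mine):

def gradient_ascent(F, dF, theta0, rho=0.1, tol=1e-6, max_iters=1000):
    """Follow the derivative uphill until the value stops improving."""
    theta, prev = theta0, float("-inf")
    for t in range(max_iters):
        y = F(theta)                # 1. get value
        if abs(y - prev) < tol:     # converged
            break
        g = dF(theta)               # 2. get derivative
        theta = theta + rho * g     # 3-4. scale the derivative and step
        prev = y
    return theta

# Example from the earlier slide: F(x) = -(x - 2)^2 has its maximum at x = 2
print(gradient_ascent(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0))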

89 Gradient = Multi-variable derivative K-dimensional input K-dimensional output

90-95 Gradient Ascent [sequence of plots: successive gradient-ascent steps climbing a multivariate objective]

96 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

97 Expectations: number of pieces of candy, outcomes 1 through 6 each with probability 1/6: E = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

98 Expectations: number of pieces of candy, outcomes 1 through 6 with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10: E = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5
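
The same computation in a short Python sketch, reusing the values and probabilities from the candy example:

def expectation(values, probs):
    """E[X] = sum over outcomes of value * probability."""
    return sum(v * p for v, p in zip(values, probs))

pieces = [1, 2, 3, 4, 5, 6]
print(expectation(pieces, [1/6] * 6))                            # 3.5 (uniform)
print(expectation(pieces, [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]))  # 2.5 (skewed)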

101 Log-Likelihood Wide range of (negative) numbers Sums are more stable = F(θ) Differentiating this becomes nicer (even though Z depends on θ)

102 Log-Likelihood Gradient Each component k is the difference between:

103 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data

104 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data and the total value the current model p_θ thinks it computes for feature f_k (its expected value, summing over all labels x' for each example y_i)
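
A minimal Python sketch of that gradient (observed feature totals minus model-expected totals); it assumes the maxent_posterior function sketched earlier and training data given as (label x_i, doc y_i) pairs:

def loglik_gradient(weights, features, data, labels):
    """grad_k = sum_i f_k(x_i, y_i) - sum_i sum_x' p_theta(x' | y_i) f_k(x', y_i)."""
    grad = [0.0] * len(features)
    for x_i, y_i in data:
        posterior = maxent_posterior(weights, features, y_i, labels)
        for k, f in enumerate(features):
            observed = f(y_i, x_i)                                    # value in the data
            expected = sum(posterior[x] * f(y_i, x) for x in labels)  # value the model expects
            grad[k] += observed - expected
    return grad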

105 Lesson 6

106 Log-Likelihood Gradient Derivation

107 Log-Likelihood Gradient Derivation

108 Log-Likelihood Gradient Derivation use the (calculus) chain rule to differentiate log g(h(θ)), where g(h(θ)) is the scalar p(x | y_i) and h(θ) is a vector of functions

109 Log-Likelihood Gradient Derivation Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

110 Gradient Optimization Set t = 0, pick a starting value θ_t Until converged: 1. Get value y_t = F(θ_t) 2. Get derivative g_t = F′(θ_t) 3. Get scaling factor ρ_t 4. Set θ_{t+1} = θ_t + ρ_t·g_t 5. Set t += 1 with ∂F/∂θ_k = Σ_i f_k(x_i, y_i) - Σ_i Σ_{x'} f_k(x', y_i) · p_θ(x' | y_i)

111 Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

112 Preventing Extreme Values Naïve Bayes Extreme values are 0 probabilities

113 Preventing Extreme Values Naïve Bayes Log-linear models Extreme values are 0 probabilities Extreme values are large θ values

114 Preventing Extreme Values Naïve Bayes Log-linear models Extreme values are 0 probabilities Extreme values are large θ values regularization

115 (Squared) L2 Regularization
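
The slide's formula is an image; a standard form of the squared-L2-regularized objective, which is presumably what it shows, subtracts a penalty on large weights:

F_{\mathrm{reg}}(\theta) = F(\theta) - \frac{\lambda}{2} \lVert \theta \rVert_2^2,
\qquad
\frac{\partial F_{\mathrm{reg}}}{\partial \theta_k} = \frac{\partial F}{\partial \theta_k} - \lambda \theta_k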

116 Lesson 8

117 Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words Naïve assumption Training & performance NB as a language model Maximum Entropy classifiers Defining the model Defining the objective Learning: Optimizing the objective Math: gradient derivation Neural (language) models

118 Revisiting the SNAP Function softmax

119 Revisiting the SNAP Function softmax
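
Written out, the SNAP-then-normalize operation the earlier slides built up is the softmax over scores z_1, …, z_K:

\operatorname{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}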

120 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} predict the next word w_i

121 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i) predict the next word w_i

122 N-gram Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i) predict the next word w_i
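
A minimal Python sketch of the count-based estimate above (maximum likelihood, no smoothing; the names are mine):

from collections import Counter

def train_ngram(tokens, n=4):
    """Count n-grams and their (n-1)-gram histories, e.g. n=4 for w_{i-3}..w_i."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # count only histories that have a continuation, so the estimates normalize
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return ngrams, histories

def ngram_prob(ngrams, histories, history, word):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = count(history, w_i) / count(history)."""
    h = tuple(history)
    return ngrams[h + (word,)] / histories[h] if histories[h] else 0.0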

123 Maxent Language Models given some context w_{i-3} w_{i-2} w_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)) predict the next word w_i

124 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} can we learn the feature function(s)? compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)) predict the next word w_i

125 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)? compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

126 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

127 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

128 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] learn the output weights θ_{w_i} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i

129 Neural Language Models given some context w_{i-3} w_{i-2} w_{i-1} create/use distributed representations e_w: e_{i-3}, e_{i-2}, e_{i-1} combine these representations with a matrix-vector product: f = C·[e_{i-3}; e_{i-2}; e_{i-1}] learn the output weights θ_{w_i} compute beliefs about what is likely: p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})) predict the next word w_i
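
A minimal numpy sketch of this kind of feedforward neural language model (embedding lookup, concatenation, one hidden combination, softmax over the vocabulary); the dimensions and names are illustrative and training is not shown:

import numpy as np

rng = np.random.default_rng(0)
V, d, h, context = 10000, 50, 100, 3               # vocab size, embed dim, hidden dim, n-1

E = rng.normal(scale=0.1, size=(V, d))              # word embeddings e_w
C = rng.normal(scale=0.1, size=(h, context * d))    # combines the context embeddings
theta = rng.normal(scale=0.1, size=(V, h))          # per-word output weights theta_{w_i}

def softmax(z):
    z = z - z.max()            # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(context_ids):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(theta . f(context))."""
    f = np.tanh(C @ np.concatenate([E[i] for i in context_ids]))  # f(w_{i-3..i-1})
    return softmax(theta @ f)

p = next_word_distribution([17, 42, 7])              # arbitrary context word ids
print(p.shape, p.sum())                              # (10000,) 1.0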

130 A Neural Probabilistic Language Model, Bengio et al. (2003) Baselines [table: LM name (interpolation; Kneser-Ney backoff; class-based backoff with varying numbers of classes, e.g. 500), n-gram order, number of parameters, test perplexity]

131 A Neural Probabilistic Language Model, Bengio et al. (2003) Baselines [table as above] NPLM results [table: n-gram order, word vector dimension, hidden dimension, whether mixed with a non-neural LM, test perplexity; one configuration mixed with a non-neural LM reaches perplexity 252]

132 A Neural Probabilistic Language Model, Bengio et al. (2003) [tables as above] "we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)
