Naïve Bayes, Maxent and Neural Models
1 Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP
2 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
3 Probabilistic Classification Directly model the posterior: discriminatively trained classifier. Model the posterior with Bayes rule: generatively trained classifier. Decoding the posterior: maximum a posteriori (MAP) classification; noisy channel model decoding
4 Posterior Decoding: Probabilistic Text Classification Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. Bayes rule: p(class | observed data) = p(data | class) p(class) / p(data), where p(data | class) is the class-based likelihood (a language model), p(class) is the prior probability of the class, and p(data) is the observation likelihood (averaged over all classes)
5 Noisy Channel Model Decode, then rerank: what I want to tell you (sports) vs. what you actually see ("The O's lost again"); hypothesized intents (sad stories? sports?) are reweighted according to what's likely (sports)
6 Noisy Channel Machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, image captioning: choose the possible (clean) output that maximizes the (clean) language model probability times the observation (noisy) likelihood of the observed (noisy) text under the translation/decode model
7 Use Logarithms
8 Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected

                       Actually Correct      Actually Incorrect
Selected/guessed       True Positive (TP)    False Positive (FP)
Not selected/guessed   False Negative (FN)   True Negative (TN)
9 A combined measure: F, the weighted harmonic mean of Precision & Recall. Balanced F1 measure (β = 1): F1 = 2PR / (P + R)
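These measures can be sketched in a few lines of Python; the TP/FP/FN counts below are invented for illustration:

```python
# Precision, recall, and balanced F1 from confusion-matrix counts.
# A minimal sketch; the counts passed in below are made up.

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and balanced F1 (beta = 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Note how a classifier can have high precision (0.8 here) while missing half the correct items (recall 0.5); F1 balances the two.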
10 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
11 The Bag of Words Representation
12 The Bag of Words Representation
13 The Bag of Words Representation 13
14 Bag of Words Representation A document becomes a vector of word counts (e.g., seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, ...), and the classifier γ maps that vector to a class: γ(counts) = c
15 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training?
16 Naïve Bayes Classifier label text Start with Bayes Rule Q: Are we doing discriminative training or generative training? A: generative training
17 Naïve Bayes Classifier (label, each word/token): adopt a naïve bag-of-words representation
18 Naïve Bayes Classifier: assume position doesn't matter
19 Naïve Bayes Classifier: assume the feature probabilities are independent given the class, P(w_1, ..., w_n | c) = ∏_i P(w_i | c)
20 Multinomial Naïve Bayes: Learning From training corpus, extract Vocabulary
21 Multinomial Naïve Bayes: Learning From the training corpus, extract Vocabulary. Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j
22 Brill and Banko (2001) With enough data, the classifier may not matter
23 Multinomial Naïve Bayes: Learning From the training corpus, extract Vocabulary. Calculate P(c_j) terms: for each c_j in C, docs_j = all docs with class = c_j, and P(c_j) = |docs_j| / (total # of docs). Calculate P(w_k | c_j) terms: Text_j = a single doc containing all of docs_j; for each word w_k in Vocabulary, n_k = # of occurrences of w_k in Text_j, and P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|), where n = total # of words in Text_j: the (add-1-smoothed) class unigram LM
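The learning procedure above can be sketched as follows; a minimal illustration assuming add-1 smoothing for the class unigram LM, with an invented two-document corpus:

```python
# Multinomial Naive Bayes training: class priors P(c_j) and add-1-smoothed
# word likelihoods P(w_k | c_j). The toy corpus below is invented.
from collections import Counter

def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    classes = {c for _, c in docs}
    vocab = {w for words, _ in docs for w in words}
    prior, likelihood = {}, {}
    for c in classes:
        class_docs = [words for words, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)            # P(c_j)
        counts = Counter(w for words in class_docs for w in words)
        n = sum(counts.values())                          # words in Text_j
        likelihood[c] = {w: (counts[w] + 1) / (n + len(vocab))  # P(w_k | c_j)
                         for w in vocab}
    return prior, likelihood

docs = [(["fun", "film"], "pos"), (["sad", "film"], "neg")]
prior, likelihood = train_nb(docs)
```

Smoothing guarantees every vocabulary word gets nonzero probability in every class, and each class's unigram LM still sums to 1 over the vocabulary.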
24 Naïve Bayes and Language Modeling Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling
25 Naïve Bayes as a Language Model Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1; Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1
26-27 Which class assigns the higher probability to s = "I love this fun film"?
28 P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5e-7 > P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1e-9
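The comparison can be reproduced numerically; a sketch in log space, where the negative model's partially garbled entries for "love" (0.001) and "fun" (0.005) are inferred to be consistent with the stated 1e-9 result:

```python
# Score a sentence under each class's unigram LM (Naive Bayes with word
# features only). Probabilities follow the slide; the negative model's
# values for "love" and "fun" are inferred from the 1e-9 result.
import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def log_score(model, words):
    # Sum of log-probabilities: log P(s | class)
    return sum(math.log(model[w]) for w in words)

s = "I love this fun film".split()
```

Working in log space (per the "Use Logarithms" slide) avoids underflow while preserving the comparison: P(s | pos) > P(s | neg).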
29 Summary: Naïve Bayes is Not So Naïve Very Fast, low storage requirements Robust to Irrelevant Features Very good in domains with many equally important features Optimal if the independence assumptions hold Dependable baseline for text classification (but often not the best)
30 But: Naïve Bayes Isn't Without Issue Model the posterior in one go? Are the features really uncorrelated? Are plain counts always appropriate? Are there better ways of handling missing/noisy data? (automated, more principled)
31 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
32 Connections to Other Techniques Log-Linear Models are known, depending on the framing (as statistical regression, as based in information theory, or just to be cool today :), as: (Multinomial) logistic regression; Softmax regression; Maximum Entropy models (MaxEnt); Generalized Linear Models; Discriminative Naïve Bayes; very shallow (sigmoidal) neural nets
33 Maxent Models for Classification: Discriminatively or Generatively Trained Directly model the posterior Discriminatively trained classifier Model the posterior with Bayes rule Generatively trained classifier
34 Maximum Entropy (Log-linear) Models discriminatively trained: classify in one go
35 Maximum Entropy (Log-linear) Models generatively trained: learn to model language
36 Document Classification Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. ATTACK Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. p( ) ATTACK
37 Document Classification (same document as above, labeled ATTACK, with slots to fill: # killed, Type, Perp)
43 We need to score the different combinations.
44 Score and Combine Our Possibilities score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), ..., score_k(department, ATTACK): are all of these uncorrelated? COMBINE them into the posterior probability of ATTACK
45 Score and Combine Our Possibilities score 1 (fatally shot, ATTACK) score 2 (seriously wounded, ATTACK) score 3 (Shining Path, ATTACK) COMBINE posterior probability of ATTACK Q: What are the score and combine functions for Naïve Bayes?
46 Scoring Our Possibilities score(doc, ATTACK) = combination of score_1(fatally shot, ATTACK), score_2(seriously wounded, ATTACK), score_3(Shining Path, ATTACK), ...
47 Lesson 1
48 Maxent Modeling Model p(ATTACK | doc) as SNAP(score(doc, ATTACK)), for some function SNAP that turns the score into a probability
49 What function operates on any real number? is never less than 0?
50 What function operates on any real number? is never less than 0? f(x) = exp(x)
51 Maxent Modeling Use exp as the SNAP: p(ATTACK | doc) ∝ exp(score(doc, ATTACK))
52-53 Maxent Modeling p(ATTACK | doc) ∝ exp(score_1(fatally shot, ATTACK) + score_2(seriously wounded, ATTACK) + score_3(Shining Path, ATTACK) + ...). Learn the scores (but we'll declare what combinations should be looked at)
54 Maxent Modeling p(ATTACK | doc) ∝ exp(weight_1 * occurs_1(fatally shot, ATTACK) + weight_2 * occurs_2(seriously wounded, ATTACK) + weight_3 * occurs_3(Shining Path, ATTACK) + ...)
55 Maxent Modeling: Feature Functions Feature functions help extract useful features (characteristics) of the data. Generally templated; often binary-valued (0 or 1), but can be real-valued. Example (templated, binary): occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
56-57 More on Feature Functions Feature functions help extract useful features (characteristics) of the data; generally templated; often binary-valued (0 or 1), but can be real-valued.
Templated, binary: occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, 0 otherwise
Templated, real-valued: occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)
Non-templated, real-valued: occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)
Non-templated, count-valued: occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)
58 Maxent Modeling p(ATTACK | doc) = (1/Z) exp(weight_1 * applies_1(fatally shot, ATTACK) + weight_2 * applies_2(seriously wounded, ATTACK) + weight_3 * applies_3(Shining Path, ATTACK) + ...). Q: How do we define Z?
59 Normalization for Classification Z = Σ over labels x' of exp(weight_1 * occurs_1(fatally shot, x') + weight_2 * occurs_2(seriously wounded, x') + weight_3 * occurs_3(Shining Path, x') + ...); compactly, p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
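A minimal sketch of classification in one go, applied to the running ATTACK example; the phrase features, the second label, and the weights are all invented for illustration:

```python
# Maxent classification: p(x | y) = exp(theta . f(x, y)) / Z(y), where
# Z(y) sums over all labels x. Features, labels, and weights are invented.
import math

LABELS = ["ATTACK", "OTHER"]

def f(x, y):
    # Binary feature functions: phrase occurs in doc y, paired with label x.
    phrases = ["fatally shot", "seriously wounded", "Shining Path"]
    return [1.0 if (p in y and x == "ATTACK") else 0.0 for p in phrases]

theta = [1.5, 0.8, 2.0]  # one weight per feature (made-up values)

def posterior(y):
    scores = {x: math.exp(sum(t * fi for t, fi in zip(theta, f(x, y))))
              for x in LABELS}
    Z = sum(scores.values())          # normalize over labels
    return {x: s / Z for x, s in scores.items()}

doc = "Three people have been fatally shot in a Shining Path attack"
p = posterior(doc)
```

Only two of the three phrase features fire on this document, so p(ATTACK | doc) = exp(1.5 + 2.0) / (exp(3.5) + exp(0)).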
60 Normalization for Language Model general class-based (X) language model of doc y
61 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case
62 Normalization for Language Model general class-based (X) language model of doc y Can be significantly harder in the general case Simplifying assumption: maxent n-grams!
63 Understanding Conditioning Is this a good language model?
65 Understanding Conditioning Is this a good language model? (no)
66 Understanding Conditioning Is this a good posterior classifier? (no)
67 Lesson 11
68 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
69 p_θ(x | y): the probabilistic model; next, an objective (given observations)
70 Objective = Full Likelihood? ∏_i p_θ(x_i | y_i): these values can have very small magnitude (underflow), and differentiating this product could be a pain
71 Logarithms map (0, 1] to (-∞, 0]. Products become sums: log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b). Inverse of exp: log(exp(x)) = x
72-73 Log-Likelihood Σ_i log p_θ(x_i | y_i): a wide range of (negative) numbers, but sums are more stable than products, using log(ab) = log(a) + log(b), log(a/b) = log(a) - log(b), and log(exp(x)) = x. Differentiating this becomes nicer (even though Z depends on θ)
74 Call this objective F(θ)
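A quick illustration of why the log matters numerically; the token probabilities below are invented:

```python
# Why log-likelihood instead of likelihood: a product of many small
# probabilities underflows to 0.0, while the sum of logs stays usable.
import math

probs = [1e-5] * 100            # 100 tokens, each with probability 1e-5

product = 1.0
for p in probs:
    product *= p                # 1e-500 is far below float range: underflow

log_likelihood = sum(math.log(p) for p in probs)   # fine: 100 * log(1e-5)
```

The product silently becomes 0.0 (and its log would be -inf), while the log-space sum is an ordinary finite number near -1151.3.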
75 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
76 How will we optimize F(θ)? Calculus
77-79 [Plots: F(θ) against θ; the maximizer θ*; and F'(θ), the derivative of F wrt θ]
80 Example F(x) = -(x - 2)^2. Differentiate: F'(x) = -2x + 4. Solve F'(x) = 0: x = 2
81 Common Derivative Rules
82-88 What if you can't find the roots? Follow the derivative. Set t = 0 and pick a starting value θ_t. Until converged: 1. get value y_t = F(θ_t); 2. get derivative g_t = F'(θ_t); 3. get scaling factor ρ_t; 4. set θ_{t+1} = θ_t + ρ_t * g_t; 5. set t += 1. [Plots: iterates θ_0, θ_1, θ_2, θ_3 climbing F(θ) toward the maximizer θ*]
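The follow-the-derivative loop can be run on the earlier example F(x) = -(x-2)^2, whose maximum is at x = 2; a sketch with a fixed, hand-picked scaling factor:

```python
# Follow the derivative: theta_{t+1} = theta_t + rho_t * F'(theta_t),
# on the slide's example F(x) = -(x-2)^2 with maximum at x = 2.
# rho is a fixed step size here, chosen by hand for illustration.

def F(x):
    return -(x - 2) ** 2

def F_prime(x):
    return -2 * x + 4

theta = 0.0          # starting value theta_0
rho = 0.1            # fixed scaling factor
for t in range(100):
    g = F_prime(theta)          # g_t = F'(theta_t)
    theta = theta + rho * g     # climb in the direction of the derivative
```

With this step size each update shrinks the distance to the optimum by a factor of 0.8, so 100 iterations land essentially on θ* = 2.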
89 Gradient = Multi-variable derivative K-dimensional input K-dimensional output
90 Gradient Ascent
96 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
97 Expectations (number of pieces of candy, fair die): 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5
98 Expectations (number of pieces of candy, loaded die): 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
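The two dice expectations above, computed directly as value times probability:

```python
# Expected value of "number of pieces of candy": sum of value * probability.
# Fair die vs. the slide's loaded die (P(1) = 1/2, all other faces 1/10).

def expectation(pmf):
    """pmf: dict mapping value -> probability."""
    return sum(value * prob for value, prob in pmf.items())

fair = {v: 1 / 6 for v in range(1, 7)}
loaded = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}
```

Shifting probability mass toward 1 drags the expectation from 3.5 down to 2.5, even though the possible values are unchanged.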
101 Log-Likelihood (recap) Wide range of (negative) numbers; sums are more stable; this objective = F(θ). Differentiating it becomes nicer (even though Z depends on θ)
102 Log-Likelihood Gradient Each component k is the difference between:
103 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f k in the training data
104 Log-Likelihood Gradient Each component k is the difference between: the total value of feature f_k in the training data, Σ_i f_k(x_i, y_i), and the total value the current model p_θ thinks it computes for feature f_k, Σ_i Σ_{x'} f_k(x', y_i) p_θ(x' | y_i)
105 Lesson 6
106 Log-Likelihood Gradient Derivation
108 Log-Likelihood Gradient Derivation Use the (calculus) chain rule: ∇_θ log g(h(θ)) = g'(h(θ)) / g(h(θ)) * ∇_θ h(θ), where g(h(θ)) is a scalar (here p(x | y_i)) and ∇_θ h(θ) is a vector of functions
109 Log-Likelihood Gradient Derivation Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?
110 Gradient Optimization Set t = 0; pick a starting value θ_t. Until converged: 1. get value y_t = F(θ_t); 2. get gradient g_t = F'(θ_t); 3. get scaling factor ρ_t; 4. set θ_{t+1} = θ_t + ρ_t * g_t; 5. set t += 1. Here ∂F/∂θ_k = Σ_i f_k(x_i, y_i) - Σ_i Σ_y f_k(x_i, y) p_θ(y | x_i)
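The observed-minus-expected gradient can be sketched on an invented two-feature, two-label toy problem (all features, weights, and data below are made up for illustration):

```python
# Maxent log-likelihood gradient: for each component k,
#   grad_k = sum_i [ f_k(x_i, y_i) - sum_y f_k(x_i, y) p_theta(y | x_i) ],
# i.e., observed feature total minus the model's expected feature total.
import math

LABELS = [0, 1]

def feats(x, y):
    # Two toy feature functions of input x (a number) paired with label y.
    return [x if y == 1 else 0.0, 1.0 if y == 1 else 0.0]

def posterior(theta, x):
    scores = [math.exp(sum(t * f for t, f in zip(theta, feats(x, y))))
              for y in LABELS]
    Z = sum(scores)
    return [s / Z for s in scores]

def log_likelihood(theta, data):
    return sum(math.log(posterior(theta, x)[y]) for x, y in data)

def gradient(theta, data):
    grad = [0.0, 0.0]
    for x, y in data:
        p = posterior(theta, x)
        observed = feats(x, y)
        expected = [sum(p[yy] * feats(x, yy)[k] for yy in LABELS)
                    for k in range(2)]
        for k in range(2):
            grad[k] += observed[k] - expected[k]
    return grad

data = [(1.0, 1), (2.0, 0), (0.5, 1)]
theta = [0.3, -0.2]
g = gradient(theta, data)
```

Because the gradient is exactly the derivative of the log-likelihood, it can be checked against a finite-difference slope, which is a useful sanity test when implementing this by hand.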
112 Preventing Extreme Values Naïve Bayes Extreme values are 0 probabilities
113-114 Preventing Extreme Values Naïve Bayes: extreme values are 0 probabilities. Log-linear models: extreme values are large θ values; prevent them with regularization
115 (Squared) L2 Regularization Add a penalty -(λ/2) ||θ||²₂ to the objective F(θ), discouraging large weights
116 Lesson 8
117 Outline Recap: classification (MAP vs. noisy channel) & evaluation. Naïve Bayes (NB) classification: terminology (bag-of-words), the naïve assumption, training & performance, NB as a language model. Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation). Neural (language) models
118 Revisiting the SNAP Function: softmax
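Softmax as the SNAP function, with the standard max-subtraction trick added for numerical stability (the input scores below are arbitrary):

```python
# Softmax: exponentiate scores and normalize, turning any real-valued
# scores into a probability distribution. Subtracting the max first is
# the standard trick to avoid overflow; it does not change the result.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

probs = softmax([2.0, 1.0, -1.0])
```

Without the max subtraction, a score like 1000 would overflow math.exp; with it, softmax([1000.0, 0.0]) evaluates cleanly.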
120 N-gram Language Models given some context w i-3 w i-2 w i-1 predict the next word w i
121 N-gram Language Models Given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i
123 Maxent Language Models Given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i
124 Neural Language Models Can we learn the feature function(s)? p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
125 Neural Language Models Can we learn the feature function(s) for just the context, and word-specific weights (by type)? p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
126-129 Neural Language Models Given some context w_{i-3} w_{i-2} w_{i-1}: create/use distributed representations e_{i-3}, e_{i-2}, e_{i-1}; combine these representations with a matrix-vector product, C [e_{i-3}; e_{i-2}; e_{i-1}] = f; compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(θ_{w_i} · f); predict the next word w_i
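A forward pass matching this picture can be sketched in plain Python; all sizes and (random) weights below are invented, and tanh stands in for the hidden nonlinearity:

```python
# Forward pass of a Bengio-style neural LM sketch: look up embeddings for
# the context words, concatenate them, combine with a matrix-vector
# product and a nonlinearity to get f, then softmax theta_{w_i} . f over
# the vocabulary. Dimensions and weights are arbitrary toy values.
import math
import random

random.seed(0)
V, d, h, n = 10, 4, 8, 3   # vocab size, embedding dim, hidden dim, context len

def mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = mat(V, d)        # word embeddings e_w, one row per vocabulary word
C = mat(h, n * d)    # combines the concatenated context embeddings
theta = mat(V, h)    # per-word output weights theta_{w_i}

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def next_word_probs(context_ids):
    e = [x for i in context_ids for x in E[i]]    # concat e_{i-3}, e_{i-2}, e_{i-1}
    f = [math.tanh(z) for z in matvec(C, e)]      # learned feature function f
    scores = matvec(theta, f)                     # theta_{w_i} . f for every w_i
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # stable softmax over vocab
    Z = sum(exps)
    return [x / Z for x in exps]

p = next_word_probs([1, 5, 2])
```

Training would adjust E, C, and theta by gradient ascent on the log-likelihood, exactly as in the maxent case, except the feature function f is now itself learned.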
130-131 Baselines vs. NPLM (A Neural Probabilistic Language Model, Bengio et al. 2003) Baseline n-gram LMs (interpolation, Kneser-Ney backoff, and class-based backoff with up to 500 classes) are compared by n-gram order, parameter count, and test perplexity; NPLM variants differ in n-gram order, word vector dimension, hidden dimension, and whether they mix with a non-neural LM, with the best reported mixture reaching test perplexity 252
132 "we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationThe Naïve Bayes Classifier. Machine Learning Fall 2017
The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017 Methods to be Learnt Vector Data Set Data Sequence Data Text Data Classification Clustering
More informationStatistical NLP Spring A Discriminative Approach
Statistical NLP Spring 2008 Lecture 6: Classification Dan Klein UC Berkeley A Discriminative Approach View WSD as a discrimination task (regression, really) P(sense context:jail, context:county, context:feeding,
More informationSupport Vector Machines
Support Vector Machines Jordan Boyd-Graber University of Colorado Boulder LECTURE 7 Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Jordan Boyd-Graber Boulder Support Vector Machines 1 of
More informationnaive bayes document classification
naive bayes document classification October 31, 2018 naive bayes document classification 1 / 50 Overview 1 Text classification 2 Naive Bayes 3 NB theory 4 Evaluation of TC naive bayes document classification
More informationMining Classification Knowledge
Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationCSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9
CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification
More informationGenerative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul
Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far
More informationNeural Network Language Modeling
Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith Course Project Sign up your course project In-class presentation
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationIN FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning
1 IN4080 2018 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning 2 Logistic regression Lecture 8, 26 Sept Today 3 Recap: HMM-tagging Generative and discriminative classifiers Linear classifiers Logistic
More informationAlgorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationNatural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley
Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:
More informationBowl Maximum Entropy #4 By Ejay Weiss. Maxent Models: Maximum Entropy Foundations. By Yanju Chen. A Basic Comprehension with Derivations
Bowl Maximum Entropy #4 By Ejay Weiss Maxent Models: Maximum Entropy Foundations By Yanju Chen A Basic Comprehension with Derivations Outlines Generative vs. Discriminative Feature-Based Models Softmax
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationGenerative Classifiers: Part 1. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang
Generative Classifiers: Part 1 CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang 1 This Week Discriminative vs Generative Models Simple Model: Does the patient
More informationIntroduction to Machine Learning
Introduction to Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland SUPPORT VECTOR MACHINES Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Machine Learning: Jordan
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationNLP: N-Grams. Dan Garrette December 27, Predictive text (text messaging clients, search engines, etc)
NLP: N-Grams Dan Garrette dhg@cs.utexas.edu December 27, 2013 1 Language Modeling Tasks Language idenfication / Authorship identification Machine Translation Speech recognition Optical character recognition
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationLinear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt)
Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Nathan Schneider (some slides borrowed from Chris Dyer) ENLP 12 February 2018 23 Outline Words, probabilities Features,
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationInf2b Learning and Data
Inf2b Learning and Data Lecture 13: Review (Credit: Hiroshi Shimodaira Iain Murray and Steve Renals) Centre for Speech Technology Research (CSTR) School of Informatics University of Edinburgh http://www.inf.ed.ac.uk/teaching/courses/inf2b/
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationNeural Networks Language Models
Neural Networks Language Models Philipp Koehn 10 October 2017 N-Gram Backoff Language Model 1 Previously, we approximated... by applying the chain rule p(w ) = p(w 1, w 2,..., w n ) p(w ) = i p(w i w 1,...,
More informationConditional Language Modeling. Chris Dyer
Conditional Language Modeling Chris Dyer Unconditional LMs A language model assigns probabilities to sequences of words,. w =(w 1,w 2,...,w`) It is convenient to decompose this probability using the chain
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm
More informationNatural Language Processing (CSEP 517): Text Classification
Natural Language Processing (CSEP 517): Text Classification Noah Smith c 2017 University of Washington nasmith@cs.washington.edu April 10, 2017 1 / 71 To-Do List Online quiz: due Sunday Read: Jurafsky
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationCSC321 Lecture 7 Neural language models
CSC321 Lecture 7 Neural language models Roger Grosse and Nitish Srivastava February 1, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 7 Neural language models February 1, 2015 1 / 19 Overview We
More informationMLPR: Logistic Regression and Neural Networks
MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Outline 1 Logistic Regression 2 Multi-layer
More informationNeural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT. Slides credit: Graham Neubig
Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Slides credit: Graham Neubig Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation
More informationOutline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap.
Outline MLPR: and Neural Networks Machine Learning and Pattern Recognition 2 Amos Storkey Amos Storkey MLPR: and Neural Networks /28 Recap Amos Storkey MLPR: and Neural Networks 2/28 Which is the correct
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 5: Text classification (Feb 5, 2019) David Bamman, UC Berkeley Data Classification A mapping h from input data x (drawn from instance space X) to a
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve
More informationIntroduction to Machine Learning
Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationGenerative Models. CS4780/5780 Machine Learning Fall Thorsten Joachims Cornell University
Generative Models CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell, Chapter 6.9-6.10 Duda, Hart & Stork, Pages 20-39 Bayes decision rule Bayes theorem Generative
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationClassification & Information Theory Lecture #8
Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationMachine Learning Algorithm. Heejun Kim
Machine Learning Algorithm Heejun Kim June 12, 2018 Machine Learning Algorithms Machine Learning algorithm: a procedure in developing computer programs that improve their performance with experience. Types
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationLecture 2: Logistic Regression and Neural Networks
1/23 Lecture 2: and Neural Networks Pedro Savarese TTI 2018 2/23 Table of Contents 1 2 3 4 3/23 Naive Bayes Learn p(x, y) = p(y)p(x y) Training: Maximum Likelihood Estimation Issues? Why learn p(x, y)
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationA brief introduction to Conditional Random Fields
A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationNatural Language Processing. Statistical Inference: n-grams
Natural Language Processing Statistical Inference: n-grams Updated 3/2009 Statistical Inference Statistical Inference consists of taking some data (generated in accordance with some unknown probability
More informationData Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 4 of Data Mining by I. H. Witten, E. Frank and M. A.
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten, E. Frank and M. A. Hall Statistical modeling Opposite of R: use all the attributes Two assumptions:
More information
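The posterior-decoding and "use logarithms" slides above can be sketched in code. This is an illustrative example, not from the slides: the class priors, token likelihoods, and the toy document are all invented for the demonstration. Working in log space turns the product of per-word likelihoods in the bag-of-words Naïve Bayes model into a sum, which avoids floating-point underflow on long documents.

```python
import math

def nb_log_score(doc_tokens, log_prior, log_likelihood):
    """Score one class: log P(y) + sum over words of log P(w | y).

    This is the (unnormalized) log posterior under the naive
    bag-of-words assumption; the denominator P(x) is the same for
    every class, so it can be ignored when decoding.
    """
    return log_prior + sum(log_likelihood[w] for w in doc_tokens)

# Hypothetical two-class model with made-up parameters
log_prior = {"sports": math.log(0.5), "other": math.log(0.5)}
log_lik = {
    "sports": {"lost": math.log(0.2), "again": math.log(0.1)},
    "other": {"lost": math.log(0.05), "again": math.log(0.05)},
}

doc = ["lost", "again"]  # cf. the slide's "The Os lost again" example
best = max(log_prior, key=lambda y: nb_log_score(doc, log_prior[y], log_lik[y]))
```

Here `best` comes out as `"sports"`, matching the noisy-channel intuition on slide 5: the observed text is reweighted by what is likely under each hypothesized class.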
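The evaluation definitions on the last two slides (accuracy, precision, recall, and the weighted harmonic F-measure) can likewise be sketched directly from the TP/FP/FN counts in the slide's contingency table. The counts below are invented for illustration; the formulas follow the standard definitions, with `beta=1` giving the balanced F1 from slide 9.

```python
def precision(tp, fp):
    """Fraction of selected (guessed-positive) items that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actually-correct items that were selected."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p = precision(8, 2)   # 0.8
r = recall(8, 4)      # 2/3
f1 = f_beta(p, r)     # balanced F1 (beta = 1)
```

With these counts, F1 works out to about 0.727, sitting between precision and recall but closer to the lower of the two, which is the point of using a harmonic rather than arithmetic average.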