Natural Language Processing (CSEP 517): Text Classification

Slide 1: Natural Language Processing (CSEP 517): Text Classification
Noah Smith, (c) 2017 University of Washington
nasmith@cs.washington.edu
April 10, 2017

Slide 2: To-Do List
- Online quiz: due Sunday
- Read: Jurafsky and Martin (2016b); Collins (2011); Jurafsky and Martin (2016a)
- A2 due April 23 (Sunday)
- Final exam will be on Canvas (posted 5/29, due 6/4)

Slide 3: Text Classification
Input: a piece of text $x \in \mathcal{V}^\dagger$, usually a document (r.v. X)
Output: a label from a finite set $\mathcal{L}$ (r.v. L)
Standard line of attack:
1. Human experts label some data.
2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier, $\textit{classify} : \mathcal{V}^\dagger \rightarrow \mathcal{L}$.
3. Apply classify to as much data as you want!
Note: we assume the texts are segmented already, even the new ones.

Slide 4: Text Classification: Examples
- Library-like subjects (e.g., the Dewey decimal system)
- News stories: politics vs. sports vs. business vs. technology ...
- Reviews of films, restaurants, products: positive vs. negative
- Author attributes: identity, political stance, gender, age, ...
- Email, arXiv submissions, etc.: spam vs. not
- What is the reading level of a piece of text?
- How influential will a scientific paper be?
- Will a piece of proposed legislation pass?
Closely related: relevance to a query.

Slide 5: Evaluation
Accuracy:
$$A(\textit{classify}) = p(\textit{classify}(X) = L) = \sum_{x \in \mathcal{V}^\dagger,\, l \in \mathcal{L}} p(X = x, L = l)\, \mathbf{1}\{\textit{classify}(x) = l\}$$
where p is the true distribution over data. Error is 1 - A.
This is estimated using a test dataset $\langle \bar{x}_1, \bar{l}_1 \rangle, \ldots, \langle \bar{x}_m, \bar{l}_m \rangle$:
$$\hat{A}(\textit{classify}) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{\textit{classify}(\bar{x}_i) = \bar{l}_i\}$$
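As a small illustration of the estimate Â, here is a minimal Python sketch; the toy `gold` and `predicted` lists are hypothetical stand-ins for the test labels and the classifier's outputs:

```python
# Estimate accuracy A-hat on a labeled test set: the fraction of test
# instances whose predicted label matches the gold label.
def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Toy example (hypothetical labels):
gold      = ["pos", "neg", "pos", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg"]
print(accuracy(gold, predicted))  # 0.8
```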

Slides 6-9: Issues with Test-Set Accuracy
- Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam".
- Relative importance of classes or cost of error types.
- Variance due to the test data.

Slide 10: Evaluation in the Two-Class Case
Suppose we have two classes, and one of them, t ∈ L, is a target. E.g., given a query, find relevant documents.
Precision and recall encode the goals of returning a "pure" set of targeted instances and capturing all of them.
Let A be the set of instances actually in the target class (L = t), B the set believed to be in the target class (classify(x) = t), and C = A ∩ B the instances correctly labeled as t. [Figure: Venn diagram of A and B with intersection C.]
$$\hat{P}(\textit{classify}) = \frac{|A \cap B|}{|B|}, \qquad \hat{R}(\textit{classify}) = \frac{|A \cap B|}{|A|}, \qquad \hat{F}_1(\textit{classify}) = \frac{2\,\hat{P}\,\hat{R}}{\hat{P} + \hat{R}}$$
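A minimal sketch of P̂, R̂, and F̂1 computed directly from the counts |A|, |B|, and |C| above; the label name "t" and the toy lists are hypothetical:

```python
# Precision, recall, and F1 for a single target class, following the
# definitions |C|/|B| and |C|/|A| above.
def prf1(gold, predicted, target):
    A = sum(1 for g in gold if g == target)                           # actually in class t
    B = sum(1 for p in predicted if p == target)                      # predicted as t
    C = sum(1 for g, p in zip(gold, predicted) if g == p == target)   # correctly labeled t
    precision = C / B if B else 0.0
    recall = C / A if A else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold      = ["t", "t", "o", "o", "t"]
predicted = ["t", "o", "t", "o", "t"]
print(prf1(gold, predicted, "t"))  # (0.666..., 0.666..., 0.666...)
```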

Slide 11: Another View: Contingency Table

                    | L = t                    | L ≠ t                    | total
  classify(x) = t   | C (true positives)       | B \ C (false positives)  | B
  classify(x) ≠ t   | A \ C (false negatives)  | (true negatives)         |
  total             | A                        |                          |

Slide 12: Evaluation with > 2 Classes
- Macroaveraged precision and recall: let each class be the target and report the average P̂ and R̂ across all classes.
- Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, then calculate P̂ and R̂ from that.
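To make the distinction concrete, here is a minimal sketch computing macro- and micro-averaged precision (recall is analogous); the toy labels are hypothetical:

```python
def per_class_counts(gold, predicted, target):
    # One-vs.-rest contingency counts for a single target class.
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == target)
    fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    return tp, fp

def macro_micro_precision(gold, predicted):
    labels = sorted(set(gold))
    counts = [per_class_counts(gold, predicted, t) for t in labels]
    # Macroaverage: compute precision per class, then average the precisions.
    macro = sum(tp / (tp + fp) if tp + fp else 0.0 for tp, fp in counts) / len(labels)
    # Microaverage: pool the counts into one table, then compute precision once.
    tp_all = sum(tp for tp, _ in counts)
    fp_all = sum(fp for _, fp in counts)
    micro = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    return macro, micro

gold      = ["a", "a", "a", "b", "c"]
predicted = ["a", "a", "b", "b", "b"]
print(macro_micro_precision(gold, predicted))  # macro ~0.44, micro 0.6
```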

Slide 13: Cross-Validation
Remember that Â, P̂, R̂, and F̂1 are all estimates of the classifier's quality under the true data distribution. Estimates are noisy!
K-fold cross-validation:
- Partition the training set into K non-overlapping folds x_1, ..., x_K.
- For i ∈ {1, ..., K}:
  - Train on x_{1:n} \ x_i, using x_i as development data.
  - Estimate quality on the ith development set: Â_i.
- Report the average, Â = (1/K) Σ_{i=1}^{K} Â_i, and perhaps also the standard error.
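A minimal sketch of the K-fold loop; `train` and `evaluate` are hypothetical stand-ins for whatever learner and quality estimate (Â, F̂1, ...) you are using:

```python
import random
import statistics

def cross_validate(data, K, train, evaluate, seed=0):
    """data: list of (x, l) pairs; train: list of examples -> classifier;
    evaluate: (classifier, held-out fold) -> quality estimate (e.g., accuracy)."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::K] for i in range(K)]  # K non-overlapping folds
    scores = []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classifier = train(training)
        scores.append(evaluate(classifier, held_out))
    mean = statistics.mean(scores)
    # Standard error of the mean across folds.
    stderr = statistics.stdev(scores) / K ** 0.5 if K > 1 else 0.0
    return mean, stderr
```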

Slides 14-18: Statistical Significance
- Suppose we have two classifiers, classify_1 and classify_2. Is classify_1 better? The null hypothesis, denoted H_0, is that it isn't. But if Â_1 > Â_2, we are tempted to believe otherwise.
- How much larger must Â_1 be than Â_2 to reject H_0?
- Frequentist view: how (im)probable is the observed difference, given H_0 = true?
- Caution: statistical significance is neither necessary nor sufficient for research significance or practical usefulness!

Slide 19: A Hypothesis Test for Text Classifiers (McNemar, 1947)
1. The null hypothesis: A_1 = A_2.
2. Pick significance level α, an "acceptably high" probability of incorrectly rejecting H_0.
3. Calculate the test statistic, k (explained in the next slide).
4. Calculate the probability of a more extreme value of k, assuming H_0 is true; this is the p-value.
5. Reject the null hypothesis if the p-value is less than α.
The p-value is p(this observation | H_0 is true), not the other way around!

Slide 20: McNemar's Test: Details
Assumptions: independent (test) samples and binary measurements. Count test-set error patterns:

                           | classify_1 is incorrect | classify_1 is correct |
  classify_2 is incorrect  | c_00                    | c_10                  |
  classify_2 is correct    | c_01                    | c_11                  | row sum = m·Â_2
                           |                         | column sum = m·Â_1    |

If A_1 = A_2, then c_01 and c_10 are each distributed according to Binomial(c_01 + c_10, 1/2).
Test statistic: k = min{c_01, c_10}.
$$\text{p-value} = 2^{\,1 - (c_{01} + c_{10})} \sum_{j=0}^{k} \binom{c_{01} + c_{10}}{j}$$
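The exact p-value above is easy to compute directly with math.comb (Python 3.8+); a minimal sketch, where c01 and c10 are the discordant counts from the table (the cap at 1.0 is a practical addition, not part of the slide's formula):

```python
from math import comb

def mcnemar_exact_p(c01, c10):
    """Two-sided exact McNemar p-value:
    2^(1 - (c01 + c10)) * sum_{j=0}^{k} C(c01 + c10, j), with k = min(c01, c10)."""
    n = c01 + c10
    k = min(c01, c10)
    tail = sum(comb(n, j) for j in range(k + 1))
    return min(1.0, 2.0 * tail / 2 ** n)

# Example: classify_1 is right on 15 items that classify_2 misses (c10),
# while classify_2 is right on 5 items that classify_1 misses (c01).
print(mcnemar_exact_p(c01=5, c10=15))  # ~0.041
```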

Slide 21: Other Tests
- Different tests make different assumptions.
- Sometimes we calculate an interval that would be "unsurprising" under H_0 and test whether the test statistic falls in that interval (e.g., t-test and Wald test).
- In many cases, there is no closed form for estimating p-values, so we use random approximations (e.g., permutation test and paired bootstrap test).
- If you do lots of tests, you need to correct for that!
Read lots more in Smith (2011), appendix B.

Slides 22-26: Features in Text Classification
Running example: x = "The vodka was great, but don't touch the hamburgers."
A different representation of the text sequence r.v. X: feature r.v.s. For j ∈ {1, ..., d}, let F_j be a discrete random variable taking a value in F_j.
- Often, these are term (word and perhaps n-gram) frequencies. E.g., f_hamburgers(x) = 1, f_the(x) = 2, f_delicious(x) = 0, f_"don't touch"(x) = 1.
- Can also be word "presence" features. E.g., f_hamburgers(x) = 1, f_the(x) = 1, f_delicious(x) = 0, f_"don't touch"(x) = 1.
- Transformations on word frequencies: logarithm, idf weighting:
  $$\forall v \in \mathcal{V}, \quad \mathrm{idf}(v) = \log \frac{n}{|\{i : c_{x_i}(v) > 0\}|}$$
- Disjunctions of terms
- Clusters
- Task-specific lexicons
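A minimal sketch of the count, presence, and idf feature transformations described above; the whitespace tokenizer and the toy documents are hypothetical simplifications:

```python
import math
from collections import Counter

def count_features(tokens):
    # Term-frequency features: f_v(x) = number of times v occurs in x.
    return Counter(tokens)

def presence_features(tokens):
    # Binary presence features: f_v(x) = 1 if v occurs in x, else 0.
    return {v: 1 for v in set(tokens)}

def idf(corpus):
    # idf(v) = log(n / |{i : count_{x_i}(v) > 0}|), with n documents.
    n = len(corpus)
    df = Counter(v for tokens in corpus for v in set(tokens))
    return {v: math.log(n / df[v]) for v in df}

docs = [
    "the vodka was great , but do n't touch the hamburgers".split(),
    "the hamburgers were great".split(),
]
print(count_features(docs[0])["the"])      # 2
print(presence_features(docs[0])["the"])   # 1
print(idf(docs)["vodka"])                  # log(2/1)
```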

Slide 27: Probabilistic Classification
Classification rule:
$$\textit{classify}(\boldsymbol{f}) = \operatorname*{argmax}_{l \in \mathcal{L}} p(l \mid \boldsymbol{f}) = \operatorname*{argmax}_{l \in \mathcal{L}} \frac{p(l, \boldsymbol{f})}{p(\boldsymbol{f})} = \operatorname*{argmax}_{l \in \mathcal{L}} p(l, \boldsymbol{f})$$

Slide 28: Naïve Bayes Classifier
$$p(L = l, F_1 = f_1, \ldots, F_d = f_d) = p(l) \prod_{j=1}^{d} p(F_j = f_j \mid l) = \pi_l \prod_{j=1}^{d} \theta_{f_j \mid j, l}$$
Parameters:
- π is the class "prior" (it sums to one).
- For each feature function j and label l, a distribution over values θ_{·|j,l} (sums to one for every ⟨j, l⟩ pair).
The "bag of words" version of naïve Bayes sets F_j = X_j:
$$p(l, \boldsymbol{x}) = \pi_l \prod_{j=1}^{|\boldsymbol{x}|} \theta_{x_j \mid l}$$
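A minimal bag-of-words naïve Bayes sketch with add-1 (Laplace) smoothing for the θ parameters; the toy training set and all function names are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """examples: list of (tokens, label). Returns smoothed log pi and log theta."""
    vocab = {v for tokens, _ in examples for v in tokens}
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    log_pi = {l: math.log(c / len(examples)) for l, c in label_counts.items()}
    log_theta = {}
    for l in label_counts:
        total = sum(word_counts[l].values()) + alpha * len(vocab)
        log_theta[l] = {v: math.log((word_counts[l][v] + alpha) / total) for v in vocab}
    return log_pi, log_theta, vocab

def classify_nb(tokens, log_pi, log_theta, vocab):
    # argmax_l  log pi_l + sum_j log theta_{x_j | l}; unseen words are skipped.
    def score(l):
        return log_pi[l] + sum(log_theta[l][v] for v in tokens if v in vocab)
    return max(log_pi, key=score)

train = [("great vodka".split(), "pos"), ("do not touch the hamburgers".split(), "neg")]
log_pi, log_theta, vocab = train_nb(train)
print(classify_nb("the vodka was great".split(), log_pi, log_theta, vocab))  # pos
```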

Slides 29-32: Naïve Bayes: Remarks
- Estimation by (smoothed) relative frequency estimation: easy!
- For continuous or integer-valued features, use different distributions.
- The bag of words version equates to building a conditional language model for each label.
- The Collins reading assumes a binary version, with F_v indicating whether v ∈ V occurs in x.

Slide 33: Generative vs. Discriminative Classification
Naïve Bayes is the prototypical generative classifier. It describes a probabilistic process, a "generative story", for both X and L.
But why model X? It's always observed.
Discriminative models instead:
- seek to optimize a performance measure, like accuracy, or a computationally convenient surrogate;
- do not worry about p(X);
- tend to perform better when you have reasonable amounts of data.

Slide 34: Discriminative Text Classifiers
- Multinomial logistic regression (also known as "maxent" and "log-linear")
- Support vector machines
- Neural networks
- Decision trees
I'll briefly touch on three ways to train a classifier with a linear decision rule.

Slides 35-36: Linear Models for Classification
Linear decision rule:
$$\hat{l} = \operatorname*{argmax}_{l \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)$$
where $\boldsymbol{\phi} : \mathcal{V}^\dagger \times \mathcal{L} \rightarrow \mathbb{R}^d$. Parameters: $\mathbf{w} \in \mathbb{R}^d$.
What does this remind you of?
Some notational variants define:
- w_l for each l ∈ L
- $\boldsymbol{\phi} : \mathcal{V}^\dagger \rightarrow \mathbb{R}^d$ (similar to what we had for naïve Bayes)

Slides 37-41: The Geometric View of Linear Classifiers
Suppose we have instance x, L = {y_1, y_2, y_3, y_4}, and there are only two features, φ_1 and φ_2.
[Figure: the four points φ(x, y_1), ..., φ(x, y_4) plotted in the (φ_1, φ_2) plane.]
- As a simple example, let the two features be binary functions.
- The decision boundary is the line w · φ = w_1 φ_1 + w_2 φ_2 = 0.
- distance(w · φ = 0, φ_0) = (w · φ_0) / ||w||_2 ∝ w · φ_0
- In the figure, w · φ(x, y_1) > w · φ(x, y_3) > w · φ(x, y_4) > 0 > w · φ(x, y_2).

Slides 42-44: MLE for Multinomial Logistic Regression
When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be:
$$p(L = l \mid \boldsymbol{x}) = \frac{\exp\, \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)}{\sum_{l' \in \mathcal{L}} \exp\, \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l')}$$
MLE can be rewritten as a maximization problem:
$$\hat{\mathbf{w}} = \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^{n} \Bigg( \underbrace{\mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, l_i)}_{\text{hope}} - \underbrace{\log \sum_{l' \in \mathcal{L}} \exp\, \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, l')}_{\text{fear}} \Bigg)$$
Recall from language models:
- Be wise and regularize!
- Solve with batch or stochastic gradient methods.
- w_j has an interpretation.
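A minimal sketch of regularized MLE by stochastic gradient ascent, using the w_l-per-label notational variant and dict-based feature vectors; the function name, hyperparameters, and toy data are all hypothetical, and the L2 penalty is applied lazily only to active features:

```python
import math, random
from collections import defaultdict

def sgd_logistic_regression(data, labels, steps=2000, lr=0.1, l2=1e-3, seed=0):
    """data: list of (features, label), features a dict v -> value.
    Ascends sum_i [w.phi(x_i,l_i) - log sum_l' exp w.phi(x_i,l')] - (l2/2)||w||^2."""
    w = {l: defaultdict(float) for l in labels}
    rng = random.Random(seed)
    for _ in range(steps):
        feats, gold = rng.choice(data)
        scores = {l: sum(w[l][v] * x for v, x in feats.items()) for l in labels}
        z = max(scores.values())
        exps = {l: math.exp(s - z) for l, s in scores.items()}
        total = sum(exps.values())
        probs = {l: e / total for l, e in exps.items()}
        for l in labels:
            # Gradient of the log-likelihood ("hope" minus expected "fear"),
            # minus the L2 penalty on the touched weights.
            coeff = (1.0 if l == gold else 0.0) - probs[l]
            for v, x in feats.items():
                w[l][v] += lr * (coeff * x - l2 * w[l][v])
    return w

data = [({"great": 1, "vodka": 1}, "pos"), ({"touch": 1, "hamburgers": 1}, "neg")]
w = sgd_logistic_regression(data, labels=["pos", "neg"])
print(max(w, key=lambda l: w[l]["great"]))  # "pos" should weight "great" highest
```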

Slide 45: Log Loss for ⟨x, l⟩
Another view is to minimize the negated log-likelihood, which is known as "log loss":
$$\underbrace{\log \sum_{l' \in \mathcal{L}} \exp\, \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l')}_{\text{fear}} \; - \; \underbrace{\mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)}_{\text{hope}}$$
[Figure: in the binary case, loss plotted against the difference in scores between correct and incorrect labels; in blue is the log loss, in red is the zero-one loss (error).]

Slide 46: Log Sum Exp
[Figure: the y-axis plots the "log Σ exp" part of the objective function (with two labels) against x, assuming the other score is one of {-8, 0, 8}: log(e^x + e^{-8}), log(e^x + e^0), log(e^x + e^8).]

Slide 47: Hard Maximum
Why not use a hard max instead?
[Figure: max(x, -8), max(x, 0), max(x, 8).]

Slide 48: Hinge Loss for ⟨x, l⟩
$$\Big( \max_{l' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') \Big) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)$$
[Figure: in the binary case, loss against score difference; in purple is the hinge loss, in blue is the log loss, in red is the zero-one loss (error).]

Slides 49-51: Minimizing Hinge Loss: Perceptron
$$\underbrace{\Big( \max_{l' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') \Big)}_{\text{fear}} - \underbrace{\mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)}_{\text{hope}}$$
- When two labels are tied, the function is not differentiable.
- But it's still sub-differentiable. Solution: (stochastic) subgradient descent!

Slide 52: Log Loss and Hinge Loss for ⟨x, l⟩
log loss: $\log \sum_{l' \in \mathcal{L}} \exp\, \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)$
hinge loss: $\big( \max_{l' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') \big) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)$
[Figure: in the binary case, loss against score difference; in purple is the hinge loss, in blue is the log loss, in red is the zero-one loss (error).]

Slide 53: Minimizing Hinge Loss: Perceptron
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \Big( \max_{l' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, l') \Big) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_i, l_i)$$
Stochastic subgradient descent on the above is called the perceptron algorithm. For t ∈ {1, ..., T}:
- Pick i_t uniformly at random from {1, ..., n}.
- $\hat{l}_{i_t} \leftarrow \operatorname*{argmax}_{l \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, l)$
- $\mathbf{w} \leftarrow \mathbf{w} - \alpha \big( \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, \hat{l}_{i_t}) - \boldsymbol{\phi}(\boldsymbol{x}_{i_t}, l_{i_t}) \big)$
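A minimal sketch of the multiclass perceptron updates above, using a joint feature map φ(x, l) that conjoins each word with a candidate label; the feature map, function names, and toy data are hypothetical:

```python
import random
from collections import defaultdict

def phi(tokens, label):
    # Joint feature map: one (word, label) count feature per token.
    feats = defaultdict(float)
    for v in tokens:
        feats[(v, label)] += 1.0
    return feats

def perceptron(examples, label_set, T=100, alpha=1.0, seed=0):
    w = defaultdict(float)
    rng = random.Random(seed)

    def score(feats):
        return sum(w[k] * x for k, x in feats.items())

    for _ in range(T):
        x, gold = examples[rng.randrange(len(examples))]   # pick i_t uniformly at random
        pred = max(label_set, key=lambda l: score(phi(x, l)))
        # w <- w - alpha * (phi(x, l_hat) - phi(x, l_i)); a no-op when pred == gold.
        for k, v in phi(x, pred).items():
            w[k] -= alpha * v
        for k, v in phi(x, gold).items():
            w[k] += alpha * v
    return w

examples = [("great vodka".split(), "pos"), ("do not touch".split(), "neg")]
w = perceptron(examples, label_set=["pos", "neg"])
test = "do not touch".split()
print(max(["pos", "neg"], key=lambda l: sum(w[k] * v for k, v in phi(test, l).items())))  # neg
```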

Slides 54-56: Error Costs
- Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection.
- Let cost(l, l') quantify the "badness" of substituting l' for correct label l.
- Intuition: estimate the scoring function so that score(l_i) - score(l̂) ≥ cost(l_i, l̂).

Slide 57: General Hinge Loss for ⟨x, l⟩
$$\Big( \max_{l' \in \mathcal{L}} \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') + \mathrm{cost}(l, l') \Big) - \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l)$$
[Figure: the binary case, with a cost of 1 for the incorrect label; in blue is the general hinge loss, in red is the zero-one loss (error).]
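A minimal sketch of the cost-augmented ("general") hinge loss for one example, given a scoring function and a cost function; the label names and toy scores are hypothetical:

```python
def general_hinge_loss(scores, gold, cost):
    """scores: dict label -> w . phi(x, label); gold: correct label;
    cost: function (gold, predicted) -> nonnegative penalty.
    Returns max_l' [score(l') + cost(gold, l')] - score(gold)."""
    fear = max(scores[l] + cost(gold, l) for l in scores)
    return fear - scores[gold]

# Toy scores and a 0/1 cost; the loss is positive unless the gold label
# beats every alternative by at least its cost (a margin of 1 here).
scores = {"spam": 1.2, "not spam": 1.5}
zero_one = lambda l, lp: 0.0 if l == lp else 1.0
print(general_hinge_loss(scores, gold="not spam", cost=zero_one))  # 0.7
```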

Slides 58-62: General Remarks
- Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance.
- Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization. You should have a basic understanding of the tradeoffs in choosing among them.
- Rumor: random forests are widely used in industry when performance matters more than interpretability.
- Lots of papers about neural networks, though with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).
- Lots of work on feature design; most of the more sophisticated analysis methods we cover starting next week can provide features for text categorization!

Slide 63: References
- Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm. Lecture notes, 2011.
- Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5), 2001.
- Daniel Jurafsky and James H. Martin. Logistic regression (draft chapter), 2016a.
- Daniel Jurafsky and James H. Martin. Naive Bayes and sentiment classification (draft chapter), 2016b.
- Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 1947.
- Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, 2011.
- Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015.

Slide 64: Extras

Slides 65-67: (Linear) Support Vector Machines
A different motivation for the generalized hinge:
$$\hat{\mathbf{w}} = \sum_{i=1}^{n} \sum_{l \in \mathcal{L}} \alpha_{i,l}\, \boldsymbol{\phi}(\boldsymbol{x}_i, l)$$
where only a small number of the α_{i,l} are nonzero.
Those φ(x_i, l) are called "support vectors" because they "support" the decision boundary:
$$\hat{\mathbf{w}} \cdot \boldsymbol{\phi}(\boldsymbol{x}, l') = \sum_{(i,l) \in S} \alpha_{i,l}\, \boldsymbol{\phi}(\boldsymbol{x}_i, l) \cdot \boldsymbol{\phi}(\boldsymbol{x}, l')$$
See Crammer and Singer (2001) for the multiclass version.
Really good tool: SVM light.

Slides 68-69: Support Vector Machines: Remarks
- Regularization is critical; squared ℓ_2 is most common, and often used in (yet another) motivation around the idea of maximizing "margin" around the hyperplane separator.
- Often, instead of linear models that explicitly calculate w · φ, these methods are "kernelized" and rearrange all calculations to involve inner products between φ vectors. Examples:
$$K_{\text{linear}}(\mathbf{v}, \mathbf{w}) = \mathbf{v} \cdot \mathbf{w}, \qquad K_{\text{polynomial}}(\mathbf{v}, \mathbf{w}) = (\mathbf{v} \cdot \mathbf{w} + 1)^p, \qquad K_{\text{Gaussian}}(\mathbf{v}, \mathbf{w}) = \exp\!\left( -\frac{\|\mathbf{v} - \mathbf{w}\|_2^2}{2\sigma^2} \right)$$
- Linear kernels are most common in NLP.
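A minimal sketch of the three kernels listed above, operating on plain Python lists as vectors; p and sigma are the (hypothetical) kernel hyperparameters:

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def k_linear(v, w):
    return dot(v, w)

def k_polynomial(v, w, p=2):
    return (dot(v, w) + 1) ** p

def k_gaussian(v, w, sigma=1.0):
    # exp(-||v - w||^2 / (2 sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, w))
    return math.exp(-sq_dist / (2 * sigma ** 2))

v, w = [1.0, 0.0, 2.0], [0.5, 1.0, 1.0]
print(k_linear(v, w), k_polynomial(v, w), k_gaussian(v, w))
```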

More information