Generative Models. CS4780/5780 Machine Learning, Fall 2012. Thorsten Joachims, Cornell University


Generative Models. CS4780/5780 Machine Learning, Fall 2012. Thorsten Joachims, Cornell University. Reading: Mitchell, Chapter 6.9-6.10; Duda, Hart & Stork, pages 20-39.

Outline: Bayes decision rule; Bayes theorem; generative vs. discriminative learning; two generative learning algorithms (naive Bayes, linear discriminant analysis); estimating models from training data (maximum likelihood, maximum a posteriori (MAP)).

Bayes Decision Rule. Assumption: the learning task P(X,Y) is known. Question: given an instance x, how should it be classified to minimize the prediction error? Bayes decision rule: h_bayes(x) = argmax_{y in Y} P(Y = y | X = x).

Generative vs. Discriminative Models. Process: a generator produces descriptions x according to the distribution P(X); a teacher assigns a value y to each description based on P(Y|X); this yields training examples (x_1, y_1), ..., (x_n, y_n) ~ P(X,Y). Discriminative model: select a hypothesis space H of classification rules to consider; find the h in H with the lowest training error. Argument: low training error leads to low prediction error. Examples: SVM, decision trees, Perceptron. Generative model: select a set of distributions to consider for modeling P(X,Y); find the distribution that matches P(X,Y) on the training data. Argument: if the match is close enough, we can use the Bayes decision rule. Examples: naive Bayes, HMM.

Bayes Theorem. It is possible to switch the conditioning according to the following rule: given any two random variables X and Y, it holds that P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x). Note that P(X = x) = Σ_{y in Y} P(X = x | Y = y) P(Y = y).
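To make this concrete, here is a small Python sketch (my own illustration, not from the lecture) that applies Bayes' theorem to made-up priors and likelihoods and then classifies with the Bayes decision rule from the previous slide; the numbers and the names prior, likelihood, posterior, and h_bayes are all hypothetical.

```python
# Illustrative sketch: Bayes' theorem and the Bayes decision rule on a toy
# discrete distribution with made-up numbers.

# Hypothetical class priors P(Y=y) and likelihoods P(X=x | Y=y).
prior = {+1: 0.3, -1: 0.7}
likelihood = {
    (+1, "high"): 0.8, (+1, "low"): 0.2,   # P(X=x | Y=+1)
    (-1, "high"): 0.1, (-1, "low"): 0.9,   # P(X=x | Y=-1)
}

def posterior(y, x):
    """P(Y=y | X=x) via Bayes' theorem; the denominator uses total probability."""
    evidence = sum(likelihood[(yp, x)] * prior[yp] for yp in prior)
    return likelihood[(y, x)] * prior[y] / evidence

def h_bayes(x):
    """Bayes decision rule: pick the class with the largest posterior."""
    return max(prior, key=lambda y: posterior(y, x))

print(posterior(+1, "high"))  # 0.24 / 0.31, roughly 0.774
print(h_bayes("high"))        # -> +1
```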

Naïve Bayes Classifier (Multivariate). Model for each class: P(X = x | Y = +1) = Π_{i=1..N} P(X_i = x_i | Y = +1) and P(X = x | Y = -1) = Π_{i=1..N} P(X_i = x_i | Y = -1). Prior probabilities: P(Y = +1), P(Y = -1). Classification rule: h_naive(x) = argmax_{y in {+1,-1}} P(Y = y) Π_{i=1..N} P(X_i = x_i | Y = y). Example training data:

fever (3)  cough (2)  pukes (2)  flu?
high       yes        no          1
high       no         yes         1
low        yes        no         -1
low        yes        yes         1
high       no         yes        ???

Estimating the Parameters of NB. Count frequencies in the training data (the flu table from the previous slide): n is the number of training examples; n_+ / n_- is the number of positive/negative examples; #(X_i = x_i, y) is the number of times feature X_i takes value x_i among examples of class y; |X_i| is the number of values attribute X_i can take. Estimating P(Y): the fraction of positive/negative examples in the training data, P(Y = +1) = n_+ / n and P(Y = -1) = n_- / n. Estimating P(X|Y) by the maximum likelihood estimate: P(X_i = x_i | Y = y) = #(X_i = x_i, y) / n_y. Smoothing with the Laplace estimate: P(X_i = x_i | Y = y) = (#(X_i = x_i, y) + 1) / (n_y + |X_i|).
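As a minimal sketch of these estimates, the following Python code (my own, not the course's reference implementation) trains a multivariate naive Bayes classifier with Laplace smoothing on the flu table and classifies the query row; helper names such as p_y, p_xi_given_y, and h_naive are made up.

```python
from collections import Counter

# Multivariate naive Bayes with Laplace smoothing on the flu example.
train = [  # (fever, cough, pukes), flu
    (("high", "yes", "no"), +1),
    (("high", "no", "yes"), +1),
    (("low", "yes", "no"), -1),
    (("low", "yes", "yes"), +1),
]
num_values = [3, 2, 2]  # |X_i|: number of values each attribute can take

n = len(train)
n_y = Counter(y for _, y in train)      # n_+, n_-
counts = Counter()                       # #(X_i = x_i, y)
for x, y in train:
    for i, xi in enumerate(x):
        counts[(i, xi, y)] += 1

def p_y(y):
    return n_y[y] / n

def p_xi_given_y(i, xi, y):
    # Laplace estimate: (#(X_i=x_i, y) + 1) / (n_y + |X_i|)
    return (counts[(i, xi, y)] + 1) / (n_y[y] + num_values[i])

def h_naive(x):
    def score(y):
        s = p_y(y)
        for i, xi in enumerate(x):
            s *= p_xi_given_y(i, xi, y)
        return s
    return max((+1, -1), key=score)

print(h_naive(("high", "no", "yes")))  # query row from the slide -> +1
```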

Linear Discriminant Analysis. Spherical Gaussian model with unit variance for each class: P(X = x | Y = +1) ∝ exp(-1/2 ||x - μ_+||^2) and P(X = x | Y = -1) ∝ exp(-1/2 ||x - μ_-||^2). Prior probabilities: P(Y = +1), P(Y = -1). Classification rule: h_LDA(x) = argmax_{y in {+1,-1}} P(Y = y) exp(-1/2 ||x - μ_y||^2) = argmax_{y in {+1,-1}} [log P(Y = y) - 1/2 ||x - μ_y||^2]. Often called the Rocchio algorithm in Information Retrieval.

Estimating the Parameters of LDA. Count frequencies in the training data (x_1, y_1), ..., (x_n, y_n) ~ P(X,Y): n is the number of training examples; n_+ / n_- is the number of positive/negative training examples. Estimating P(Y): the fraction of positive/negative examples in the training data, P(Y = +1) = n_+ / n and P(Y = -1) = n_- / n. Estimating the class means: μ_+ = (1/n_+) Σ_{i: y_i = +1} x_i and μ_- = (1/n_-) Σ_{i: y_i = -1} x_i.
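A small Python sketch of LDA as described above, assuming unit-variance spherical Gaussians: estimate class priors and means from the training data and classify by argmax_y [log P(Y = y) - 1/2 ||x - μ_y||^2]. The function names and the tiny 2D data set are illustrative only.

```python
import numpy as np

def train_lda(X, y):
    """Estimate (prior, mean) per class for the unit-variance spherical model."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in (+1, -1):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0))
    return params

def h_lda(params, x):
    x = np.asarray(x, dtype=float)
    def score(c):
        prior, mu = params[c]
        # argmax_y [ log P(Y=y) - 0.5 * ||x - mu_y||^2 ]
        return np.log(prior) - 0.5 * np.sum((x - mu) ** 2)
    return max(params, key=score)

# Usage with made-up 2D data:
params = train_lda([[2, 1], [3, 2], [-1, 0], [-2, -1]], [+1, +1, -1, -1])
print(h_lda(params, [2.5, 1.5]))  # -> +1 (closer to the positive class mean)
```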

Naïve Bayes Classifier (Multinomial). Application: text classification (x = (w_1, ..., w_l) is a word sequence). Assumption: P(X = x | Y = +1) = Π_{i=1..l} P(W = w_i | Y = +1) and P(X = x | Y = -1) = Π_{i=1..l} P(W = w_i | Y = -1). Classification rule: h_naive(x) = argmax_{y in {+1,-1}} P(Y = y) Π_{i=1..l} P(W = w_i | Y = y).

Estimating the Parameters of Multinomial Naïve Bayes. Count frequencies in the training data: n is the number of training examples; n_+ / n_- is the number of positive/negative examples; #(W = w, y) is the number of times word w occurs in examples of class y; l_+ / l_- is the total number of words in positive/negative examples; |V| is the size of the vocabulary. Estimating P(Y): P(Y = +1) = n_+ / n and P(Y = -1) = n_- / n. Estimating P(X|Y) (smoothing with the Laplace estimate): P(W = w | Y = y) = (#(W = w, y) + 1) / (l_y + |V|).
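The multinomial model can be sketched in a few lines of Python; this is my own illustration (it works in log space to avoid numerical underflow on long documents), and the two-document toy corpus is made up.

```python
from collections import Counter
from math import log

def train_multinomial_nb(docs):
    """docs: list of (list_of_words, label) with labels in {+1, -1}."""
    n = len(docs)
    n_y = Counter(y for _, y in docs)
    word_counts = {+1: Counter(), -1: Counter()}   # #(W=w, y)
    vocab = set()
    for words, y in docs:
        word_counts[y].update(words)
        vocab.update(words)
    l_y = {y: sum(word_counts[y].values()) for y in (+1, -1)}  # words per class
    return n, n_y, word_counts, l_y, vocab

def h_multinomial_nb(model, words):
    n, n_y, word_counts, l_y, vocab = model
    def log_score(y):
        s = log(n_y[y] / n)
        for w in words:
            # P(W=w | Y=y) = (#(W=w, y) + 1) / (l_y + |V|)
            s += log((word_counts[y][w] + 1) / (l_y[y] + len(vocab)))
        return s
    return max((+1, -1), key=log_score)

# Usage on a made-up two-document corpus:
model = train_multinomial_nb([(["good", "great", "film"], +1),
                              (["bad", "boring", "film"], -1)])
print(h_multinomial_nb(model, ["great", "film"]))  # -> +1
```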

Test Collections. Reuters-21578: Reuters newswire articles classified by topic; 90 categories (multi-label); 9603 training documents / 3299 test documents (ModApte split); ~27,000 features. Ohsumed MeSH: medical abstracts classified by subject heading; 20 categories from the disease subtree (multi-label); 10,000 training documents / 10,000 test documents; ~38,000 features. WebKB Collection: WWW pages classified by function (e.g. personal HP, project HP); 4 categories (multi-class); 4183 training documents / 226 test documents; ~38,000 features.

Example: Reuters Article (Multi-Label). Categories: COFFEE, CRUDE. KENYAN ECONOMY FACES PROBLEMS, PRESIDENT SAYS. The Kenyan economy is heading for difficult times after a boom last year, and the country must tighten its belt to prevent the balance of payments swinging too far into deficit, President Daniel Arap Moi said. In a speech at the state opening of parliament, Moi said high coffee prices and cheap oil in 1986 led to economic growth of five pct, compared with 4.1 pct in 1985. The same factors produced a two billion shilling balance of payments surplus and inflation fell to 5.6 pct from 10.7 pct in 1985, he added. "But both these factors are no longer in our favour... As a result, we cannot expect an increase in foreign exchange reserves during the year," he said.

Example: Ohsumed Abstract. Categories: Animal, Blood_Proteins/Metabolism, DNA/Drug_Effects, Mycotoxins/Toxicity, ...

Representing Text as Feature Vectors. Vector space representation of text: each word is a feature; the feature value is either the term frequency (TF) or TFIDF (see Wikipedia); vectors are normalized to unit length; the ordering of words is ignored. Example text: "Google announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Google officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters." Resulting word-count vector: aardvark 0, announced 1, apple 1, ..., get 0, goal 0, google 2, gross 1, ..., zero 0, zilch 0, zztop 0.
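A short sketch of this bag-of-words construction: term-frequency features over a fixed vocabulary, normalized to unit length, with word order ignored. The vocabulary and text below are illustrative; a TFIDF variant would reweight the raw counts before normalizing.

```python
from collections import Counter
from math import sqrt

def tf_vector(text, vocabulary):
    counts = Counter(text.lower().split())          # word order is ignored
    vec = [counts[w] for w in vocabulary]           # term frequency per feature
    norm = sqrt(sum(v * v for v in vec)) or 1.0     # normalize to unit length
    return [v / norm for v in vec]

vocabulary = ["aardvark", "announced", "apple", "google", "gross", "zztop"]
print(tf_vector("Google announced today that they acquired Apple", vocabulary))
```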

Multi-Class via One-Against-Rest. Goal: h: X → Y with Y = {1, 2, 3, ..., k}. Problem: many classifiers can only learn binary classification rules. Most common solution: learn one binary classifier for each class and put the example into the class with the highest probability (or some approximation thereof). Example binarization: (x, {2,7,9}) → (x, (-1,+1,-1,-1,-1,-1,+1,-1,+1,-1)). Training: one binary classifier h_i(x) for each class i. Prediction on a new x: h(x) = argmax_{i in 1..k} h_i(x). This assumes that each classifier outputs a confidence score and not just +1 / -1.
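A minimal sketch of one-against-rest prediction for the multi-class case; the scorers dictionary stands in for any set of binary classifiers that output confidence scores and is purely hypothetical.

```python
def one_vs_rest_predict(scorers, x):
    """scorers: dict mapping class i -> function returning a confidence score.
    Predict the class whose binary classifier is most confident."""
    return max(scorers, key=lambda i: scorers[i](x))

# Usage with made-up scorers (e.g., class posteriors or SVM margins):
scorers = {1: lambda x: -0.2, 2: lambda x: 0.7, 3: lambda x: 0.1}
print(one_vs_rest_predict(scorers, x=None))  # -> 2
```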

Multi-Label via One-Against-Rest. Goal: h: X → Y with Y = 2^{{1,2,3,...,k}} (the set of all label subsets). Problem: many classifiers can only learn binary classification rules. Most common solution: learn one binary classifier for each label and attach all labels for which its binary classifier says positive. Example binarization: (x, {2,7,9}) → (x, (-1,+1,-1,-1,-1,-1,+1,-1,+1,-1)). Training: one binary classifier h_i(x) for each class i. Prediction on a new x: h(x) = (h_1(x), h_2(x), ..., h_k(x)).
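The multi-label variant differs only in that every label whose binary classifier scores positive is attached; again the per-label scorers are hypothetical.

```python
def one_vs_rest_multilabel(scorers, x, threshold=0.0):
    """Attach every label whose binary classifier gives a positive score."""
    return {i for i in scorers if scorers[i](x) > threshold}

# Usage with made-up per-label scorers:
scorers = {2: lambda x: 0.7, 7: lambda x: 0.3, 9: lambda x: -0.4}
print(one_vs_rest_multilabel(scorers, x=None))  # -> {2, 7}
```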

Contingency Table.

            h(x) = +1   h(x) = -1
  y = +1        a           b
  y = -1        c           d

Commonly used performance measures: ErrorRate = (b+c) / (a+b+c+d); Accuracy = (a+d) / (a+b+c+d); WeightedErrorRate = (w_b*b + w_c*c) / (a+b+c+d); Precision = a / (a+c); Recall = a / (a+b); F1 score = 2 * Precision * Recall / (Precision + Recall).
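A small helper (my own, assuming the a, b, c, d counts defined in the table above) that computes the listed measures; the example counts are made up.

```python
def measures(a, b, c, d):
    """Performance measures from the contingency table counts a, b, c, d."""
    precision = a / (a + c)
    recall = a / (a + b)
    return {
        "error_rate": (b + c) / (a + b + c + d),
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(measures(a=40, b=10, c=20, d=30))  # made-up counts
```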

Performance Measures for Text Classification. Precision/recall curve for binary classification: sweep the threshold from no positive classifications to only positive classifications and plot precision vs. recall at each step of the sweep. Precision/recall break-even point: the intersection of the PR curve with the identity line (the point where precision equals recall). Macro-averaging: first compute the measure for each task, then average; results in an average over tasks. Micro-averaging: first average (pool) the elements of the contingency tables, then compute the measure; results in an average over individual classification decisions. Also see http://www.scholarpedia.org/article/text_categorization.
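A short sketch contrasting macro- and micro-averaged precision over several binary tasks, each summarized by a contingency table (a, b, c, d); the two tables are made-up examples of a frequent and a rare category, which is where the two averages diverge most.

```python
def macro_precision(tables):
    """Average the per-task precision values: each task counts equally."""
    return sum(a / (a + c) for a, b, c, d in tables) / len(tables)

def micro_precision(tables):
    """Pool the contingency tables first: each classification decision counts equally."""
    a_sum = sum(a for a, b, c, d in tables)
    c_sum = sum(c for a, b, c, d in tables)
    return a_sum / (a_sum + c_sum)

tables = [(90, 5, 10, 895), (2, 3, 8, 987)]   # made-up: one frequent, one rare category
print(macro_precision(tables))  # per-task average
print(micro_precision(tables))  # frequent categories dominate
```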

Experiment Results. Corpora: Reuters Newswire (90 categories, 9603 training docs, 3299 test docs, ~27,000 features), WebKB Collection (4 categories, 4183 training docs, 226 test docs, ~38,000 features), Ohsumed MeSH (20 categories, 10,000 training docs, 10,000 test docs, ~38,000 features). Microaveraged precision/recall break-even point [0..100]:

                             Reuters   WebKB   Ohsumed
  Naïve Bayes (multinomial)    72.3     82.0     62.4
  Rocchio Algorithm (LDA)      79.9     74.1     61.5
  C4.5 Decision Tree           79.4     79.1     56.7
  k-Nearest Neighbors          82.6     80.5     63.4
  SVM (linear)                 87.5     90.3     71.6

Comparison of Methods for Text Classification.

                             Naïve Bayes   Rocchio (LDA)   TDIDT C4.5   k-NN   SVM
  Simplicity (conceptual)        ++             ++              -        ++     -
  Efficiency at training          +              +             --        ++     -
  Efficiency at prediction       ++             ++              +        --     ++
  Handling many classes           +              +              -        ++     -
  Theoretical validity            -              -              -         o     +
  Prediction accuracy             -              o              -         +     ++
  Stability and robustness        -              -             --         +     ++