Evaluation for sets of classes

Evaluation for Text Categorization

Classification accuracy, the usual measure in ML, is the proportion of correct decisions. It is not appropriate if the population rate of the class is low.
Precision, Recall and F1 are better measures.

Evaluation for sets of classes

How can we combine evaluation w.r.t. single classes into an evaluation for prediction over multiple classes? Two aggregate measures:
Macro-averaging computes a simple average over the classes of the precision, recall, and F1 measures.
Micro-averaging pools per-document decisions across classes and then computes precision, recall, and F1 on the pooled contingency table.
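As a side sketch (not part of the original slides), precision, recall, and F1 for a single class can be computed from the true positive, false positive, and false negative counts; the function and variable names below are my own:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class measures from a 2x2 contingency table.

    tp: documents correctly assigned to the class
    fp: documents wrongly assigned to the class
    fn: documents of the class that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```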

Macro and Micro Averaging

Macro-averaging gives the same weight to each class.
Micro-averaging gives the same weight to each per-document decision.

Example

Class 1       Pred: yes   Pred: no
Truth: yes        10          10
Truth: no         10         970

Class 2       Pred: yes   Pred: no
Truth: yes        90          10
Truth: no         10         890

POOLED        Pred: yes   Pred: no
Truth: yes       100          20
Truth: no         20        1860

Macro-averaged precision: (.5 + .9)/2 = .7
Micro-averaged precision: 100/120 = .833
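A small Python check of this example (my own sketch; the counts are copied from the tables above):

```python
# Macro vs. micro averaging on the two-class example above.
# Each entry holds (tp, fp) from the "Pred: yes" column of one class table.
tables = [
    {"tp": 10, "fp": 10},   # Class 1
    {"tp": 90, "fp": 10},   # Class 2
]

per_class_precision = [t["tp"] / (t["tp"] + t["fp"]) for t in tables]
macro_precision = sum(per_class_precision) / len(per_class_precision)

pooled_tp = sum(t["tp"] for t in tables)
pooled_fp = sum(t["fp"] for t in tables)
micro_precision = pooled_tp / (pooled_tp + pooled_fp)

print(macro_precision)  # 0.7
print(micro_precision)  # 0.833...
```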

Benchmark Collections (used in Text Categorization)

Reuters-21578: the most widely used in text categorization. It consists of newswire articles which are labeled with some number of topical classifications (zero or more out of 115 classes). 9603 train + 3299 test documents.
Reuters RCV1: newstories, larger than the previous one (about 810K documents) with a hierarchically structured set of 103 leaf classes.
Ohsumed: an ML set of 348K docs classified under a hierarchically structured set of 14K classes (MeSH thesaurus). Titles + abstracts of scientific medical papers.
20 Newsgroups: 18491 articles from the 20 Usenet newsgroups.

The inductive construction of classifiers

Two different phases to build a classifier $h_i$ for category $c_i \in C$:
1. Definition of a function $CSV_i : D \rightarrow \mathbb{R}$, a categorization status value, representing the strength of the evidence that a given document $d$ belongs to $c_i$.
2. Definition of a threshold $\tau_i$ such that
   $CSV_i(d) \geq \tau_i$ is interpreted as a decision to classify $d$ under $c_i$
   $CSV_i(d) < \tau_i$ is interpreted as a decision not to classify $d$ under $c_i$

CSV and Proportional thresholding

Two different ways to determine the thresholds $\tau_i$ once $CSV_i$ is given are [Yang01]:
1. CSV thresholding: $\tau_i$ is a value returned by the CSV function. It may or may not be equal for all the categories, and it is obtained on a validation set.
2. Proportional thresholding: the $\tau_i$ are the values such that the validation set frequencies for each class are as close as possible to the same frequencies in the training set.
CSV thresholding is theoretically better motivated and generally produces superior effectiveness, but it is computationally more expensive.
Thresholding is needed only for hard classification. In soft classification the decision is taken by the expert, and the CSV scores can be used for ranking purposes.
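A minimal sketch of CSV thresholding (my own illustration, not from the slides): for one category, pick the threshold that maximizes F1 on a validation set of (score, true label) pairs.

```python
def choose_threshold(val_scores, val_labels):
    """Pick the CSV threshold that maximizes F1 on a validation set.

    val_scores: CSV_i(d) for each validation document d
    val_labels: True if d really belongs to category c_i
    """
    best_tau, best_f1 = 0.0, -1.0
    for tau in sorted(set(val_scores)):
        tp = sum(s >= tau and y for s, y in zip(val_scores, val_labels))
        fp = sum(s >= tau and not y for s, y in zip(val_scores, val_labels))
        fn = sum(s < tau and y for s, y in zip(val_scores, val_labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau
```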

Probabilistic Classifiers

Probabilistic classifiers view $CSV_i(d)$ in terms of $P(c_i \mid d)$, and compute it by means of Bayes' theorem:
$$P(c_i \mid d) = \frac{P(d \mid c_i)\, P(c_i)}{P(d)}$$
Maximum A Posteriori Hypothesis (MAP): $\arg\max_{c_i} P(c_i \mid d)$.
Classes are viewed as generators of documents.
The prior probability $P(c_i)$ is the probability that a document $d$ is in $c_i$.

Naive Bayes Classifiers

Task: classify a new instance $D$ based on a tuple of attribute values $D = \langle x_1, x_2, \ldots, x_n \rangle$ into one of the classes $c_j \in C$:
$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
         = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
         = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$$

Naïve Bayes Classifier: Assumptions

$P(c_j)$ can be estimated from the frequency of classes in the training examples.
$P(x_1, x_2, \ldots, x_n \mid c_j)$ has $O(|X|^n \cdot |C|)$ parameters and could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(x_i \mid c_j)$.

The Naïve Bayes Classifier

(Figure: class Flu with features $X_1, \ldots, X_5$: runny nose, sinus, cough, fever, muscle ache.)
Conditional Independence Assumption: features are independent of each other given the class:
$$P(x_1, \ldots, x_5 \mid C) = P(x_1 \mid C) \cdot P(x_2 \mid C) \cdots P(x_5 \mid C)$$
Only $n \cdot |C|$ parameters (+ $|C|$) to estimate.

Learning the Model

(Figure: class variable $C$ with observed features $X_1, \ldots, X_6$.)
Maximum likelihood estimates: the most likely value of each parameter given the training data, i.e. simply use the frequencies in the data:
$$\hat P(c_j) = \frac{N(C = c_j)}{N} \qquad \hat P(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$

Problem with Maximum Likelihood

$$P(x_1, \ldots, x_5 \mid C) = P(x_1 \mid C) \cdot P(x_2 \mid C) \cdots P(x_5 \mid C)$$
What if we have seen no training cases where the patient had no flu but did have muscle aches? Then
$$\hat P(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$$
Zero probabilities cannot be conditioned away, no matter the other evidence:
$$\ell = \arg\max_c \hat P(c) \prod_i \hat P(x_i \mid c)$$
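A small sketch of these maximum-likelihood counts on a made-up toy dataset (the data and names below are my own, purely for illustration of the zero-probability problem):

```python
from collections import Counter, defaultdict

# Toy training data for the flu example: (features, class).
# Features: (runny_nose, sinus, cough, fever, muscle_ache); class: "flu" / "nf".
train = [
    ((1, 1, 1, 1, 1), "flu"),
    ((1, 0, 1, 1, 0), "flu"),
    ((0, 0, 1, 0, 0), "nf"),
    ((1, 0, 0, 0, 0), "nf"),
]

class_counts = Counter(c for _, c in train)
feat_counts = defaultdict(Counter)          # feat_counts[c][(i, value)]
for x, c in train:
    for i, v in enumerate(x):
        feat_counts[c][(i, v)] += 1

def p_hat(i, v, c):
    """Unsmoothed MLE estimate of P(X_i = v | C = c)."""
    return feat_counts[c][(i, v)] / class_counts[c]

# No "nf" example has muscle_ache = 1, so the MLE estimate is exactly zero,
# which wipes out the whole product for class "nf" at classification time.
print(p_hat(4, 1, "nf"))   # 0.0
```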

Smoothing to Avoid Overfitting

$$\hat P(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$$
where $k$ is the number of values of $X_i$.

Stochastic Language Models

Model the probability of generating strings (each word in turn) in the language.

Model M:
the     0.2
a       0.1
man     0.01
woman   0.01
said    0.03
likes   0.02

s = "the man likes the woman"
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008  (multiply the per-word probabilities)
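A hedged sketch of the add-one (Laplace) smoothed estimate, continuing the toy flu counts above (function and argument names are my own):

```python
def smoothed_p(count_xv_c, count_c, k):
    """Add-one (Laplace) smoothed estimate of P(X_i = x | C = c).

    count_xv_c: N(X_i = x, C = c)
    count_c:    N(C = c)
    k:          number of distinct values X_i can take
    """
    return (count_xv_c + 1) / (count_c + k)

# The zero from the flu example becomes a small positive probability:
print(smoothed_p(0, 2, 2))   # (0+1)/(2+2) = 0.25 for a binary feature
```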

Stochastic Language Models

Model the probability of generating any string.

            Model M1   Model M2
the          0.2        0.2
class        0.01       0.0001
sayst        0.0001     0.03
pleaseth     0.0001     0.02
yon          0.0001     0.1
maiden       0.0005     0.01
woman        0.01       0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)

Two Models

Model 1: Multivariate binomial
One feature $X_w$ for each word in the dictionary; $X_w$ = true in document $d$ if $w$ appears in $d$.
Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
This is the model you get from the binary independence model in probabilistic relevance feedback on hand-classified data.
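A quick Python check of the comparison above (my own sketch; the per-word probabilities are copied from the two models):

```python
from math import prod

m1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
m2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

s = "the class pleaseth yon maiden".split()

p_m1 = prod(m1[w] for w in s)
p_m2 = prod(m2[w] for w in s)
print(p_m1, p_m2, p_m2 > p_m1)   # the string is far more likely under M2
```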

Two Models

Model 2: Multinomial
One feature $X_i$ for each word position in the document; the feature's values are all the words in the dictionary; the value of $X_i$ is the word in position $i$.
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.
Second assumption: word appearance does not depend on position: $P(X_i = w \mid c) = P(X_j = w \mid c)$.

Parameter estimation

Binomial model: $\hat P(X_w = t \mid c_j)$ = fraction of documents of topic $c_j$ in which word $w$ appears.
Multinomial model: $\hat P(X_i = w \mid c_j)$ = fraction of times in which word $w$ appears across all documents of topic $c_j$.
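A minimal sketch contrasting the two estimates on a tiny made-up corpus (the documents and function names are my own):

```python
from collections import Counter

# Tiny hypothetical training corpus for one topic c.
docs_c = [
    "the man likes the woman".split(),
    "the woman said hello".split(),
]

def binomial_estimate(word, docs):
    """Fraction of documents of the topic in which the word appears."""
    return sum(word in d for d in docs) / len(docs)

def multinomial_estimate(word, docs):
    """Fraction of word occurrences across all documents of the topic."""
    counts = Counter(w for d in docs for w in d)
    return counts[word] / sum(counts.values())

print(binomial_estimate("the", docs_c))      # 1.0  (appears in both docs)
print(multinomial_estimate("the", docs_c))   # 3/9 = 0.333...
```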

Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.
Calculate the required $P(c_j)$ and $P(x_k \mid c_j)$ terms:
For each $c_j$ in $C$ do
  docs_j ← the subset of documents for which the target class is $c_j$
  $P(c_j)$ ← |docs_j| / |total # documents|
  Text_j ← a single document containing all of docs_j
  for each word $x_k$ in Vocabulary
    $n_k$ ← number of occurrences of $x_k$ in Text_j
    $P(x_k \mid c_j) \leftarrow \dfrac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$
where $n$ is the total number of word occurrences in Text_j.

Naïve Bayes: Classifying

positions ← all word positions in the current document which contain tokens found in the Vocabulary.
Return $c_{NB}$, where
$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
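A compact, runnable sketch of this multinomial Naïve Bayes procedure (my own implementation of the pseudocode above, with hypothetical toy documents):

```python
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    """labeled_docs: list of (list_of_tokens, class_label) pairs."""
    vocabulary = {w for doc, _ in labeled_docs for w in doc}
    classes = {c for _, c in labeled_docs}
    prior, cond = {}, defaultdict(dict)
    for c in classes:
        docs_c = [doc for doc, label in labeled_docs if label == c]
        prior[c] = len(docs_c) / len(labeled_docs)          # P(c_j)
        text_c = Counter(w for doc in docs_c for w in doc)  # Text_j
        n = sum(text_c.values())                            # words in Text_j
        for w in vocabulary:
            cond[c][w] = (text_c[w] + alpha) / (n + alpha * len(vocabulary))
    return vocabulary, prior, cond

def classify_nb(doc, vocabulary, prior, cond):
    """Return argmax_c P(c) * prod over known positions of P(x_i | c)."""
    scores = {}
    for c in prior:
        p = prior[c]
        for w in doc:
            if w in vocabulary:
                p *= cond[c][w]
        scores[c] = p
    return max(scores, key=scores.get)

# Hypothetical toy corpus.
train = [("chinese beijing chinese".split(), "china"),
         ("tokyo japan chinese".split(), "japan")]
vocab, prior, cond = train_nb(train)
print(classify_nb("chinese chinese tokyo".split(), vocab, prior, cond))  # "china"
```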

Naive Bayes: Time Complexity

Training time: $O(|D| L_d + |C||V|)$, where $L_d$ is the average length of a document in $D$. This assumes $V$ and all $D_j$, $n_j$, and $n_{jk}$ (the per-class document sets and word counts) are pre-computed in $O(|D| L_d)$ time during one pass through all of the data. It is generally just $O(|D| L_d)$, since usually $|C||V| < |D| L_d$.
Test time: $O(|C| L_t)$, where $L_t$ is the average length of a test document.
Very efficient overall, linearly proportional to the time needed to just read in all the data.

Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since $\log(xy) = \log(x) + \log(y)$, it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:
$$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \Big]$$
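A sketch of the log-space version of the classification rule (my own; it mirrors the classify_nb function from the earlier sketch but sums log probabilities instead of multiplying them):

```python
from math import log

def classify_nb_log(doc, vocabulary, prior, cond):
    """argmax_c [ log P(c) + sum over known positions of log P(x_i | c) ].

    Summing logs avoids floating-point underflow on long documents;
    prior and cond are the estimates produced by train_nb above.
    """
    scores = {}
    for c in prior:
        score = log(prior[c])
        for w in doc:
            if w in vocabulary:
                score += log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```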