Short Note: Naive Bayes Classifiers and Permanence of Ratios

Julián M. Ortiz (jmo1@ualberta.ca)
Department of Civil & Environmental Engineering
University of Alberta

Abstract

The assumption of permanence of ratios allows the integration of data from different sources, of different types and precision. It originates from the assumption of conditional independence between the different sources of information, given the event of interest. In statistics, this concept is known as the Naive Bayes classifier, which, under the same assumption of conditional independence, aims at classifying multivariate events given the conditioning information. The conditional probabilities are inferred from training data, much as a training image is used to infer the multiple-point statistics that condition geostatistical simulations. We recall the theory behind these two concepts and comment on some future avenues of research.

Introduction

Journel introduced the concept of permanence of ratios as an alternative to the hypothesis of data independence in geostatistical applications, when integrating information from multiple sources [5]. Consider the estimation of the probability of occurrence of an unknown event A, given information from several sources. For now, let us consider only two sources of information, B and C. Bayes law permits the calculation of the conditional probability P(A|B,C):

    P(A|B,C) = P(A,B,C) / P(B,C)    (1)

with

    P(A,B,C) = P(A) · P(B|A) · P(C|A,B) = P(A) · P(C|A) · P(B|A,C)

However, the calculation of this probability requires knowledge of the joint probability of B and C, and of the conditional probability of C given A and B (or, equivalently, this last requirement can be replaced by knowledge of the conditional probability of B given A and C).
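As an illustration (not part of the original note), the sketch below checks Expression 1 and the chain-rule decomposition on a small, made-up joint distribution over three binary events; the probability table and the helper prob() are invented for this example.

    # Illustrative check of Expression 1 on a made-up joint distribution over
    # three binary events A, B, C (all values below are invented for this sketch).
    joint = {
        (1, 1, 1): 0.15, (1, 1, 0): 0.05, (1, 0, 1): 0.10, (1, 0, 0): 0.05,
        (0, 1, 1): 0.05, (0, 1, 0): 0.20, (0, 0, 1): 0.10, (0, 0, 0): 0.30,
    }
    assert abs(sum(joint.values()) - 1.0) < 1e-12

    def prob(**fixed):
        """Marginal probability of the fixed outcomes, e.g. prob(a=1, b=1)."""
        return sum(p for (a, b, c), p in joint.items()
                   if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

    # Left-hand side of Expression 1: P(A|B,C) = P(A,B,C) / P(B,C).
    lhs = prob(a=1, b=1, c=1) / prob(b=1, c=1)

    # Right-hand side via the chain rule: P(A) P(B|A) P(C|A,B) / P(B,C).
    rhs = (prob(a=1)
           * (prob(a=1, b=1) / prob(a=1))
           * (prob(a=1, b=1, c=1) / prob(a=1, b=1))) / prob(b=1, c=1)

    print(lhs, rhs)  # both ~0.75; they agree up to floating-point rounding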

Data Independence

The assumption of data independence states that B and C are independent, therefore:

    P(B,C) = P(B) · P(C)

Contrary to what is stated in the paper by Journel [5], this condition by itself does not resolve the problem. An additional assumption is required to simplify Expression 1: conditional independence of B and C given the event A [2], that is:

    P(C|A,B) = P(C|A)
    P(B|A,C) = P(B|A)

These two assumptions entail:

    P(A|B,C) = P(A) · P(B|A) · P(C|A) / (P(B) · P(C))    (2)

Recall that:

    P(B|A) = P(A,B) / P(A) = P(A|B) · P(B) / P(A)    (3)

Hence, Expression 2 can also be written:

    P(A|B,C) = P(A|B) · P(A|C) / P(A)

Although correct, this expression is not robust against departures from the assumption of independence. Consider the case where P(A) = 0.1, P(A|B) = 0.8, and P(A|C) = 0.8. The conditional probability P(A|B,C) is then 6.4. Since a probability is being estimated, a good property of an estimator would be to generate results that always fall in the interval [0, 1]. The previous assumptions do not provide an estimate (by Bayes law) that satisfies this property. This motivates the search for alternative assumptions.
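The lack of robustness is easy to reproduce numerically. The following sketch (illustrative only) simply plugs the numbers from the example above into the full-independence estimate:

    # Full-independence estimate P(A|B,C) = P(A|B) P(A|C) / P(A),
    # with the numbers used in the example above.
    p_a, p_a_given_b, p_a_given_c = 0.1, 0.8, 0.8

    estimate = p_a_given_b * p_a_given_c / p_a
    print(estimate)                      # ~6.4, far outside the interval [0, 1]
    assert not 0.0 <= estimate <= 1.0    # not a valid probability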

Conditional Independence

A less constraining and, as we will see, more robust approach is to assume that the data are conditionally independent given the event A. There is no need to assume that they are independent. The expression for the conditional probability of A given B and C is:

    P(A|B,C) = P(A) · P(B|A) · P(C|A) / P(B,C)    (4)

The joint probability of B and C is still required in order to compute this conditional probability; however, as Journel pointed out, there is a simple way around this problem that gives the expression for the permanence of ratios assumption [5]. We can consider the expression for P(A|B,C) and the expression for P(Ā|B,C), where the event Ā represents the complement of event A, that is, the event of A not occurring, so that P(Ā|B,C) = 1 − P(A|B,C). The two expressions and the ratio between them are:

    P(A|B,C) = P(A) · P(B|A) · P(C|A) / P(B,C)    (5)

    P(Ā|B,C) = P(Ā) · P(B|Ā) · P(C|Ā) / P(B,C)    (6)

    P(Ā|B,C) / P(A|B,C) = [P(Ā) · P(B|Ā) · P(C|Ā)] / [P(A) · P(B|A) · P(C|A)]    (7)

Using Expression 3, the terms in Expression 7 can be rearranged to end up with the permanence of ratios expression:

    P(Ā|B,C) / P(A|B,C) = [P(Ā|B) / P(A|B)] · [P(Ā|C) / P(A|C)] / [P(Ā) / P(A)]    (8)

This expression states that the incremental information provided by one event C before and after knowing another event B is constant (and vice versa).
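As a minimal sketch of how Expression 8 can be used as an estimator (the function name and interface below are chosen for this note, not taken from any library), the following solves Expression 8 for P(A|B,C) given P(A), P(A|B) and P(A|C), and shows that the inputs that broke the full-independence estimate now yield a valid probability:

    def permanence_of_ratios(p_a, p_a_b, p_a_c):
        """Combine P(A|B) and P(A|C) into P(A|B,C) under Expression 8.

        Uses the odds-like ratios a = P(~A)/P(A), b = P(~A|B)/P(A|B),
        c = P(~A|C)/P(A|C); Expression 8 gives x = b * c / a, and then
        P(A|B,C) = 1 / (1 + x).
        """
        a = (1.0 - p_a) / p_a
        b = (1.0 - p_a_b) / p_a_b
        c = (1.0 - p_a_c) / p_a_c
        x = b * c / a
        return 1.0 / (1.0 + x)

    # The same inputs that broke the full-independence estimate stay in [0, 1]:
    print(permanence_of_ratios(0.1, 0.8, 0.8))  # ~0.993, a valid probability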

Naive Bayes Classifier

Statistical classification helps in the decision to assign a given case to a class, given a set of attributes. The problem is similar to that of integrating several sources of information (the attributes) to decide the probability of occurrence of an unknown event (the case). In geostatistical applications, the event A is in general an indicator that represents the presence or absence of an attribute at a given location. For continuous variables, it can be defined as the probability of being below a given threshold. By defining the conditional probability of the event A occurring given the information provided by the attributes B and C (in the case with two additional sources of information), the problem of combining knowledge from diverse sources is reduced to a statistical classification problem.

The Naive Bayes classifier was introduced more than forty years ago [1, 6]. It assumes the attributes are conditionally independent, given the case being classified. The assumption is typically displayed as a Bayesian network, as depicted in Figure 1. This assumption, as seen before, entails the permanence of ratios. It is then reasonable to explore what forty years of research in statistical classification can offer to further the application of the permanence of ratios assumption in geostatistics.

[Figure 1: Bayesian network representing the Naive Bayes classifier with attributes B, C, D, and E. The conditional independence assumption is shown by the absence of connectors between the attributes.]

Comments

The objective of many statistical and geostatistical methods is to predict an unknown or unsampled attribute [4]. This prediction can consider a continuous attribute or a categorical one. Unlike regression methods, classification techniques aim to allocate the unknowns to one of several (disjoint) classes. Both approaches require some rule to assign a value or a class to the unknown event, given the information available. The rule can be based on almost anything. In order to obtain the probability of an unknown event, Bayesian statistics offers a framework to determine these rules. Classification is then simply made by assigning the event to the class with the highest probability.

Notwithstanding their simplicity, Naive Bayes classifiers are extremely successful in terms of accuracy, that is, when the numbers of incorrect versus correct class assignments are computed [3, 4]. This success is mainly due to the fact that they do not need to be precise in estimating the probability; they only have to discriminate correctly between the classes. The class with the highest probability is selected and the case is assigned to it. Hence, the estimate of the probability need not be accurate, but the ranking of the probabilities has to be correct to obtain good classification.

One of the main advantages of the assumption of conditional independence is the great simplification that it brings to the expression for the conditional probability of A given B and C. However, in a spatial context, this assumption appears unrealistic. The same problem occurs in statistical applications, where it is common to find that many of the attributes are strongly correlated. Research in statistics has shown that the Naive Bayes assumption performs well because, despite generating a considerable bias in the estimation of the conditional probability, it generates a low inter-class error, which results in good classification performance. This has been shown in the binary case, although extensions to cases with multiple categories are likely similar [4].

Another interesting application found in the statistical literature is the use of incomplete training data sets to infer the conditional probabilities used in the Naive Bayes model. Depending on the reason for these missing data events, it is possible to infer from the available data intervals for the missing conditional probabilities. These intervals can then be used to improve classification.
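To connect the classification view with the probability-combination view, here is a minimal Naive Bayes sketch for binary attributes: the conditional probabilities are inferred by counting over a small, invented training table (with add-one smoothing, an assumption made here for the sketch, not discussed in the note), and the case is assigned to the class with the highest posterior score.

    from collections import Counter, defaultdict

    # Toy training data: (attributes B, C, D as 0/1, class label for event A).
    # The records are invented purely to illustrate the fitting step.
    training = [
        ((1, 0, 1), "A"), ((1, 1, 1), "A"), ((0, 1, 1), "A"),
        ((0, 0, 0), "not A"), ((1, 0, 0), "not A"), ((0, 1, 0), "not A"),
        ((0, 0, 1), "not A"),
    ]

    # Estimate P(class) and P(attribute_i = 1 | class) by counting,
    # with add-one (Laplace) smoothing to avoid zero probabilities.
    class_counts = Counter(label for _, label in training)
    attr_counts = defaultdict(lambda: [0, 0, 0])
    for attrs, label in training:
        for i, v in enumerate(attrs):
            attr_counts[label][i] += v

    def classify(attrs):
        scores = {}
        for label, n in class_counts.items():
            score = n / len(training)                       # prior P(class)
            for i, v in enumerate(attrs):
                p1 = (attr_counts[label][i] + 1) / (n + 2)  # P(attr_i = 1 | class)
                score *= p1 if v == 1 else (1.0 - p1)       # conditional independence
            scores[label] = score
        # Assign the case to the class with the highest (unnormalized) posterior.
        return max(scores, key=scores.get)

    print(classify((1, 1, 0)))  # "not A" for this toy table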

A simple extension to the Naive Bayes model consists of considering an additional parameter in the expression of the permanence of ratios. This parameter τ should account for the correlation between the sources of information. It implies a departure from the assumption that the incremental information of C remains constant regardless of the additional information provided by B to the estimation of the probability of A occurring:

    P(Ā|B,C) / P(A|B,C) = [P(Ā|B) / P(A|B)] · ( [P(Ā|C) / P(A|C)] / [P(Ā) / P(A)] )^τ

This can be written equivalently as:

    P(Ā|B,C) / P(A|B,C) = [P(Ā|B) / P(A|B)] · [P(Ā|C) / P(A|C)]^τ / [P(Ā) / P(A)]^τ    (9)

Expression 9 accounts for the redundancy between B and C regarding event A. We are looking for the value of τ such that:

    [P(C|A)]^τ = P(C|A,B)

The question is now how to infer τ. A possible solution would be to look at experimental data and try to understand how τ changes with the nature of the dependence. If B and C are fully redundant, then τ = 0. If they are independent, then τ = 1. The behavior between these extremes could be inferred from other synthetic cases.

The assumption of conditional independence opens a new avenue of research in the integration of information from multiple sources. We can build from the experience in statistics. There are many potential applications and several unsolved issues that must be addressed in future research.

References

[1] J. A. Anderson. Diagnosis by logistic discriminant function: Further practical problems and results. Applied Statistics, 23(3):397-404, 1974.

[2] P. Dawid. Conditional independence in statistical theory (with discussion). Journal of the Royal Statistical Society, Series B, 41:1-31, 1979.

[3] E. Frank, L. Trigg, G. Holmes, and I. H. Witten. Technical note: Naive Bayes for regression. Machine Learning, 41:5-25, 2000.

[4] J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77, 1997.

[5] A. G. Journel. Combining knowledge from diverse sources: An alternative to traditional data independence hypotheses. Mathematical Geology, 34(5):573-596, July 2002.

[6] H. Warner, H. Toronto, L. Veesey, and R. Stephenson. A mathematical approach to medical diagnosis. Journal of the American Medical Association, 177:177-183, 1961.