Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 18: Probabilistic Classification

Bayes Classifier

Let the training dataset D consist of n points x_i in a d-dimensional space, and let y_i denote the class of each point, with y_i ∈ {c_1, c_2, ..., c_k}.

The Bayes classifier estimates the posterior probability P(c_i | x) for each class c_i and chooses the class with the largest posterior probability. The predicted class for x is given as

$$\hat{y} = \arg\max_{c_i} \{ P(c_i \mid x) \}$$

According to Bayes' theorem, we have

$$P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)}$$

Because P(x) is fixed for a given point, Bayes rule can be rewritten as

$$\hat{y} = \arg\max_{c_i} \{ P(c_i \mid x) \} = \arg\max_{c_i} \left\{ \frac{P(x \mid c_i)\, P(c_i)}{P(x)} \right\} = \arg\max_{c_i} \{ P(x \mid c_i)\, P(c_i) \}$$
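
As a minimal sketch (not from the book), the decision rule amounts to an argmax over likelihood-times-prior scores; the likelihood and prior values below are hypothetical placeholders.

    import numpy as np

    # Hypothetical likelihoods P(x | c_i) and priors P(c_i) for k = 3 classes
    likelihoods = np.array([0.02, 0.10, 0.05])
    priors = np.array([0.50, 0.30, 0.20])

    scores = likelihoods * priors          # proportional to P(c_i | x)
    posteriors = scores / scores.sum()     # divide by P(x) = sum_i P(x | c_i) P(c_i)
    y_hat = int(np.argmax(scores))         # same argmax as the normalized posteriors
    print(posteriors, y_hat)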

Estimating the Prior Probability: P(c_i)

Let D_i denote the subset of points in D that are labeled with class c_i:

$$D_i = \{ x_j \in D \mid x_j \text{ has class } y_j = c_i \}$$

Let the size of the dataset D be |D| = n, and let the size of each class-specific subset D_i be |D_i| = n_i. The prior probability for class c_i can then be estimated as

$$\hat{P}(c_i) = \frac{n_i}{n}$$
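
As a one-line sketch, the priors can be read off a label vector; the array y below is hypothetical and assumes integer-coded labels 0, ..., k−1.

    import numpy as np

    y = np.array([0, 0, 1, 2, 1, 0, 2, 2])          # hypothetical class labels
    priors = np.bincount(y, minlength=3) / len(y)   # P̂(c_i) = n_i / n
    print(priors)                                    # [0.375 0.25  0.375]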

Estimating the Likelihood: Numeric Attributes, Parametric Approach

To estimate the likelihood P(x | c_i), we have to estimate the joint probability of x across all d dimensions, i.e., we have to estimate P(x = (x_1, x_2, ..., x_d) | c_i).

In the parametric approach we assume that each class c_i is normally distributed, and we use the estimated mean μ̂_i and covariance matrix Σ̂_i to compute the probability density at x:

$$\hat{f}_i(x) = \hat{f}(x \mid \hat{\mu}_i, \hat{\Sigma}_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{|\hat{\Sigma}_i|}} \exp\left\{ -\frac{(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i)}{2} \right\}$$

The posterior probability is then given as

$$P(c_i \mid x) = \frac{\hat{f}_i(x)\, P(c_i)}{\sum_{j=1}^{k} \hat{f}_j(x)\, P(c_j)}$$

The predicted class for x is

$$\hat{y} = \arg\max_{c_i} \left\{ \hat{f}_i(x)\, P(c_i) \right\}$$
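
The following NumPy sketch evaluates the density formula above and the resulting posteriors; the class means, covariances, and priors are hypothetical placeholders, not the Iris estimates.

    import numpy as np

    def gaussian_density(x, mu, cov):
        """Multivariate normal density f(x | mu, cov), following the formula above."""
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

    # Hypothetical class parameters for two classes in 2D
    mus    = [np.array([5.0, 3.4]), np.array([6.3, 2.9])]
    covs   = [np.eye(2) * 0.1,      np.eye(2) * 0.2]
    priors = [0.33, 0.67]

    x = np.array([6.75, 4.25])
    scores = np.array([gaussian_density(x, m, S) * p for m, S, p in zip(mus, covs, priors)])
    posteriors = scores / scores.sum()
    print(posteriors, posteriors.argmax())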

Bayes Classifier Algorithm

BAYESCLASSIFIER (D = {(x_j, y_j)}_{j=1}^{n}):
    for i = 1, ..., k do
        D_i ← { x_j | y_j = c_i, j = 1, ..., n }     // class-specific subsets
        n_i ← |D_i|                                  // cardinality
        P̂(c_i) ← n_i / n                             // prior probability
        μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j              // mean
        Z_i ← D_i − 1 · μ̂_i^T                        // centered data
        Σ̂_i ← (1/n_i) Z_i^T Z_i                      // covariance matrix
    return P̂(c_i), μ̂_i, Σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, Σ̂_i, for all i ∈ [1, k]):
    ŷ ← arg max_{c_i} { f(x | μ̂_i, Σ̂_i) · P̂(c_i) }
    return ŷ
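
A compact NumPy sketch of the training and testing steps above; the function names are mine, X and y are assumed to be an (n, d) array and an (n,) label array, and the biased 1/n_i covariance estimate follows the pseudocode.

    import numpy as np

    def bayes_train(X, y):
        """Estimate prior, mean, and covariance per class (full Bayes classifier)."""
        params = {}
        for c in np.unique(y):
            Xi = X[y == c]                           # class-specific subset D_i
            n_i = len(Xi)
            mu = Xi.mean(axis=0)                     # mean
            Z = Xi - mu                              # centered data
            cov = (Z.T @ Z) / n_i                    # biased covariance, as in the pseudocode
            params[c] = (n_i / len(X), mu, cov)
        return params

    def bayes_predict(x, params):
        """Pick the class maximizing f(x | mu_i, Sigma_i) * P(c_i)."""
        best, best_score = None, -np.inf
        for c, (prior, mu, cov) in params.items():
            d = len(mu)
            diff = x - mu
            dens = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
                   ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
            if dens * prior > best_score:
                best, best_score = c, dens * prior
        return best

For instance, bayes_predict(np.array([6.75, 4.25]), bayes_train(X, y)) would classify the Iris test point used in the figures below, assuming X and y hold the training data.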

Figure: Bayes classifier on the Iris data, X_1 (sepal length) versus X_2 (sepal width), with test point x = (6.75, 4.25)^T.

Bayes Classifier: Categorical Attributes

Let X_j be a categorical attribute over the domain dom(X_j) = {a_{j1}, a_{j2}, ..., a_{jm_j}}. Each categorical attribute X_j is modeled as an m_j-dimensional multivariate Bernoulli random variable X_j that takes on m_j distinct vector values e_{j1}, e_{j2}, ..., e_{jm_j}, where e_{jr} is the r-th standard basis vector in R^{m_j} and corresponds to the r-th value or symbol a_{jr} ∈ dom(X_j).

The entire d-dimensional dataset is modeled as the vector random variable X = (X_1, X_2, ..., X_d)^T. Let d' = Σ_{j=1}^{d} m_j; a categorical point x = (x_1, x_2, ..., x_d)^T is therefore represented as the d'-dimensional binary vector

$$v = \begin{pmatrix} v_1 \\ \vdots \\ v_d \end{pmatrix} = \begin{pmatrix} e_{1r_1} \\ \vdots \\ e_{dr_d} \end{pmatrix}$$

where v_j = e_{jr_j} provided x_j = a_{jr_j} is the r_j-th value in the domain of X_j.

Bayes Classifier: Categorical Attributes

The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:

$$P(x \mid c_i) = f(v \mid c_i) = f(X_1 = e_{1r_1}, \ldots, X_d = e_{dr_d} \mid c_i)$$

The joint PMF can be estimated directly from the data D_i for each class c_i as follows:

$$\hat{f}(v \mid c_i) = \frac{n_i(v)}{n_i}$$

where n_i(v) is the number of times the value v occurs in class c_i. However, to avoid zero probabilities we add a pseudo-count of 1 for each value:

$$\hat{f}(v \mid c_i) = \frac{n_i(v) + 1}{n_i + \prod_{j=1}^{d} m_j}$$
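
A sketch of the joint PMF estimate with pseudo-counts, assuming each categorical point is stored as a tuple of symbols; the helper name and the commented usage are hypothetical.

    from collections import Counter

    def joint_pmf_estimate(points_in_class, num_possible_values):
        """Estimate f̂(v | c_i) with a pseudo-count of 1 per possible value.

        points_in_class: list of categorical points, each a tuple like ('Short', 'Medium').
        num_possible_values: product of the domain sizes, prod_j m_j.
        """
        counts = Counter(points_in_class)
        n_i = len(points_in_class)

        def pmf(v):
            return (counts[v] + 1) / (n_i + num_possible_values)
        return pmf

    # Hypothetical usage: 50 points of class c_1 over a 4 x 3 domain
    # f1 = joint_pmf_estimate(c1_points, 4 * 3); f1(('Long', 'Long'))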

Discretized Iris Data: sepal length and sepal width

(a) Discretized sepal length

    Bins          Domain
    [4.3, 5.2]    Very Short (a_11)
    (5.2, 6.1]    Short (a_12)
    (6.1, 7.0]    Long (a_13)
    (7.0, 7.9]    Very Long (a_14)

(b) Discretized sepal width

    Bins          Domain
    [2.0, 2.8]    Short (a_21)
    (2.8, 3.6]    Medium (a_22)
    (3.6, 4.4]    Long (a_23)

Class-specific Empirical Joint Probability Mass Function

Class c_1:

    X_1 \ X_2             Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
    Very Short (e_11)     1/50           33/50           5/50          39/50
    Short (e_12)          0              3/50            8/50          11/50
    Long (e_13)           0              0               0             0
    Very Long (e_14)      0              0               0             0
    f̂_X2                  1/50           36/50           13/50

Class c_2:

    X_1 \ X_2             Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
    Very Short (e_11)     6/100          0               0             6/100
    Short (e_12)          24/100         15/100          0             39/100
    Long (e_13)           13/100         30/100          0             43/100
    Very Long (e_14)      3/100          7/100           2/100         12/100
    f̂_X2                  46/100         52/100          2/100

Iris Data: Test Case

Consider a test point x = (5.3, 3.0)^T, corresponding to the categorical point (Short, Medium), which is represented as v = (e_12^T, e_22^T)^T.

The prior probabilities of the classes are P̂(c_1) = 0.33 and P̂(c_2) = 0.67. The likelihood and posterior probability for each class are given as

    P̂(x | c_1) = f̂(v | c_1) = 3/50 = 0.06
    P̂(x | c_2) = f̂(v | c_2) = 15/100 = 0.15

    P̂(c_1 | x) ∝ 0.06 × 0.33 = 0.0198
    P̂(c_2 | x) ∝ 0.15 × 0.67 = 0.1005

In this case the predicted class is ŷ = c_2.

Iris Data: Test Case with Pseudo-counts

The test point x = (6.75, 4.25)^T corresponds to the categorical point (Long, Long), which is represented as v = (e_13^T, e_23^T)^T. Unfortunately the probability mass at v is zero for both classes. We therefore adjust the PMF via pseudo-counts, noting that the number of possible values is m_1 × m_2 = 4 × 3 = 12. The likelihood and posterior probability can then be computed as

    P̂(x | c_1) = f̂(v | c_1) = (0 + 1)/(50 + 12) = 1.61 × 10^-2
    P̂(x | c_2) = f̂(v | c_2) = (0 + 1)/(100 + 12) = 8.93 × 10^-3

    P̂(c_1 | x) ∝ (1.61 × 10^-2) × 0.33 = 5.32 × 10^-3
    P̂(c_2 | x) ∝ (8.93 × 10^-3) × 0.67 = 5.98 × 10^-3

Thus, the predicted class is ŷ = c_2.

Bayes Classifier: Challenges

The main problem with the Bayes classifier is the lack of enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data.

For numeric attributes we have to estimate O(d^2) covariances, and as the dimensionality increases, this requires us to estimate too many parameters.

For categorical attributes we have to estimate the joint probability for all possible values of v, given as ∏_j |dom(X_j)|. Even if each categorical attribute had only two values, we would need to estimate probabilities for 2^d values. However, because there can be at most n distinct values for v, most of the counts will be zero.

The naive Bayes classifier addresses these concerns.

Naive Bayes Classifier: Numeric Attributes

The naive Bayes approach makes the simple assumption that all the attributes are independent, which implies that the likelihood can be decomposed into a product of dimension-wise probabilities:

$$P(x \mid c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i)$$

The likelihood for class c_i, for dimension X_j, is given as

$$P(x_j \mid c_i) \propto f(x_j \mid \hat{\mu}_{ij}, \hat{\sigma}_{ij}^2) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_{ij}} \exp\left\{ -\frac{(x_j - \hat{\mu}_{ij})^2}{2\hat{\sigma}_{ij}^2} \right\}$$

where μ̂_ij and σ̂²_ij denote the estimated mean and variance for attribute X_j, for class c_i.
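
A small sketch of the dimension-wise Gaussian likelihood product; the class parameters shown are hypothetical placeholders.

    import numpy as np

    def naive_likelihood(x, mu, var):
        """Product of per-dimension Gaussian densities, prod_j f(x_j | mu_ij, sigma^2_ij)."""
        dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        return dens.prod()

    # Hypothetical per-dimension parameters for one class in 2D
    mu_i  = np.array([5.0, 3.4])
    var_i = np.array([0.12, 0.14])
    print(naive_likelihood(np.array([6.75, 4.25]), mu_i, var_i))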

Naive Bayes Classifier: Numeric Attributes

The naive assumption corresponds to setting all the covariances in Σ_i to zero, that is,

$$\hat{\Sigma}_i = \begin{pmatrix} \hat{\sigma}_{i1}^2 & 0 & \cdots & 0 \\ 0 & \hat{\sigma}_{i2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\sigma}_{id}^2 \end{pmatrix}$$

The naive Bayes classifier thus uses the sample mean μ̂_i = (μ̂_i1, ..., μ̂_id)^T and a diagonal sample covariance matrix Σ̂_i = diag(σ̂²_i1, ..., σ̂²_id) for each class c_i. In total, 2d parameters have to be estimated, corresponding to the sample mean and sample variance for each dimension X_j.

Naive Bayes Algorithm

NAIVEBAYES (D = {(x_j, y_j)}_{j=1}^{n}):
    for i = 1, ..., k do
        D_i ← { x_j | y_j = c_i, j = 1, ..., n }     // class-specific subsets
        n_i ← |D_i|                                  // cardinality
        P̂(c_i) ← n_i / n                             // prior probability
        μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j              // mean
        Z_i ← D_i − 1 · μ̂_i^T                        // centered data for class c_i
        for j = 1, ..., d do                         // class-specific variance for X_j
            σ̂²_ij ← (1/n_i) Z_ij^T Z_ij              // variance
        σ̂_i ← (σ̂²_i1, ..., σ̂²_id)^T                  // class-specific attribute variances
    return P̂(c_i), μ̂_i, σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, σ̂_i, for all i ∈ [1, k]):
    ŷ ← arg max_{c_i} { P̂(c_i) ∏_{j=1}^{d} f(x_j | μ̂_ij, σ̂²_ij) }
    return ŷ
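
A NumPy sketch of the algorithm above; the function names are mine, X and y are assumed to be an (n, d) array and an (n,) label array, and NumPy's default (biased) variance matches the 1/n_i estimate in the pseudocode.

    import numpy as np

    def naive_bayes_train(X, y):
        """Per-class prior, per-dimension mean and (biased) variance."""
        params = {}
        for c in np.unique(y):
            Xi = X[y == c]
            params[c] = (len(Xi) / len(X), Xi.mean(axis=0), Xi.var(axis=0))
        return params

    def naive_bayes_predict(x, params):
        """Pick the class maximizing P(c_i) * prod_j f(x_j | mu_ij, sigma^2_ij)."""
        def score(prior, mu, var):
            dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            return prior * dens.prod()
        return max(params, key=lambda c: score(*params[c]))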

Figure: (a) Naive Bayes versus (b) full Bayes classifier on the Iris 2D data, X_1 (sepal length) versus X_2 (sepal width), with test point x = (6.75, 4.25)^T.

Naive Bayes: Categorical Attributes

The independence assumption leads to a simplification of the joint probability mass function:

$$P(x \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i) = \prod_{j=1}^{d} f(X_j = e_{jr_j} \mid c_i)$$

where f(X_j = e_{jr_j} | c_i) is the probability mass function for X_j, which can be estimated from D_i as follows:

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j)}{n_i}$$

where n_i(v_j) is the observed frequency of the value v_j = e_{jr_j}, corresponding to the r_j-th categorical value a_{jr_j} of attribute X_j, in class c_i.

If a count is zero, we can use the pseudo-count method to obtain a prior probability. The adjusted estimates with pseudo-counts are given as

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j) + 1}{n_i + m_j}$$

where m_j = |dom(X_j)|.
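
A sketch of the per-attribute PMF estimates with pseudo-counts, assuming categorical points are stored as tuples of symbols; the helper name and the commented usage are hypothetical.

    from collections import Counter

    def attribute_pmfs(points_in_class, domain_sizes):
        """Per-attribute PMFs f̂(v_j | c_i) with pseudo-counts, one Counter per attribute."""
        n_i = len(points_in_class)
        counters = [Counter(p[j] for p in points_in_class) for j in range(len(domain_sizes))]

        def pmf(j, value):
            return (counters[j][value] + 1) / (n_i + domain_sizes[j])
        return pmf

    # Hypothetical usage for the discretized Iris data (domains of size 4 and 3):
    # f1 = attribute_pmfs(c1_points, [4, 3])
    # likelihood = f1(0, 'Long') * f1(1, 'Long')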

Nonparametric Approach: K Nearest Neighbors Classifier

We consider a nonparametric approach for likelihood estimation using nearest-neighbors density estimation. Let D be a training dataset comprising n points x_i ∈ R^d, and let D_i denote the subset of points in D that are labeled with class c_i, with n_i = |D_i|.

Given a test point x ∈ R^d, and K, the number of neighbors to consider, let r denote the distance from x to its K-th nearest neighbor in D. Consider the d-dimensional hyperball of radius r around the test point x, defined as

$$B_d(x, r) = \{ x_i \in D \mid \delta(x, x_i) \le r \}$$

Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖_2. We assume that |B_d(x, r)| = K.

Nonparametric Approach: K Nearest Neighbors Classifier

Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is,

$$K_i = \bigl| \{ x_j \in B_d(x, r) \mid y_j = c_i \} \bigr|$$

The class-conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball, divided by its volume, that is,

$$\hat{f}(x \mid c_i) = \frac{K_i / n_i}{V} = \frac{K_i}{n_i V}$$

where V = vol(B_d(x, r)) is the volume of the d-dimensional hyperball. The posterior probability P(c_i | x) can be estimated as

$$\hat{P}(c_i \mid x) = \frac{\hat{f}(x \mid c_i)\,\hat{P}(c_i)}{\sum_{j=1}^{k} \hat{f}(x \mid c_j)\,\hat{P}(c_j)}$$

However, because P̂(c_i) = n_i/n, we have

$$\hat{f}(x \mid c_i)\,\hat{P}(c_i) = \frac{K_i}{n_i V} \cdot \frac{n_i}{n} = \frac{K_i}{nV}$$

Nonparametric Approach: K Nearest Neighbors Classifier

The posterior probability is therefore given as

$$P(c_i \mid x) = \frac{K_i / (nV)}{\sum_{j=1}^{k} K_j / (nV)} = \frac{K_i}{K}$$

Finally, the predicted class for x is

$$\hat{y} = \arg\max_{c_i} \{ P(c_i \mid x) \} = \arg\max_{c_i} \left\{ \frac{K_i}{K} \right\} = \arg\max_{c_i} \{ K_i \}$$

Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
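
A sketch of the resulting KNN rule: find the K nearest training points and take the majority class. The function name is mine, and the distance is assumed Euclidean, as above.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, K):
        """Majority vote among the K nearest neighbors (Euclidean distance)."""
        dists = np.linalg.norm(X_train - x, axis=1)        # delta(x, x_i) = ||x - x_i||_2
        nn_idx = np.argsort(dists)[:K]                     # the K points inside B_d(x, r)
        counts = Counter(np.asarray(y_train)[nn_idx])      # K_i for each class c_i
        return counts.most_common(1)[0][0]                 # arg max_i K_i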

Figure: K nearest neighbors classifier on the Iris data, X_1 (sepal length) versus X_2 (sepal width), showing the hyperball of radius r around the test point x = (6.75, 4.25)^T.