Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 18: Probabilistic Classification

Zaki & Meira Jr. (RPI and UFMG), Data Mining and Analysis, Chapter 18: Probabilistic Classification
Bayes Classifier

Let the training dataset D consist of n points x_i in a d-dimensional space, and let y_i denote the class for each point, with y_i ∈ {c_1, c_2, ..., c_k}.

The Bayes classifier estimates the posterior probability P(c_i | x) for each class c_i, and chooses the class that has the largest probability. The predicted class for x is given as
$$\hat{y} = \arg\max_{c_i} \{P(c_i \mid \mathbf{x})\}$$
According to Bayes' theorem, we have
$$P(c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c_i)\, P(c_i)}{P(\mathbf{x})}$$
Because P(x) is fixed for a given point, Bayes rule can be rewritten as
$$\hat{y} = \arg\max_{c_i}\{P(c_i \mid \mathbf{x})\} = \arg\max_{c_i}\left\{\frac{P(\mathbf{x} \mid c_i)\,P(c_i)}{P(\mathbf{x})}\right\} = \arg\max_{c_i}\{P(\mathbf{x} \mid c_i)\,P(c_i)\}$$
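Why P(x) can be dropped from the argmax is easy to see in code. A minimal sketch, where the three class labels and the likelihood/prior values are made up purely for illustration:

```python
# Hypothetical likelihoods P(x | c_i) and priors P(c_i) for three classes.
likelihood = {'c1': 0.3, 'c2': 0.5, 'c3': 0.1}
prior = {'c1': 0.5, 'c2': 0.2, 'c3': 0.3}

# Because P(x) is the same for every class, comparing P(x|c_i) * P(c_i)
# yields the same argmax as comparing the full posteriors P(c_i | x).
y_hat = max(likelihood, key=lambda c: likelihood[c] * prior[c])
```

Here the products are 0.15, 0.10, and 0.03, so the predicted class is c_1 even though c_2 has the largest likelihood.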
Estimating the Prior Probability: P(c_i)

Let D_i denote the subset of points in D that are labeled with class c_i:
$$\mathbf{D}_i = \{\mathbf{x}_j \in \mathbf{D} \mid y_j = c_i\}$$
Let the size of the dataset D be given as |D| = n, and let the size of each class-specific subset D_i be given as |D_i| = n_i. The prior probability for class c_i can be estimated as follows:
$$\hat{P}(c_i) = \frac{n_i}{n}$$
Estimating the Likelihood: Numeric Attributes, Parametric Approach

To estimate the likelihood P(x | c_i), we have to estimate the joint probability of x across all the d dimensions, i.e., we have to estimate P(x = (x_1, x_2, ..., x_d) | c_i).

In the parametric approach we assume that each class c_i is normally distributed, and we use the estimated mean μ̂_i and covariance matrix Σ̂_i to compute the probability density at x:
$$\hat{f}_i(\mathbf{x}) = \hat{f}(\mathbf{x} \mid \hat{\boldsymbol{\mu}}_i, \widehat{\boldsymbol{\Sigma}}_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{|\widehat{\boldsymbol{\Sigma}}_i|}} \exp\left\{-\frac{(\mathbf{x}-\hat{\boldsymbol{\mu}}_i)^T \widehat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}-\hat{\boldsymbol{\mu}}_i)}{2}\right\}$$
The posterior probability is then given as
$$P(c_i \mid \mathbf{x}) = \frac{\hat{f}_i(\mathbf{x})\,P(c_i)}{\sum_{j=1}^k \hat{f}_j(\mathbf{x})\,P(c_j)}$$
The predicted class for x is:
$$\hat{y} = \arg\max_{c_i} \bigl\{\hat{f}_i(\mathbf{x})\,P(c_i)\bigr\}$$
Bayes Classifier Algorithm

BAYESCLASSIFIER (D = {(x_j, y_j)}_{j=1}^n):
1   for i = 1, ..., k do
2       D_i ← {x_j | y_j = c_i, j = 1, ..., n}   // class-specific subsets
3       n_i ← |D_i|                              // cardinality
4       P̂(c_i) ← n_i / n                         // prior probability
5       μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j          // mean
6       Z_i ← D_i − 1 · μ̂_i^T                    // centered data
7       Σ̂_i ← (1/n_i) Z_i^T Z_i                  // covariance matrix
8   return P̂(c_i), μ̂_i, Σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, Σ̂_i, for all i ∈ [1, k]):
9   ŷ ← arg max_{c_i} { f(x | μ̂_i, Σ̂_i) · P̂(c_i) }
10  return ŷ
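The training and testing steps above can be sketched in Python with NumPy. This is a minimal illustration, not the book's reference implementation: the function names and toy data below are mine, and Σ̂_i is assumed to be nonsingular (in practice a small regularization term may be needed).

```python
import numpy as np

def bayes_fit(X, y):
    """Estimate prior, mean, and covariance for each class (1/n_i estimates)."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xi = X[y == c]                    # class-specific subset D_i
        ni = len(Xi)
        mu = Xi.mean(axis=0)              # mean
        Z = Xi - mu                       # centered data
        Sigma = (Z.T @ Z) / ni            # covariance matrix
        params[c] = (ni / n, mu, Sigma)   # (prior, mean, covariance)
    return params

def normal_density(x, mu, Sigma):
    """Multivariate normal density f(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def bayes_predict(x, params):
    """Pick the class maximizing f_i(x) * P(c_i)."""
    return max(params, key=lambda c: normal_density(x, *params[c][1:]) * params[c][0])
```

For example, fitting two well-separated 2D classes and predicting a point near one of the class means recovers that class.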
Bayes Classifier: Iris Data

Figure: X_1 (sepal length) versus X_2 (sepal width), with test point x = (6.75, 4.25)^T.
Bayes Classifier: Categorical Attributes

Let X_j be a categorical attribute over the domain dom(X_j) = {a_{j1}, a_{j2}, ..., a_{jm_j}}. Each categorical attribute X_j is modeled as an m_j-dimensional multivariate Bernoulli random variable X_j that takes on m_j distinct vector values e_{j1}, e_{j2}, ..., e_{jm_j}, where e_{jr} is the rth standard basis vector in R^{m_j} and corresponds to the rth value or symbol a_{jr} ∈ dom(X_j).

The entire d-dimensional dataset is modeled as the vector random variable X = (X_1, X_2, ..., X_d)^T. Let d' = Σ_{j=1}^d m_j; a categorical point x = (x_1, x_2, ..., x_d)^T is therefore represented as the d'-dimensional binary vector
$$\mathbf{v} = \begin{pmatrix} \mathbf{v}_1 \\ \vdots \\ \mathbf{v}_d \end{pmatrix} = \begin{pmatrix} \mathbf{e}_{1r_1} \\ \vdots \\ \mathbf{e}_{dr_d} \end{pmatrix}$$
where v_j = e_{jr_j} provided x_j = a_{jr_j} is the r_j-th value in the domain of X_j.
Bayes Classifier: Categorical Attributes

The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:
$$P(\mathbf{x} \mid c_i) = f(\mathbf{v} \mid c_i) = f(\mathbf{X}_1 = \mathbf{e}_{1r_1}, \ldots, \mathbf{X}_d = \mathbf{e}_{dr_d} \mid c_i)$$
The joint PMF can be estimated directly from the data D_i for each class c_i as follows:
$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v})}{n_i}$$
where n_i(v) is the number of times the value v occurs in class c_i. However, to avoid zero probabilities we add a pseudo-count of 1 for each value:
$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v}) + 1}{n_i + \prod_{j=1}^d m_j}$$
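The smoothed joint PMF estimate can be sketched as follows; the helper name and the toy counts in the usage note are mine, chosen to mirror the Iris setup (50 points, attribute domains of sizes 4 and 3):

```python
from collections import Counter

def joint_pmf_estimate(Di, domain_sizes):
    """Smoothed joint PMF for one class:
    f(v | c_i) = (n_i(v) + 1) / (n_i + prod_j m_j).

    Di is the list of categorical points (as tuples) labeled with class c_i;
    domain_sizes gives m_j = |dom(X_j)| for each attribute X_j.
    """
    ni = len(Di)
    counts = Counter(Di)          # n_i(v) for each observed value v
    total = 1
    for m in domain_sizes:
        total *= m                # number of possible values of v
    return lambda v: (counts[v] + 1) / (ni + total)
```

With n_i = 50 and domains of sizes 4 and 3, an unseen value v gets probability 1/(50 + 12), and a value seen 3 times gets (3 + 1)/(50 + 12).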
Discretized Iris Data: sepal length and sepal width

(a) Discretized sepal length:
    Bins          Domain
    [4.3, 5.2]    Very Short (a_11)
    (5.2, 6.1]    Short (a_12)
    (6.1, 7.0]    Long (a_13)
    (7.0, 7.9]    Very Long (a_14)

(b) Discretized sepal width:
    Bins          Domain
    [2.0, 2.8]    Short (a_21)
    (2.8, 3.6]    Medium (a_22)
    (3.6, 4.4]    Long (a_23)
Class-specific Empirical Joint Probability Mass Function

Class c_1:
    X_1 \ X_2            Short (e_21)   Medium (e_22)   Long (e_23)   f̂_{X_1}
    Very Short (e_11)    1/50           33/50           5/50          39/50
    Short (e_12)         0              3/50            8/50          13/50
    Long (e_13)          0              0               0             0
    Very Long (e_14)     0              0               0             0
    f̂_{X_2}              1/50           36/50           13/50

Class c_2:
    X_1 \ X_2            Short (e_21)   Medium (e_22)   Long (e_23)   f̂_{X_1}
    Very Short (e_11)    6/100          0               0             6/100
    Short (e_12)         24/100         15/100          0             39/100
    Long (e_13)          13/100         30/100          0             43/100
    Very Long (e_14)     3/100          7/100           2/100         12/100
    f̂_{X_2}              46/100         52/100          2/100
Iris Data: Test Case

Consider a test point x = (5.3, 3.0)^T corresponding to the categorical point (Short, Medium), which is represented as v = (e_12^T, e_22^T)^T. The prior probabilities of the classes are P̂(c_1) = 0.33 and P̂(c_2) = 0.67. The likelihood and posterior probability for each class are given as
    P̂(x | c_1) = f̂(v | c_1) = 3/50 = 0.06
    P̂(x | c_2) = f̂(v | c_2) = 15/100 = 0.15
    P̂(c_1 | x) ∝ 0.06 × 0.33 = 0.0198
    P̂(c_2 | x) ∝ 0.15 × 0.67 = 0.1005
In this case the predicted class is ŷ = c_2.
Iris Data: Test Case with Pseudo-counts

The test point x = (6.75, 4.25)^T corresponds to the categorical point (Long, Long), and it is represented as v = (e_13^T, e_23^T)^T. Unfortunately the probability mass at v is zero for both classes. We adjust the PMF via pseudo-counts, noting that the number of possible values is m_1 × m_2 = 4 × 3 = 12. The likelihood and posterior probability can then be computed as
    P̂(x | c_1) = f̂(v | c_1) = (0 + 1)/(50 + 12) = 1.61 × 10^−2
    P̂(x | c_2) = f̂(v | c_2) = (0 + 1)/(100 + 12) = 8.93 × 10^−3
    P̂(c_1 | x) ∝ (1.61 × 10^−2) × 0.33 = 5.32 × 10^−3
    P̂(c_2 | x) ∝ (8.93 × 10^−3) × 0.67 = 5.98 × 10^−3
Thus, the predicted class is ŷ = c_2.
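The arithmetic for this test case can be checked in a few lines (a quick verification sketch, using the same counts and priors as above):

```python
# Pseudo-count likelihoods: zero observed counts, 12 possible values of v.
p1 = (0 + 1) / (50 + 12)     # class c1, approx 1.61e-2
p2 = (0 + 1) / (100 + 12)    # class c2, approx 8.93e-3

# Unnormalized posteriors with priors P(c1) = 0.33 and P(c2) = 0.67.
post1 = p1 * 0.33            # approx 5.32e-3
post2 = p2 * 0.67            # approx 5.98e-3

y_hat = 'c2' if post2 > post1 else 'c1'
```

Even though c_1 gets the larger smoothed likelihood, the larger prior of c_2 tips the posterior in its favor.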
Bayes Classifier: Challenges

The main problem with the Bayes classifier is the lack of enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data. For numeric attributes we have to estimate O(d^2) covariances, and as the dimensionality increases, this requires us to estimate too many parameters. For categorical attributes we have to estimate the joint probability for all the possible values of v, given as ∏_j |dom(X_j)|. Even if each categorical attribute has only two values, we would need to estimate probabilities for 2^d values. However, because there can be at most n distinct values for v, most of the counts will be zero. The naive Bayes classifier addresses these concerns.
Naive Bayes Classifier: Numeric Attributes

The naive Bayes approach makes the simple assumption that all the attributes are independent, which implies that the likelihood can be decomposed into a product of dimension-wise probabilities:
$$P(\mathbf{x} \mid c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^d P(x_j \mid c_i)$$
The likelihood for class c_i, for dimension X_j, is given as
$$P(x_j \mid c_i) \propto f(x_j \mid \hat{\mu}_{ij}, \hat{\sigma}_{ij}^2) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_{ij}} \exp\left\{-\frac{(x_j - \hat{\mu}_{ij})^2}{2\hat{\sigma}_{ij}^2}\right\}$$
where μ̂_{ij} and σ̂²_{ij} denote the estimated mean and variance for attribute X_j, for class c_i.
Naive Bayes Classifier: Numeric Attributes

The naive assumption corresponds to setting all the covariances to zero in Σ̂_i, that is,
$$\widehat{\boldsymbol{\Sigma}}_i = \begin{pmatrix} \hat{\sigma}_{i1}^2 & 0 & \cdots & 0 \\ 0 & \hat{\sigma}_{i2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\sigma}_{id}^2 \end{pmatrix}$$
The naive Bayes classifier thus uses the sample mean μ̂_i = (μ̂_{i1}, ..., μ̂_{id})^T and a diagonal sample covariance matrix Σ̂_i = diag(σ̂²_{i1}, ..., σ̂²_{id}) for each class c_i. In total, 2d parameters have to be estimated, corresponding to the sample mean and sample variance for each dimension X_j.
Naive Bayes Algorithm

NAIVEBAYES (D = {(x_j, y_j)}_{j=1}^n):
1   for i = 1, ..., k do
2       D_i ← {x_j | y_j = c_i, j = 1, ..., n}   // class-specific subsets
3       n_i ← |D_i|                              // cardinality
4       P̂(c_i) ← n_i / n                         // prior probability
5       μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j          // mean
6       Z_i ← D_i − 1 · μ̂_i^T                    // centered data for class c_i
7       for j = 1, ..., d do                     // class-specific variance for X_j
8           σ̂²_ij ← (1/n_i) Z_ij^T Z_ij          // variance
9       σ̂_i ← (σ̂²_i1, ..., σ̂²_id)^T             // class-specific attribute variances
10  return P̂(c_i), μ̂_i, σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, σ̂_i, for all i ∈ [1, k]):
11  ŷ ← arg max_{c_i} { P̂(c_i) ∏_{j=1}^d f(x_j | μ̂_ij, σ̂²_ij) }
12  return ŷ
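A minimal NumPy sketch of the algorithm above (function names and toy data are mine). One deliberate departure from the pseudocode: the product of densities is computed in log space, which avoids underflow when d is large but yields the same argmax.

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Per-class prior, per-dimension mean, and per-dimension variance."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xi = X[y == c]
        # np.var uses ddof=0 by default, i.e. the 1/n_i estimate in the algorithm
        params[c] = (len(Xi) / n, Xi.mean(axis=0), Xi.var(axis=0))
    return params

def naive_bayes_predict(x, params):
    """Maximize P(c_i) * prod_j f(x_j | mu_ij, var_ij), via log for stability."""
    def log_score(c):
        prior, mu, var = params[c]
        return np.log(prior) + np.sum(
            -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
    return max(params, key=log_score)
```

Note that the variances must be strictly positive in every dimension; a constant attribute within a class would need a variance floor.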
Naive Bayes versus Full Bayes Classifier: Iris 2D Data

Figure: X_1 (sepal length) versus X_2 (sepal width), with test point x = (6.75, 4.25)^T. (a) Naive Bayes; (b) Full Bayes.
Naive Bayes: Categorical Attributes

The independence assumption leads to a simplification of the joint probability mass function:
$$P(\mathbf{x} \mid c_i) = \prod_{j=1}^d P(x_j \mid c_i) = \prod_{j=1}^d f(\mathbf{X}_j = \mathbf{e}_{jr_j} \mid c_i)$$
where f(X_j = e_{jr_j} | c_i) is the probability mass function for X_j, which can be estimated from D_i as follows:
$$\hat{f}(\mathbf{v}_j \mid c_i) = \frac{n_i(\mathbf{v}_j)}{n_i}$$
where n_i(v_j) is the observed frequency of the value v_j = e_{jr_j}, corresponding to the r_j-th categorical value a_{jr_j} of attribute X_j, in class c_i. If a count is zero, we can again use the pseudo-count method to avoid zero probabilities. The adjusted estimates with pseudo-counts are given as
$$\hat{f}(\mathbf{v}_j \mid c_i) = \frac{n_i(\mathbf{v}_j) + 1}{n_i + m_j}$$
where m_j = |dom(X_j)|.
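The per-attribute smoothed estimates can be sketched as below; the function names and the tiny two-attribute example are mine, for illustration only:

```python
from collections import Counter

def fit_attribute_pmfs(Di, domains):
    """Smoothed per-attribute PMFs for one class:
    f(v_j | c_i) = (n_i(v_j) + 1) / (n_i + m_j).

    Di is the list of categorical points (as tuples) labeled with class c_i;
    domains lists the possible values of each attribute X_j.
    """
    ni = len(Di)
    pmfs = []
    for j, dom in enumerate(domains):
        cnt = Counter(x[j] for x in Di)   # n_i(v_j) for each value of X_j
        pmfs.append({a: (cnt[a] + 1) / (ni + len(dom)) for a in dom})
    return pmfs

def naive_likelihood(x, pmfs):
    """P(x | c_i) as a product of per-attribute probabilities."""
    p = 1.0
    for j, pmf in enumerate(pmfs):
        p *= pmf[x[j]]
    return p
```

Because each attribute is smoothed with its own m_j, no value of any attribute can receive zero probability, so the product never collapses to zero.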
Nonparametric Approach: K Nearest Neighbors Classifier

We consider a nonparametric approach for likelihood estimation using nearest-neighbors density estimation. Let D be a training dataset comprising n points x_i ∈ R^d, and let D_i denote the subset of points in D that are labeled with class c_i, with n_i = |D_i|.

Given a test point x ∈ R^d and K, the number of neighbors to consider, let r denote the distance from x to its K-th nearest neighbor in D. Consider the d-dimensional hyperball of radius r around the test point x, defined as
$$B_d(\mathbf{x}, r) = \{\mathbf{x}_i \in \mathbf{D} \mid \delta(\mathbf{x}, \mathbf{x}_i) \le r\}$$
Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖_2. We assume that |B_d(x, r)| = K.
Nonparametric Approach: K Nearest Neighbors Classifier

Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is
$$K_i = \bigl|\{\mathbf{x}_j \in B_d(\mathbf{x}, r) \mid y_j = c_i\}\bigr|$$
The class conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball, divided by its volume, that is
$$\hat{f}(\mathbf{x} \mid c_i) = \frac{K_i / n_i}{V} = \frac{K_i}{n_i V}$$
where V = vol(B_d(x, r)) is the volume of the d-dimensional hyperball. The posterior probability P(c_i | x) can be estimated as
$$P(c_i \mid \mathbf{x}) = \frac{\hat{f}(\mathbf{x} \mid c_i)\,\hat{P}(c_i)}{\sum_{j=1}^k \hat{f}(\mathbf{x} \mid c_j)\,\hat{P}(c_j)}$$
However, because P̂(c_i) = n_i/n, we have
$$\hat{f}(\mathbf{x} \mid c_i)\,\hat{P}(c_i) = \frac{K_i}{n_i V} \cdot \frac{n_i}{n} = \frac{K_i}{nV}$$
Nonparametric Approach: K Nearest Neighbors Classifier

The posterior probability is given as
$$P(c_i \mid \mathbf{x}) = \frac{K_i/(nV)}{\sum_{j=1}^k K_j/(nV)} = \frac{K_i}{K}$$
Finally, the predicted class for x is
$$\hat{y} = \arg\max_{c_i}\{P(c_i \mid \mathbf{x})\} = \arg\max_{c_i}\left\{\frac{K_i}{K}\right\} = \arg\max_{c_i}\{K_i\}$$
Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
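Since the posterior reduces to K_i/K, the whole classifier reduces to a majority vote among the K nearest neighbors. A minimal sketch (function name and toy data are mine; ties among classes are broken arbitrarily by `Counter.most_common`):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X, y, K):
    """Predict the majority class among the K nearest neighbors of x."""
    dist = np.linalg.norm(X - x, axis=1)   # Euclidean distances to all points
    nearest = np.argsort(dist)[:K]         # indices of the K nearest neighbors
    return Counter(y[nearest]).most_common(1)[0][0]  # majority class
```

Note that no density or volume is ever computed: V cancels in the posterior, which is what makes the KNN classifier this simple.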
Iris Data: K Nearest Neighbors Classifier

Figure: X_1 versus X_2, with test point x = (6.75, 4.25)^T and the hyperball of radius r around it.