On misclassification probabilities of linear and quadratic classifiers


Afrika Statistika, Vol. 11(1), 2016, pages 943-953.

Olusola Samuel Makinde
Department of Statistics, Federal University of Technology, Akure, Nigeria
Corresponding author: osmakinde@futa.edu.ng

Received March 17, 2016; Accepted May 20, 2016.
Copyright © 2016, Afrika Statistika and Statistics and Probability African Society (SPAS). All rights reserved.

Abstract. We study the theoretical misclassification probabilities of linear and quadratic classifiers and examine the performance of these classifiers under distributional variations, both in theory and by simulation. We derive expressions for the Bayes error for some competing distributions from the same family under location shift.

Résumé. We study the theoretical misclassification probabilities of linear and quadratic classification methods. We then examine the performance of these classification methods under different distributions, theoretically, supported by simulation studies. We express the Bayes error for competing distributions from the same family under a shift of the location parameter.

Key words: Misclassification probability; linear classifier; quadratic classifier; Bayes risk.
AMS 2010 Mathematics Subject Classification: 62H30; 60E.

1. Introduction

Classification is aimed at extracting maximum information about the separability of, or distinction among, classes or populations, and then assigning each observation to one of these populations on the basis of a vector of measurements or features. It has many important applications in different fields, such as disease diagnosis in the medical sciences, risk identification in finance, and admission of prospective students into a university based on a battery of tests, among others. Anderson (1984) described the classification problem as a problem of statistical decision making. A good classification procedure is one that classifies observations from unknown populations correctly. Suppose the competing populations have well-defined distributions which are characterised by some location and scale parameters. Classification of observations to any of these populations can be viewed, through this characterisation, in terms of shifts in the location

and scale of each of the population distributions. Competing populations may exhibit a location shift, a scale shift, or both.

Consider populations $\pi_j$, $j = 1, 2, \ldots, J$, from multivariate distributions $F_j$ having probability density functions $f_j$ and prior probabilities $p_j$. The Bayes rule, proposed in Welch (1939), is to assign each observation $x$ to the population $\pi_j$ whose posterior probability $P(\pi_j \mid x)$ is highest. In a two-class problem, it assigns $x$ to population $\pi_1$ if $f_1(x)p_1 / (f_2(x)p_2) > 1$ and to $\pi_2$ otherwise. Wald (1944) argued that if there is a cost $C(i \mid j)$ associated with misclassifying $x$, whose true population is $\pi_j$, into $\pi_i$, then observations should be assigned to the class or population with the smallest expected cost of misclassification. Welch (1939) showed that for any two normally distributed populations, the ratio of the log-likelihood functions of the two populations is the theoretical basis for building the discriminant function that best classifies new individuals to either of the two populations, given that the prior probabilities of the populations are known. For two competing populations whose principal difference is in location, Fisher (1936) described the separation between these two populations as the ratio of the variance between the populations to the variance within the populations. This postulation leads to the discriminant analysis called Fisher's discriminant analysis. Suppose there are two populations from the same family of multivariate distributions to which observations can be classified. If these populations are normally distributed and have the same covariance matrix, the discriminant analysis is referred to as linear discriminant analysis (LDA). Similarly, if these populations are normally distributed but have different covariance matrices, the optimal rule is nonlinear and is referred to as quadratic discriminant analysis (QDA). Welch (1939) and Wald (1944) showed that the linear discriminant function has optimal properties for two-group classification if the populations are multivariate normally distributed. Krzanowski (1977) reviewed the performance of Fisher's linear discriminant function when the underlying assumptions are violated. Many classification methods, both parametric and nonparametric, have been compared with LDA and QDA under normality and non-normality; see Ghosh and Chaudhuri (2005), Kim et al. (2011) and Li et al. (2012), among others.

One way of evaluating the performance of a classification rule is to calculate its misclassification probabilities. One can define the total probability of misclassification as
$$\Delta = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx,$$
where
$$R_1 = \Big\{ x \in \mathbb{R}^d : \frac{f_1(x)}{f_2(x)} \ge \frac{C(1 \mid 2)\,p_2}{C(2 \mid 1)\,p_1} \Big\} \quad\text{and}\quad R_2 = \Big\{ x \in \mathbb{R}^d : \frac{f_1(x)}{f_2(x)} < \frac{C(1 \mid 2)\,p_2}{C(2 \mid 1)\,p_1} \Big\}.$$
The classification regions $R_1$ and $R_2$ can be constructed only when the distributions $F$ and $G$ are fully known (Makinde and Chakraborty, 2015). This will rarely be the case, so we have to work with the empirical versions of the classification regions and calculate the resulting error rates. In this paper, we study the misclassification probabilities of linear and quadratic classifiers, with emphasis on multivariate normal distributions, and deduce mathematical expressions for the Bayes error for some multivariate distributions under suitable conditions.
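As a complement to this definition, the following minimal Python sketch estimates $\Delta$ by Monte Carlo for two fully specified bivariate normal densities. All numerical values (means, priors, costs, sample size) are illustrative choices for the sketch, not quantities taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Monte Carlo sketch of the total misclassification probability
#   Delta = p1 * P(x in R2 | pi_1) + p2 * P(x in R1 | pi_2),
# with R1 = {x : f1(x)/f2(x) >= C(1|2) p2 / (C(2|1) p1)}.
rng = np.random.default_rng(0)
p1, p2 = 0.5, 0.5          # prior probabilities (illustrative)
c12, c21 = 1.0, 1.0        # misclassification costs C(1|2), C(2|1)
f1 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
f2 = multivariate_normal(mean=[1.0, 0.0], cov=np.eye(2))
threshold = (c12 * p2) / (c21 * p1)

def in_R1(x):
    # x belongs to R1 when the density ratio exceeds the cost/prior threshold
    return f1.pdf(x) / f2.pdf(x) >= threshold

n = 200_000
x1 = f1.rvs(size=n, random_state=rng)   # draws from pi_1
x2 = f2.rvs(size=n, random_state=rng)   # draws from pi_2
delta = p1 * np.mean(~in_R1(x1)) + p2 * np.mean(in_R1(x2))
print(delta)
```

For this configuration the estimate should be close to $\Phi(-1/2) \approx 0.3085$, the LDA value derived in Section 2.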

2. Classification Rules Based on Normality

We consider $J$ populations having density functions of the form
$$f_j(x) = g\big((x - \mu_j)'\Sigma_j^{-1}(x - \mu_j)\big), \quad x \in \mathbb{R}^d,\ j = 1, \ldots, J,$$
for some strictly decreasing, continuous, non-negative scalar function $g$, where $\mu_j$ and $\Sigma_j$ are the mean vector and covariance matrix of the $j$th population respectively. Assuming normality, equal prior probabilities and equal costs of misclassification for the $J$ populations, the Bayes rule can be defined as: assign $x$ to $\pi_k$ if
$$D_k(x, \mu_k, \Sigma_k) = \min_{1 \le j \le J} D_j(x, \mu_j, \Sigma_j), \tag{1}$$
where $D_j(x, \mu_j, \Sigma_j) = (x - \mu_j)'\Sigma_j^{-1}(x - \mu_j) + \log|\Sigma_j|$. This classification rule is known as quadratic discriminant analysis (QDA). It is linear if $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_J$. Define $R_k$, the region of classification into the $k$th population, as
$$R_k = \{X : D_k(X, \mu_k, \Sigma_k) = \min_{1 \le j \le J} D_j(X, \mu_j, \Sigma_j)\}. \tag{2}$$
Then the probability of misclassification of the optimal rule in (1) is
$$\Delta = \sum_{j=1}^{J} p_j\, P(x \notin R_j \mid x \in \pi_j).$$
A good classification method is one that minimises $\Delta$.

Let us consider two classes for simplicity. Suppose $\pi_i \sim N(\mu_i, \Sigma_i)$, $i = 1, 2$. For $\Sigma_1 = \Sigma_2 = \Sigma$, the classification rule in (1) can be expressed as: assign $x$ to $\pi_1$ if
$$(\mu_1 - \mu_2)'\Sigma^{-1}\Big(x - \frac{\mu_1 + \mu_2}{2}\Big) > 0. \tag{3}$$
Observe that if $\Sigma$ is a constant multiple of the identity matrix, (3) is a generalisation of the component-wise centroid classifier (see Hall et al., 2009). The probability of misclassifying $x$ into either $\pi_1$ or $\pi_2$ is then
$$\Delta = \Phi(-c_0/2), \quad\text{where } c_0^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$$
and $\Phi$ is the cumulative distribution function of the standard normal distribution. See Johnson and Wichern (2007) for further discussion.

To illustrate the probability of misclassification, let $\pi_1$ and $\pi_2$ be two $d$-variate normal populations with mean vectors and covariance matrices $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ respectively. Assume that the prior probabilities of $\pi_1$ and $\pi_2$ are equal. Consider $\mu_1 = (0, 0)'$, $\mu_2 = (\delta, 0)'$ and $\Sigma_1 = \Sigma_2 = I$. The total probability of misclassification associated with LDA is a function of the non-centrality parameter $\delta$ and is obtained as $\Delta = \Phi(-\delta/2)$. When the covariance matrix of one population is a scalar multiple of that of the other, the results in Theorem 1 below hold.
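The closed-form LDA error is straightforward to evaluate; a minimal sketch, with illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

# Sketch of the closed-form LDA error Delta = Phi(-c0/2), with
# c0^2 = (mu1 - mu2)' Sigma^{-1} (mu1 - mu2), equal priors, common Sigma.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.5, 0.0])          # delta = 1.5, an illustrative value
Sigma = np.eye(2)
diff = mu1 - mu2
c0 = np.sqrt(diff @ np.linalg.solve(Sigma, diff))
print(norm.cdf(-c0 / 2))            # equals Phi(-delta/2) when Sigma = I
```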

Theorem 1. Let $F$ and $G$ be two competing distributions with prior probabilities $p_1$ and $p_2$ respectively. Suppose $F \sim N(\mu_1, \Sigma_1)$ and $G \sim N(\mu_2, \Sigma_2)$. Take $\mu_1 = \mu_2 = (0, 0)'$, $\Sigma_1 = I$, $\Sigma_2 = \sigma^2 I$ for $\sigma^2 \ne 1$ and $p_1 = p_2 = 0.5$. Then:

1. For $\sigma > 1$,
$$\Delta = \frac{1}{2}\Big[1 - F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big) + F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big)\Big],$$
where $F(\cdot)$ denotes the distribution function of the central chi-square distribution with 2 degrees of freedom.

2. For $0 < \sigma < 1$,
$$\Delta = \frac{1}{2}\Big[1 + F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big) - F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big)\Big].$$

Proof of Theorem 1. For $x \in \mathbb{R}^d$, if $x \sim N(\mu_1, \Sigma_1)$ then $x'x \sim \chi^2_d$, and if $x \sim N(\mu_2, \Sigma_2)$ then $x'x \sim \sigma^2\chi^2_d$. Now $f_1(x)/f_2(x) \ge 1$ implies
$$(x - \mu_1)'\Sigma_1^{-1}(x - \mu_1) - (x - \mu_2)'\Sigma_2^{-1}(x - \mu_2) \le \log_e|\Sigma_2| - \log_e|\Sigma_1|.$$
This gives $\frac{\sigma^2 - 1}{\sigma^2}\, x'x \le 2\log_e \sigma^2$. For $\sigma > 0$, we consider two cases: $\sigma > 1$ and $\sigma < 1$.

1. When $\sigma > 1$, the regions of classification are
$$R_1 : x'x \le \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \quad\text{and}\quad R_2 : x'x > \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2.$$
Define $P(2 \mid 1)$ as the probability that $x$ comes from population $\pi_1$ but falls in the region of classification into population $\pi_2$, and $P(1 \mid 2)$ as the probability that $x$ comes from population $\pi_2$ but falls in the region of classification into population $\pi_1$. Then
$$P(2 \mid 1) = P\Big(x'x > \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \,\Big|\, x'x \sim \chi^2_2\Big) = 1 - F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big),$$
$$P(1 \mid 2) = P\Big(x'x \le \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \,\Big|\, x'x \sim \sigma^2\chi^2_2\Big) = F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big),$$
and the probability of misclassification is
$$\Delta = \frac{1}{2}\Big[1 - F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big) + F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big)\Big].$$

2. When $\sigma < 1$, the regions of classification are
$$R_1 : x'x \ge \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \quad\text{and}\quad R_2 : x'x < \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2,$$
so that
$$P(2 \mid 1) = P\Big(x'x < \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \,\Big|\, x'x \sim \chi^2_2\Big) = F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big),$$
$$P(1 \mid 2) = P\Big(x'x \ge \frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2 \,\Big|\, x'x \sim \sigma^2\chi^2_2\Big) = 1 - F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big).$$
The probability of misclassification is
$$\Delta = \frac{1}{2}\Big[1 + F\Big(\frac{2\sigma^2}{\sigma^2 - 1}\log_e \sigma^2\Big) - F\Big(\frac{2}{\sigma^2 - 1}\log_e \sigma^2\Big)\Big].$$
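The two branches of Theorem 1 can be checked numerically. The sketch below evaluates the closed-form error and compares it with a direct Monte Carlo estimate of the same probabilities; the value of $\sigma^2$ and the simulation size are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

# Theorem 1 (d = 2): Bayes error for N(0, I) vs N(0, sigma^2 I), equal priors.
def theorem1_error(sigma2):
    # k is the classification threshold on x'x derived in the proof
    k = 2.0 * sigma2 / (sigma2 - 1.0) * np.log(sigma2)
    F = chi2(df=2).cdf
    if sigma2 > 1.0:
        return 0.5 * (1.0 - F(k) + F(k / sigma2))
    return 0.5 * (1.0 + F(k) - F(k / sigma2))

# Monte Carlo check of the same quantity
rng = np.random.default_rng(1)
def mc_error(sigma2, n=400_000):
    k = 2.0 * sigma2 / (sigma2 - 1.0) * np.log(sigma2)
    q1 = np.sum(rng.standard_normal((n, 2)) ** 2, axis=1)            # x'x under pi_1
    q2 = sigma2 * np.sum(rng.standard_normal((n, 2)) ** 2, axis=1)   # x'x under pi_2
    if sigma2 > 1.0:   # R1: x'x <= k
        return 0.5 * (np.mean(q1 > k) + np.mean(q2 <= k))
    return 0.5 * (np.mean(q1 < k) + np.mean(q2 >= k))   # R1: x'x >= k

print(theorem1_error(4.0), mc_error(4.0))   # the two values should agree closely
```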

The results in Theorem 1 are compared with empirical results based on simulation. The procedure for the simulation follows Section 3. The mean vectors and covariance matrices of the competing distributions are taken to be $\mu_1 = \mu_2 = (0, 0)'$, $\Sigma_1 = I$, $\Sigma_2 = \sigma^2 I$ for $\sigma^2 \ne 1$. The numerical results are presented in Figure 1(b).

Theorem 2. Suppose the conditions of Theorem 1 hold and take $\mu_1 = (0, 0)'$, $\mu_2 = (\delta, 0)'$, $\Sigma_1 = I$ and $\Sigma_2 = \sigma^2 I$. Then
$$\Delta = \begin{cases} p_1 P(\chi^2_{f_1} > k c_1) + p_2 P(\chi^2_{f_2} < k c_2), & \text{for } \sigma > 1,\\ p_1 P(\chi^2_{f_1} < k c_1) + p_2 P(\chi^2_{f_2} > k c_2), & \text{for } \sigma < 1,\end{cases}$$
where the threshold $k$, the constants $c_i$ and the degrees of freedom $f_i$, $i = 1, 2$, depend on $\delta$ and $\sigma^2$ and are as given in Gilbert (1969). The proof of Theorem 2 follows from Gilbert (1969).

3. Numerical Examples

As an illustration of the actual error rates of LDA and QDA, we present a small simulation study. Let us consider the two populations $\pi_1$ and $\pi_2$ to be bivariate spherically symmetric with centres of symmetry $\mu_1 = (0, 0)'$ and $\mu_2 = (\delta, 0)'$ respectively. The sample sizes for $X_1, \ldots, X_n$ from $\pi_1$ and $Y_1, \ldots, Y_m$ from $\pi_2$ are taken to be $n = m = 100$. We simulate a new random sample $Z_1, \ldots, Z_m$ from $\pi_1$ and $Z_{m+1}, \ldots, Z_{2m}$ from $\pi_2$ with $m = 100$ and estimate the actual error rates by the proportion of misclassifications in $Z_1, \ldots, Z_{2m}$, averaged over the simulation replications.

Figure 1 presents the comparison between the theoretical and simulation results described above for the misclassification probabilities of linear and quadratic classifiers under location shift and scale shift, as described in Section 2. It is clearly seen that the sample estimate of the probability of misclassification is a good approximation to its population version. As expected, in the location-shift case the error rate is nearly 0.5 when $\delta = 0$, and it decreases as $\delta$ moves away from 0 and the separation between the populations increases.

Consider the nonlinear case where $\mu_1 = (0, 0)'$, $\mu_2 = (\delta, 0)'$, $\Sigma_1 = I$ and $\Sigma_2 = \sigma^2 I$. Figure 2 shows that the error rate is affected by the difference in both the location and the scale parameters of the competing populations. For $\sigma > 1$, the error rate for scale $\sigma^2$ is smaller than that for scale $\sigma^{-2}$. It can also be seen from Figure 2 that the error rates for $\sigma^2$ and $\sigma^{-2}$ are the same when the medians of the symmetric distributions coincide ($\delta = 0$). Moreover, the misclassification rate decreases as $\sigma^2$ moves farther from 1.
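A minimal implementation of this simulation protocol for QDA might look as follows. The plug-in QDA follows rule (1); the number of replications and the $(\delta, \sigma^2)$ configuration are illustrative choices.

```python
import numpy as np

# Sketch of the Section 3 protocol: train QDA on n = m = 100 points per class,
# then estimate the actual error rate on a fresh sample of 100 + 100 points.
rng = np.random.default_rng(2)

def qda_fit(X, Y):
    # plug-in estimates of (mu_j, Sigma_j) for rule (1)
    return [(Z.mean(axis=0), np.cov(Z, rowvar=False)) for Z in (X, Y)]

def qda_predict(Z, params):
    scores = []
    for mu, S in params:
        d = Z - mu
        Sinv = np.linalg.inv(S)
        # D_j(x) = (x - mu_j)' Sigma_j^{-1} (x - mu_j) + log|Sigma_j|
        scores.append(np.einsum('ij,jk,ik->i', d, Sinv, d) + np.log(np.linalg.det(S)))
    return np.argmin(np.stack(scores, axis=1), axis=1)

delta, sigma2, n = 1.0, 4.0, 100
errors = []
for _ in range(200):                               # illustrative replication count
    X = rng.multivariate_normal([0, 0], np.eye(2), size=n)
    Y = rng.multivariate_normal([delta, 0], sigma2 * np.eye(2), size=n)
    params = qda_fit(X, Y)
    Z1 = rng.multivariate_normal([0, 0], np.eye(2), size=n)
    Z2 = rng.multivariate_normal([delta, 0], sigma2 * np.eye(2), size=n)
    labels = qda_predict(np.vstack([Z1, Z2]), params)
    errors.append(np.mean(labels != np.repeat([0, 1], n)))
print(np.mean(errors))   # empirical counterpart of the theoretical error
```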

Fig. 1: Misclassification probability, theoretical versus simulation: (a) LDA, error rate against $\delta$; (b) QDA, error rate against $\sigma^2$.

Theoretically, LDA and QDA cannot be used for discriminating between distributions whose first and second moments do not exist. Furthermore, the presence of an outlying training sample point will affect the performance of LDA and QDA. Hence, both linear and quadratic classifiers are not robust against outliers and extreme values. However, Hubert and Van Driessen (2004) proposed replacing the estimates of $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$ in Equation (1) by the reweighted minimum covariance determinant (MCD) estimator of multivariate location and scatter, based on the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999).
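A sketch of this robust plug-in idea, using the FAST-MCD implementation available in scikit-learn (`sklearn.covariance.MinCovDet`); the contaminated training data and the test point are illustrative.

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Replace the sample mean and covariance in rule (1) with reweighted MCD
# estimates of location and scatter. A few gross outliers are added to the
# first training sample to illustrate the point of robustification.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], np.eye(2), size=100)
X[:5] += 20.0                                        # contamination
Y = rng.multivariate_normal([2, 0], np.eye(2), size=100)

params = []
for Z in (X, Y):
    mcd = MinCovDet(random_state=0).fit(Z)           # robust location/scatter
    params.append((mcd.location_, mcd.covariance_))

def robust_qda_score(x, mu, S):
    d = x - mu
    return d @ np.linalg.solve(S, d) + np.log(np.linalg.det(S))

x_new = np.array([0.5, 0.0])
scores = [robust_qda_score(x_new, mu, S) for mu, S in params]
print(int(np.argmin(scores)))    # 0: assign x_new to pi_1
```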

Fig. 2: Plot of error rate against $\delta$ for the location-scale shift problem using QDA, for several values of $\sigma^2$.

4. Theoretical Bayes Risk - Location Shift

We want to derive the misclassification probability associated with the Bayes rule for some competing distributions with location shift in a two-class problem. The distributions considered are the multivariate $t$ distribution with $k$ degrees of freedom and the multivariate Laplace distribution. For multivariate normal distributions, the probability of misclassification associated with the Bayes rule is discussed in Section 2 above.

Multivariate t distributions

Let $Z \sim N_d(0, \Sigma)$ and $U \sim \chi^2_k$ be independent, where $k$ is the degrees of freedom of the chi-squared distribution. Define $X = Z/\sqrt{U/k} + \mu$. The distribution of $X$ is the multivariate $t$ distribution with $k$ degrees of freedom, denoted by $t(k, \mu, \Sigma)$. The probability density function of $x$ is
$$f(x) = \frac{\Gamma\big(\frac{k+d}{2}\big)}{(k\pi)^{d/2}\,\Gamma\big(\frac{k}{2}\big)\,|\Sigma|^{1/2}}\, \Big\{1 + \frac{1}{k}(x - \mu)'\Sigma^{-1}(x - \mu)\Big\}^{-\frac{k+d}{2}}. \tag{4}$$
Suppose $\pi_1$ has distribution $t(k, \mu_1, \Sigma)$ with probability density function $f_1(x)$ and $\pi_2$ has distribution $t(k, \mu_2, \Sigma)$ with probability density function $f_2(x)$. Under the assumption of equal prior probabilities, the Bayes rule is to assign $x$ to $\pi_1$ if $f_1(x) > f_2(x)$, which is equivalent to assigning $x$ to $\pi_1$ if
$$(x - \mu_1)'\Sigma^{-1}(x - \mu_1) < (x - \mu_2)'\Sigma^{-1}(x - \mu_2).$$
This holds since the competing distributions have the same degrees of freedom. The expression above is equivalent to
$$x'\Sigma^{-1}(\mu_1 - \mu_2) - \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) > 0.$$
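The stochastic representation $X = Z/\sqrt{U/k} + \mu$ gives a direct way to sample from $t(k, \mu, \Sigma)$, as in the following sketch (parameter values illustrative):

```python
import numpy as np

# X = Z / sqrt(U/k) + mu with Z ~ N_d(0, Sigma) and U ~ chi^2_k independent,
# so X ~ t(k, mu, Sigma).
rng = np.random.default_rng(4)
k, mu = 5.0, np.array([1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

n = 100_000
Z = rng.multivariate_normal([0, 0], Sigma, size=n)
U = rng.chisquare(k, size=n)
X = Z / np.sqrt(U / k)[:, None] + mu

print(X.mean(axis=0))            # ~ mu (the mean exists since k > 1)
print(np.cov(X, rowvar=False))   # ~ k/(k-2) * Sigma for k > 2
```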

Using the representation $x = \sqrt{k/u}\,z + \mu$, where $z$ is distributed as $N_d(0, \Sigma)$, the rule can also be written as
$$\sqrt{\frac{k}{u}}\, z'\Sigma^{-1}(\mu_1 - \mu_2) + \Big(\mu - \frac{\mu_1 + \mu_2}{2}\Big)'\Sigma^{-1}(\mu_1 - \mu_2) > 0.$$
If $x$ is from $\pi_1$, then $\mu = \mu_1$; similarly, if $x$ is from $\pi_2$, then $\mu = \mu_2$. It follows that
$$\begin{aligned} P(2 \mid 1) &= P\Big(\sqrt{\tfrac{k}{u}}\, z'\Sigma^{-1}(\mu_1 - \mu_2) + \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) < 0\Big)\\ &= P\Big(z'\Sigma^{-1}(\mu_1 - \mu_2) < -\tfrac{1}{2}\sqrt{\tfrac{u}{k}}\,(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\Big). \end{aligned}$$
This holds because $u$ takes values in $(0, \infty)$. For either population, $E[z'\Sigma^{-1}(\mu_1 - \mu_2)] = 0$ and $\mathrm{var}[z'\Sigma^{-1}(\mu_1 - \mu_2)] = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$; hence
$$P(2 \mid 1) = P\Big(R < -\tfrac{1}{2}\sqrt{\tfrac{u}{k}}\,\big[(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big]^{1/2}\Big) = P(R < c_1) = \int_0^\infty \Phi(c_1)\, f_u(u)\, du,$$
where
$$R = \frac{z'\Sigma^{-1}(\mu_1 - \mu_2) - E[z'\Sigma^{-1}(\mu_1 - \mu_2)]}{\big[\mathrm{var}(z'\Sigma^{-1}(\mu_1 - \mu_2))\big]^{1/2}}$$
is a standard normal random variable, $\Phi$ is the distribution function of the standard normal distribution, $f_u$ is the probability density function of $\chi^2_k$, $u$ is the chi-squared distributed random variable, and
$$c_0 = \big[(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big]^{1/2}, \quad c_1 = -\frac{1}{2}\sqrt{\frac{u}{k}}\, c_0, \quad c_2 = \frac{1}{2}\sqrt{\frac{u}{k}}\, c_0.$$
Similarly,
$$\begin{aligned} P(1 \mid 2) &= P\Big(\sqrt{\tfrac{k}{u}}\, z'\Sigma^{-1}(\mu_1 - \mu_2) - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) > 0\Big)\\ &= P\Big(z'\Sigma^{-1}(\mu_1 - \mu_2) > \tfrac{1}{2}\sqrt{\tfrac{u}{k}}\,(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\Big)\\ &= P(R > c_2) = \int_0^\infty \big[1 - \Phi(c_2)\big] f_u(u)\, du. \end{aligned}$$
The probability of misclassification associated with the Bayes rule, denoted by $\Delta_B$, is
$$\Delta_B = p_1 P(2 \mid 1) + p_2 P(1 \mid 2) = p_1 \int_0^\infty \Phi(c_1)\, f_u(u)\, du + p_2 \int_0^\infty \big[1 - \Phi(c_2)\big] f_u(u)\, du.$$
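The Bayes risk above reduces to a one-dimensional integral against the $\chi^2_k$ density, which is easy to evaluate numerically. A sketch with illustrative parameters follows; note that $P(1 \mid 2) = P(2 \mid 1)$ here, since $1 - \Phi(c_2) = \Phi(c_1)$.

```python
import numpy as np
from scipy.stats import chi2, norm
from scipy.integrate import quad

# Bayes error for t(k, mu1, Sigma) vs t(k, mu2, Sigma), equal priors,
# as the integral of Phi(c1(u)) against the chi^2_k density.
k = 5.0
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
Sigma = np.eye(2)
diff = mu1 - mu2
c0 = np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # Mahalanobis distance

def integrand(u):
    c1 = -0.5 * np.sqrt(u / k) * c0
    return norm.cdf(c1) * chi2.pdf(u, df=k)

bayes_error, _ = quad(integrand, 0.0, np.inf)
# With p1 = p2 and P(1|2) = P(2|1), the integral itself is the Bayes risk.
print(bayes_error)
```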

Multivariate Laplace distributions

Suppose the distribution of $X \in \mathbb{R}^d$ is the multivariate Laplace distribution $L(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are the mean vector and covariance matrix of the distribution respectively. The probability density function of $x$ is of the form
$$f(x) \propto e^{-\sqrt{(x - \mu)'\Sigma^{-1}(x - \mu)}}.$$
Without loss of generality, let $d = 2$, $r \sim \mathrm{Gamma}(d)$, $\theta \sim \mathrm{Uniform}(0, 2\pi)$ and $\mu = (\mu_1, \mu_2)'$. Define
$$Z_1 = r\cos\theta, \quad Z_2 = r\sin\theta, \quad Z = \binom{Z_1}{Z_2}.$$
Then $X = \Sigma^{1/2} Z + \mu$ has the bivariate Laplace distribution $BL(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are the mean and covariance of the distribution respectively. It follows that $X - \mu = \Sigma^{1/2} Z$.

Suppose populations $\pi_1$ and $\pi_2$ have distribution functions $BL(\mu_1, \Sigma)$ and $BL(\mu_2, \Sigma)$ respectively. If $x \in \pi_1$, then $Z = \Sigma^{-1/2}(X - \mu_1)$, $(x - \mu_1)'\Sigma^{-1}(x - \mu_1) = z'z = r^2$ with $r \sim \mathrm{Gamma}(d)$, and $(x - \mu_2)'\Sigma^{-1}(x - \mu_2) \ne r^2$ except when $\mu_1 = \mu_2$, where $d = 2$. Similarly, if $x \in \pi_2$, then $Z = \Sigma^{-1/2}(X - \mu_2)$, $(x - \mu_2)'\Sigma^{-1}(x - \mu_2) = z'z = r^2$ with $r \sim \mathrm{Gamma}(d)$, and $(x - \mu_1)'\Sigma^{-1}(x - \mu_1) \ne r^2$ except when $\mu_1 = \mu_2$. It follows that
$$\log\frac{f_1(x)}{f_2(x)} = -\sqrt{(x - \mu_1)'\Sigma^{-1}(x - \mu_1)} + \sqrt{(x - \mu_2)'\Sigma^{-1}(x - \mu_2)}.$$
Assuming equal prior probabilities, and for $x, \mu_1, \mu_2 \in \mathbb{R}^d$ and $d \ge 2$, the separating hyperplane between $\pi_1$ and $\pi_2$ can be written as
$$\sqrt{(x - \mu_1)'\Sigma^{-1}(x - \mu_1)} = \sqrt{(x - \mu_2)'\Sigma^{-1}(x - \mu_2)}.$$
This is equivalent to
$$x'\Sigma^{-1}(\mu_1 - \mu_2) = \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2).$$
It follows that if $x$ is distributed as population $\pi_1$, then $x'\Sigma^{-1}(\mu_1 - \mu_2) = \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$ implies $(x - \mu_1)'\Sigma^{-1}(\mu_1 - \mu_2) = -\frac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$, which can be written as $z'a = -\frac{1}{2}a'a$, where $a = \Sigma^{-1/2}(\mu_1 - \mu_2)$ and $z$ is a standard multivariate Laplace random vector. Kotz et al. (2001) have shown that a linear combination of the components of a standard multivariate Laplace random vector has a univariate symmetric Laplace distribution $L(0, \sigma_l)$. That is, $w = a'z$ has a univariate Laplace distribution with mean 0 and variance $\sigma_l^2$, where $\sigma_l^2 = \mathrm{var}(a'z)$ and $a$ is a vector of real constants. Similarly, if $x$ is distributed as population $\pi_2$, the separating hyperplane remains unchanged.
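The polar construction above can be sampled directly; a minimal sketch (parameter values illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

# Bivariate Laplace sample via the polar construction:
# r ~ Gamma(d), theta ~ Uniform(0, 2*pi), Z = (r cos(theta), r sin(theta)),
# X = Sigma^{1/2} Z + mu.
rng = np.random.default_rng(5)
d = 2
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])

n = 100_000
r = rng.gamma(shape=d, scale=1.0, size=n)
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
Z = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
X = Z @ sqrtm(Sigma).real + mu   # Sigma^{1/2} is symmetric, so Z @ S^{1/2} works row-wise

print(X.mean(axis=0))            # ~ mu
```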

Observe that $f_1(x) > f_2(x)$ implies $(x - \mu_1)'\Sigma^{-1}(x - \mu_1) < (x - \mu_2)'\Sigma^{-1}(x - \mu_2)$, that is, $x'\Sigma^{-1}(\mu_1 - \mu_2) > \frac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$. It follows that
$$\begin{aligned} P(2 \mid 1) &= P\Big(x'\Sigma^{-1}(\mu_1 - \mu_2) < \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) \,\Big|\, x \in \pi_1\Big)\\ &= P\Big(x'\Sigma^{-1}(\mu_1 - \mu_2) - \mu_1'\Sigma^{-1}(\mu_1 - \mu_2) < \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) - \mu_1'\Sigma^{-1}(\mu_1 - \mu_2)\Big)\\ &= P\big(z'a < -\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big)\\ &= P\big(w < -\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big) = F\big(-\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big), \end{aligned}$$
where $F$ is the distribution function of the one-dimensional symmetric Laplace distribution $L(0, c_0)$ with $c_0 > 0$. Similarly,
$$\begin{aligned} P(1 \mid 2) &= P\Big(x'\Sigma^{-1}(\mu_1 - \mu_2) > \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) \,\Big|\, x \in \pi_2\Big)\\ &= P\Big(x'\Sigma^{-1}(\mu_1 - \mu_2) - \mu_2'\Sigma^{-1}(\mu_1 - \mu_2) > \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2) - \mu_2'\Sigma^{-1}(\mu_1 - \mu_2)\Big)\\ &= P\big(z'a > \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big)\\ &= P\big(w > \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big) = 1 - F\big(\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big), \end{aligned}$$
where $F$ is as defined above. The Bayes probability of misclassifying $x$ into either $\pi_1$ or $\pi_2$, denoted by $\Delta_B$, is
$$\Delta_B = p_1 P(2 \mid 1) + p_2 P(1 \mid 2) = p_1 F\big(-\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big) + p_2\big[1 - F\big(\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big)\big],$$
where $p_1 + p_2 = 1$. Suppose $G$ is a Laplace distribution function which is symmetric about 0; then $G(-c) = 1 - G(c)$ for all $c \in \mathbb{R}$. Hence
$$\Delta_B = p_1 F\big(-\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big) + p_2 F\big(-\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big) = F\big(-\tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big). \tag{5}$$
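Evaluating (5) requires only the univariate Laplace distribution function. In the sketch below the scale of $w = a'z$ is matched by its variance; taking $\mathrm{var}(a'z) = a'a$, as for a standardised $z$ with identity covariance, is an assumption of this sketch, and the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import laplace

# Sketch of (5): Bayes error F(-(1/2) (mu1-mu2)' Sigma^{-1} (mu1-mu2)), with
# F the cdf of the univariate symmetric Laplace law of w = a'z.
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
Sigma = np.eye(2)
diff = mu1 - mu2
m = diff @ np.linalg.solve(Sigma, diff)   # (mu1-mu2)' Sigma^{-1} (mu1-mu2) = a'a

# Assumption: var(a'z) = a'a = m. A Laplace(0, b) law has variance 2*b^2,
# so b = sqrt(m / 2) under this convention.
b = np.sqrt(m / 2.0)
print(laplace.cdf(-0.5 * m, loc=0.0, scale=b))
```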

5. Concluding Remarks

In this paper, we have considered probabilities of misclassification between two populations only; the results can, however, easily be extended to more than two populations. The optimal performance of linear and quadratic discriminant functions is investigated, and solutions of some theoretical examples are provided. The theoretical probabilities of misclassification are compared with empirical error rates based on simulation when the competing populations differ in location and scale. The sample estimates of the probabilities of misclassification associated with LDA and QDA are good approximations of their respective population versions. We also derive expressions for the Bayes error for multivariate Laplace distributions and for multivariate t distributions with the same degrees of freedom, under location shift.

References

Anderson, T.W., 1984. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, New York.
Chang, P.C. and Afifi, A.A., 1974. Classification based on dichotomous and continuous variables. Journal of the American Statistical Association, 69.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Ghosh, A.K. and Chaudhuri, P., 2005. On maximum depth and related classifiers. Scandinavian Journal of Statistics, 32.
Gilbert, E.S., 1969. The effect of unequal variance-covariance matrices on Fisher's linear discriminant function. Biometrics, 25.
Hall, P., Titterington, D.M. and Xue, J., 2009. Median-based classifiers for high-dimensional data. Journal of the American Statistical Association, 104, 1597-1608.
Hubert, M. and Van Driessen, K., 2004. Fast and robust discriminant analysis. Computational Statistics and Data Analysis, 45.
Johnson, R.A. and Wichern, D.W., 2007. Applied Multivariate Statistical Analysis. Sixth edition, Pearson Prentice Hall, New Jersey.
Kim, K.S., Choi, H.H., Moon, C.S. and Mun, C.W., 2011. Comparison of k-nearest neighbor, quadratic discriminant and linear discriminant analysis in classification of electromyogram signals based on the wrist-motion directions. Current Applied Physics, 11.
Kotz, S., Kozubowski, T. and Podgorski, K., 2001. The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer Science+Business Media, LLC.
Krzanowski, W.J., 1977. The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics, 19.
Li, J., Cuesta-Albertos, J.A. and Liu, R.Y., 2012. DD-classifier: nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107, 737-753.
Makinde, O.S. and Chakraborty, B., 2015. On some nonparametric classifiers based on distribution functions of multivariate ranks. In Nordhausen, K. and Taskinen, S. (eds): Modern Nonparametric, Robust and Multivariate Methods, Festschrift in Honour of Hannu Oja. Springer.
Rousseeuw, P.J. and Van Driessen, K., 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.
Wald, A., 1944. On a statistical problem arising in the classification of an individual into one of two groups. Annals of Mathematical Statistics, 15, 145-162.
Welch, B.L., 1939. Note on discriminant functions. Biometrika, 31, 218-220.
