Kernel Density Estimation
1 Kernel Density Estimation and Application in Discriminant Analysis Thomas Ledl Universität Wien
2 Contents: Theory; Aspects of Application
3 [figure: sample observations]
4 Observations: which distribution do they come from?
5 [figure: candidate density estimates]
6 Kernel density estimator model: kernel K(.) and bandwidth h to choose
7 Kernel and bandwidth choices: gaussian vs. triangular kernel; small h vs. large h
8 Question 1: Which choice of K(.) and h is best for a descriptive purpose?
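As a sketch of the model behind this question, a minimal kernel density estimator with a choice of gaussian or triangular kernel and bandwidth h might look like this (illustrative Python, not code from the talk):

```python
import numpy as np

def kde(points, data, h, kernel="gaussian"):
    """fhat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) for kernel K and bandwidth h."""
    u = (np.asarray(points)[:, None] - np.asarray(data)[None, :]) / h
    if kernel == "gaussian":
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    elif kernel == "triangular":
        k = np.clip(1.0 - np.abs(u), 0.0, None)
    else:
        raise ValueError(kernel)
    return k.mean(axis=1) / h

# sample usage: estimate a standard normal density on a grid
rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
grid = np.linspace(-5, 5, 1001)
dens = kde(grid, sample, h=0.3)
```

Varying `h` on this sketch reproduces the small-h (wiggly) vs. large-h (oversmoothed) behaviour the slide illustrates.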
9 [figure: levelplots of estimated densities and the resulting classification regions]
10 Levelplot, LDA (based on the assumption of a multivariate normal distribution) [figure: classification regions]
11 [figure: LDA classification regions, continued]
12 Levelplot, KDE classifier [figure: classification regions]
13 Question 2: How does classification based on KDE perform in more than 2 dimensions?
15 Essential issues: optimization criteria; improvements of the standard model; resulting optimal choices of the model parameters K(.) and h
17 Optimization criteria: L_p-distances
18 [figure: two densities f(.) and g(.)]
20 L_1-distance = IAE (integrated absolute error); L_2-distance = ISE (integrated squared error)
22 Other ideas: minimization of the maximum vertical distance; consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995); comparing the number and position of modes
23 Overview of some minimization criteria:
- L_1-distance = IAE: difficult mathematical tractability
- L_infinity-distance = maximum deviation: does not consider the overall fit
- Modern criteria that include some measure of horizontal distances: difficult mathematical tractability
- L_2-distance = ISE, MISE, AMISE, ...: most commonly used
24 ISE, MISE, AMISE, ...: ISE is a random variable; MISE = E(ISE), the expectation of ISE; AMISE = Taylor approximation of MISE, easier to calculate. [figure: MISE = IV + ISB and AMISE = AIV + AISB as functions of log10(h)]
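In the standard notation these criteria read as follows (a reconstruction of the formulas shown on the slide, using the usual roughness and kernel-moment functionals):

```latex
\begin{align*}
\mathrm{ISE}(h)  &= \int \bigl(\hat f_h(x) - f(x)\bigr)^2 \, dx,\\
\mathrm{MISE}(h) &= \mathbb{E}\bigl[\mathrm{ISE}(h)\bigr] = \mathrm{IV} + \mathrm{ISB},\\
\mathrm{AMISE}(h) &= \underbrace{\frac{R(K)}{nh}}_{\mathrm{AIV}}
                   + \underbrace{\tfrac14\,\mu_2(K)^2\, h^4\, R(f'')}_{\mathrm{AISB}},
\end{align*}
\text{where } R(g) = \int g(x)^2\, dx \text{ and } \mu_2(K) = \int u^2 K(u)\, du.
```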
25 Essential issues: optimization criteria; improvements of the standard model; resulting optimal choices of the model parameters K(.) and h
26 The AMISE-optimal bandwidth
27 The AMISE-optimal bandwidth depends on the kernel function K(.); the kernel-dependent part is minimized by the Epanechnikov kernel
28 The AMISE-optimal bandwidth also depends on the unknown density f(.): how to proceed?
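Minimizing AMISE(h) in h gives the familiar expression (a reconstruction; it exhibits both dependencies named above, on K through R(K) and mu_2(K), and on the unknown f through R(f'')):

```latex
h_{\mathrm{AMISE}} \;=\; \left( \frac{R(K)}{\mu_2(K)^2 \, R(f'') \, n} \right)^{1/5}
```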
29 Data-driven bandwidth selection methods. Leave-one-out selectors: maximum likelihood cross-validation; least-squares cross-validation (Bowman, 1984). Criteria based on substituting R(f'') in the AMISE formula: normal rule ("rule of thumb"; Silverman, 1986); plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990); smoothed bootstrap.
31 Least-squares cross-validation (LSCV): the undisputed selector in the 1980s; gives an unbiased estimator for the ISE; suffers from more than one local minimizer, with no agreement about which one to use; bad convergence rate for the resulting bandwidth h_opt.
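A sketch of the LSCV criterion for a Gaussian kernel (illustrative Python, not the talk's implementation): it evaluates LSCV(h) = int fhat_h^2 - (2/n) * sum_i fhat_{h,-i}(x_i) and takes the minimizer over a grid of bandwidths.

```python
import numpy as np

def normal_pdf(u, s):
    """Density of N(0, s^2) at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def lscv(h, x):
    """LSCV(h) = int fhat_h^2 - (2/n) * sum_i fhat_{h,-i}(x_i), Gaussian kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]                      # pairwise differences
    # int fhat^2 has a closed form: average of N(0, 2h^2) densities at the differences
    int_f2 = normal_pdf(d, h * np.sqrt(2)).sum() / n ** 2
    # leave-one-out density estimate at each observation (drop the j = i term)
    k = normal_pdf(d, h)
    loo = (k.sum(axis=1) - normal_pdf(0.0, h)) / (n - 1)
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=200)
grid = np.linspace(0.05, 1.5, 60)
scores = np.array([lscv(h, x) for h in grid])
h_lscv = grid[int(np.argmin(scores))]                # the LSCV bandwidth
```

Plotting `scores` against `grid` for less well-behaved samples shows the multiple local minima that the slide criticizes.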
33 Normal rule ("rule of thumb"): assumes f(x) to be N(mu, sigma^2); the easiest selector; often oversmooths the function. For a Gaussian kernel the resulting bandwidth is h = 1.06 * sigma_hat * n^(-1/5).
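As a sketch (assuming the common Gaussian-kernel version of Silverman's rule with the robust scale estimate min(sample sd, IQR/1.34)):

```python
import numpy as np

def normal_rule_bandwidth(x):
    """Rule-of-thumb bandwidth h = 1.06 * sigma_hat * n^(-1/5) for a Gaussian
    kernel, with the robust scale sigma_hat = min(sample sd, IQR / 1.34)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    q75, q25 = np.percentile(x, [75, 25])
    sigma = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 1.06 * sigma * n ** (-1 / 5)
```

For clearly non-normal data this h tends to come out too large, which is the oversmoothing noted on the slide.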
35 Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990): do not substitute R(f'') in the AMISE formula directly, but estimate it via R(f^(4)), and R(f^(4)) via R(f^(6)), etc.; another parameter i to choose (the number of stages to go back), one stage is mostly sufficient; better rates of convergence; do not finally circumvent the problem of the unknown density, either.
36 The multivariate case: the scalar bandwidth h becomes H, the bandwidth matrix.
37 Issues of generalization in d dimensions: up to d(d+1)/2 bandwidth parameters (a full bandwidth matrix H) instead of one; unstable estimates; bandwidth selectors are essentially straightforward to generalize; for plug-in methods it is too difficult to give succinct expressions for d > 2 dimensions.
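A sketch of the multivariate estimator with a full bandwidth matrix H (illustrative Python; here H plays the role of a squared-bandwidth, covariance-type matrix, so H = h^2 * I recovers a product of univariate Gaussian kernels):

```python
import numpy as np

def mv_kde(points, data, H):
    """fhat(x) = (1/n) * sum_i K_H(x - x_i) with the Gaussian kernel
    K_H(u) = (2*pi)^(-d/2) * |H|^(-1/2) * exp(-u' H^{-1} u / 2)."""
    data = np.atleast_2d(data)
    n, d = data.shape
    Hinv = np.linalg.inv(H)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(H))
    out = []
    for x in np.atleast_2d(points):
        u = x - data                                  # (n, d) differences
        q = np.einsum('ij,jk,ik->i', u, Hinv, u)      # quadratic forms u' H^{-1} u
        out.append(norm * np.exp(-0.5 * q).mean())
    return np.array(out)
```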
38 Aspects of Application
39 Essential issues: curse of dimensionality; connection between goodness-of-fit and optimal classification; two methods for discriminatory purposes
41 The curse of dimensionality: the data disappears into the distribution tails in high dimensions. [figure: probability mass NOT in the tail of a multivariate normal density, as a function of the number of dimensions d] A good fit in the tails is desired!
42 The curse of dimensionality: much data is necessary to maintain a constant estimation error in high dimensions. [table: required sample size vs. dimensionality]
43 Essential issues: curse of dimensionality; connection between goodness-of-fit and optimal classification; two methods for discriminatory purposes
44 Essential issues. AMISE-optimal parameter choice: L_2-optimal; worse fit in the tails; calculation-intensive for large n. Optimal classification (in high dimensions): L_1-optimal (misclassification rate); estimation of the tails important; many observations required for a reasonable fit.
45 Essential issues: curse of dimensionality; connection between goodness-of-fit and optimal classification; two methods for discriminatory purposes
46 Method 1: Reduce the data onto a subspace which allows a somewhat accurate estimation but does not destroy too much information (a trade-off); use the multivariate kernel density concept to estimate the class densities.
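Method 1's reduction step could be sketched as a principal component projection (illustrative Python, not the talk's code):

```python
import numpy as np

def pca_project(X, k):
    """Project centered data onto its first k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigen-decomposition of the covariance matrix, sorted by decreasing variance
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    return Xc @ vecs[:, order[:k]]
```

The class densities would then be estimated by a multivariate KDE on the k-dimensional scores rather than on the raw data.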
47 Method 2: Use the univariate concept to normalize the data nonparametrically; then use the classical methods like LDA and QDA for classification. Drawback: calculation-intensive.
48 Method 2: [figure: a) estimated density f(x), its CDF F(x) and the target normal CDF G(x); b) the resulting normalizing transformation t(x)]
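One way to realize such a marginal normalization (an assumption about how t(x) is built, not taken verbatim from the talk) is t(x) = Phi^{-1}(Fhat(x)), with Fhat a kernel estimate of the CDF:

```python
import numpy as np
from math import erf, sqrt
from statistics import NormalDist

def kernel_cdf(points, data, h):
    """Fhat(x) = (1/n) * sum_i Phi((x - x_i)/h): Gaussian-kernel CDF estimate."""
    u = (np.asarray(points)[:, None] - np.asarray(data)[None, :]) / h
    return 0.5 * (1.0 + np.vectorize(erf)(u / sqrt(2))).mean(axis=1)

def normalize(points, data, h):
    """t(x) = Phi^{-1}(Fhat(x)): maps the data to approximately N(0, 1) margins."""
    F = np.clip(kernel_cdf(points, data, h), 1e-6, 1 - 1e-6)  # guard the inverse
    inv = NormalDist().inv_cdf
    return np.array([inv(p) for p in F])
```

Applied column by column, this turns a skewed marginal into a roughly Gaussian one before LDA/QDA is run.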
50 Criticism of former simulation studies: carried out decades ago; outdated parameter selectors; restriction to uncorrelated normals; fruitless estimation because of high dimensions; no dimension reduction.
51 The present simulation study: 21 datasets x 14 estimators x 2 error criteria = 588 classification scores. Many results!
53 Each dataset has 2 classes for distinction, with a fixed number of observations per class; 200 test observations, 100 produced by each class; therefore the test data is of dimension 200x10.
54 Univariate prototype distributions: normal; normal-noise small; normal-noise medium; normal-noise large; exponential; bimodal (close); bimodal (far)
55 Datasets (Nr. / abbrev. / contents):
1 / NN1 / 10 normal distributions with "small noise"
2 / NN2 / 10 normal distributions with "medium noise"
3 / NN3 / 10 normal distributions with "large noise"
4 / SkN1 / skewed (exp-)distributions and 7 normals
5 / SkN2 / skewed (exp-)distributions and normals
6 / SkN3 / 7 skewed (exp-)distributions and normals
7 / Bi1 / normals, skewed and bimodal(close) distributions
8 / Bi2 / normals, skewed and bimodal(close) distributions
9 / Bi3 / 8 skewed and bimodal(far) distributions
10 / Bi4 / 8 skewed and bimodal(far) distributions
10 datasets with equal covariance matrices + 10 datasets with unequal covariance matrices + 1 insurance dataset = 21 datasets total
57 The 14 estimators. Method 1 (multivariate density estimator): principal component reduction onto subspaces of four different dimensions (4) x multivariate normal rule and multivariate LSCV criterion, resp. (2) = 8 estimators. Method 2 (marginal normalizations): univariate normal rule and Sheather-Jones plug-in (2) x subsequent LDA and QDA (2) = 4 estimators. Classical methods: LDA and QDA = 2 estimators.
59 Misclassification criteria: the classical misclassification rate ("error rate"); the Brier score.
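The Brier score for class-probability forecasts can be sketched as the mean squared distance between the predicted class probabilities and the one-hot indicator of the true class (illustrative Python):

```python
import numpy as np

def brier_score(prob, y):
    """Mean over observations of sum_k (p_k - 1{y = k})^2, where prob has one
    row of class probabilities per observation and y holds the true labels."""
    prob = np.asarray(prob, dtype=float)
    onehot = np.eye(prob.shape[1])[np.asarray(y)]
    return np.mean(np.sum((prob - onehot) ** 2, axis=1))
```

Unlike the plain error rate, this score also rewards well-calibrated probabilities, not just correct hard assignments.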
61 Results: the choice of the misclassification criterion is not essential. [figure: scatterplot of error rate vs. Brier score]
62 Results: the choice of the multivariate bandwidth parameter (method 1) is not essential in most cases; superiority of LSCV in the case of bimodals having unequal covariance matrices. [figure: error rates, LSCV vs. "normal rule"]
63 Results: the choice of the univariate bandwidth parameter (method 2) is not essential. [figure: error rates, Sheather-Jones selector vs. "normal rule"]
64 Results: the best trade-off is a projection onto a few dimensions. [figure: error rate for different subspace dimensions, for NN-, SkN- and Bi-distributions]
65 Results, equal covariance matrices: method 2 sometimes improves slightly on LDA; the multivariate KDE classifier (method 1) performs inferior. [figure: error rates per dataset for classical LDA, method 1 (normal rule, LSCV) and method 2]
66 Results, unequal covariance matrices: method 2 often improves on QDA quite essentially; method 1 performs quite poorly, but not for skewed distributions. [figure: error rates per dataset for classical QDA, method 1 (normal rule, LSCV) and method 2]
67 Results: is the additional calculation time justified? Required calculation time increases from LDA/QDA, over the multivariate "normal rule", to the preliminary univariate normalizations, LSCV and the Sheather-Jones plug-in.
69 Conclusions (1/3), classification performance: restriction to only a few dimensions; improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices); poor performance of the multivariate kernel density classifier; LDA is undisputed in the case of equal covariance matrices and equal prior probabilities; the additional computation time seems not to be justified.
74 Conclusions (2/3), KDE for data description: great variety in error criteria, parameter selection procedures and additional model improvements; no consensus about a feasible error criterion; nobody knows what is finally optimized (upper bounds in L_1-theory; in L_2-theory ISE vs. MISE vs. AMISE; several minima in LSCV, ...); different parameter selectors are of varying quality with respect to different underlying densities.
78 Conclusions (3/3), theory vs. application: comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification; for discriminatory purposes the issue of estimating log-densities is much more important; some univariate model improvements are not generalizable; the widely ignored curse of dimensionality forces the user into a trade-off between necessary dimension reduction and information loss. Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time.
83 The End
Published as: Thomas Ledl, "Kernel Density Estimation: Theory and Application in Discriminant Analysis", Austrian Journal of Statistics, Volume 33 (2004), Number 3, 267-279. Department of Statistics and Decision Support Systems.
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationPrincipal component analysis
Principal component analysis Motivation i for PCA came from major-axis regression. Strong assumption: single homogeneous sample. Free of assumptions when used for exploration. Classical tests of significance
More informationarxiv: v2 [stat.me] 13 Sep 2007
Electronic Journal of Statistics Vol. 0 (0000) ISSN: 1935-7524 DOI: 10.1214/154957804100000000 Bandwidth Selection for Weighted Kernel Density Estimation arxiv:0709.1616v2 [stat.me] 13 Sep 2007 Bin Wang
More informationLecture Notes 15 Prediction Chapters 13, 22, 20.4.
Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationLinear Decision Boundaries
Linear Decision Boundaries A basic approach to classification is to find a decision boundary in the space of the predictor variables. The decision boundary is often a curve formed by a regression model:
More informationAcceleration of some empirical means. Application to semiparametric regression
Acceleration of some empirical means. Application to semiparametric regression François Portier Université catholique de Louvain - ISBA November, 8 2013 In collaboration with Bernard Delyon Regression
More informationarxiv: v1 [stat.me] 25 Mar 2019
-Divergence loss for the kernel density estimation with bias reduced Hamza Dhaker a,, El Hadji Deme b, and Youssou Ciss b a Département de mathématiques et statistique,université de Moncton, NB, Canada
More informationPivot Selection Techniques
Pivot Selection Techniques Proximity Searching in Metric Spaces by Benjamin Bustos, Gonzalo Navarro and Edgar Chávez Catarina Moreira Outline Introduction Pivots and Metric Spaces Pivots in Nearest Neighbor
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationModelling Non-linear and Non-stationary Time Series
Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September
More informationBayesian Adaptive Bandwidth Kernel Density Estimation of Irregular Multivariate Distributions
Bayesian Adaptive Bandwidth Kernel Density Estimation of Irregular Multivariate Distributions Shuowen Hu, D. S. Poskitt, Xibin Zhang Department of Econometrics and Business Statistics, Monash University,
More informationNonparametric Econometrics
Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-
More informationFrontier estimation based on extreme risk measures
Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2
More informationData-Based Choice of Histogram Bin Width. M. P. Wand. Australian Graduate School of Management. University of New South Wales.
Data-Based Choice of Histogram Bin Width M. P. Wand Australian Graduate School of Management University of New South Wales 13th May, 199 Abstract The most important parameter of a histogram is the bin
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationBayesian estimation of bandwidths for a nonparametric regression model with a flexible error density
ISSN 1440-771X Australia Department of Econometrics and Business Statistics http://www.buseco.monash.edu.au/depts/ebs/pubs/wpapers/ Bayesian estimation of bandwidths for a nonparametric regression model
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationNonparametric Estimation of Luminosity Functions
x x Nonparametric Estimation of Luminosity Functions Chad Schafer Department of Statistics, Carnegie Mellon University cschafer@stat.cmu.edu 1 Luminosity Functions The luminosity function gives the number
More informationRegularized Discriminant Analysis and Reduced-Rank LDA
Regularized Discriminant Analysis and Reduced-Rank LDA Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Regularized Discriminant Analysis A compromise between LDA and
More informationAnalysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example
Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More informationKernel density estimation
Kernel density estimation Patrick Breheny October 18 Patrick Breheny STA 621: Nonparametric Statistics 1/34 Introduction Kernel Density Estimation We ve looked at one method for estimating density: histograms
More informationAn Introduction to Multivariate Statistical Analysis
An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents
More informationEstimation for nonparametric mixture models
Estimation for nonparametric mixture models David Hunter Penn State University Research supported by NSF Grant SES 0518772 Joint work with Didier Chauveau (University of Orléans, France), Tatiana Benaglia
More informationClassification 2: Linear discriminant analysis (continued); logistic regression
Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:
More informationDIFFERENTIATION. MICROECONOMICS Principles and Analysis Frank Cowell. July 2017 Frank Cowell: Differentiation
DIFFERENTIATION MICROECONOMICS Principles and Analysis Frank Cowell 1 Overview... Differentiation Basics Basic definitions Chain rule Elasticities l Hôpital s rule 2 Definition (1) Take the univariate
More informationHypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods
Hypothesis Testing with the Bootstrap Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Bootstrap Hypothesis Testing A bootstrap hypothesis test starts with a test statistic
More informationChapter 9. Non-Parametric Density Function Estimation
9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationMotivating the Covariance Matrix
Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationWhy is the field of statistics still an active one?
Why is the field of statistics still an active one? It s obvious that one needs statistics: to describe experimental data in a compact way, to compare datasets, to ask whether data are consistent with
More informationBasic Statistical Tools
Structural Health Monitoring Using Statistical Pattern Recognition Basic Statistical Tools Presented by Charles R. Farrar, Ph.D., P.E. Los Alamos Dynamics Structural Dynamics and Mechanical Vibration Consultants
More information