Klasifikace pomocí hloubky dat nové nápady

Size: px

Start display at page:

Download "Klasifikace pomocí hloubky dat nové nápady"

Roderick Austin
5 years ago
Views:

1 Klasifikace pomocí hloubky dat nové nápady Ondřej Vencálek Univerzita Palackého v Olomouci ROBUST Rybník, 26. ledna 2018

3 Z. Fabián: Vzpomínka na Compstat 2010 aneb večeře na Pařížské radnici, Informační Bulletin ČStS, 4/2017 Nápad pořádat konferenci v Paříži se dustojně řadí k nápadu pořádat mistrovství světa v Kataru nebo letní olympiádu v Letňanech....

4 It is deep inside = it has high depth

5 It is deep inside = it has high depth Trees DEEP inside the forest

6 It is deep inside = it has high depth Trees DEEP inside the forest Trees far from the center

7 It is deep inside = it has high depth

8 Outliers in R 1 outlier

9 Outliers in R 2 any idea?

10 1st idea Mahalanobist distance

11 1st idea Mahalanobist distance non-elliptical case

12 1st idea Mahalanobist distance non-elliptical case

13 Halfspace depth in R 1 D(x) = min {P(X x), P(X x)} Example: univariate normal distribution N(µ, σ 2 ) density(x) outliers the deepest point (median) decreasing depth decreasing depth outliers

14 Halfspace depth in R d projection pursuit approach D(x) = inf u: u =1 D(uT x). x 2? x 1

15 The halfspace depth of a point x R d with respect to a probability measure P on R d is defined as the minimum probability mass carried by any closed halfspace containing x, that is D(x; P) = inf {P(H) : H a closed halfspace, x H} The depth provides so called central-outward ordering of data The deepest point extends the notion of median into a higher dimensions.

16 Problem of classification The Bayes classifier class(x) = arg max f i (x)π i. i produces smallest possible number of misclassified points (total over all groups)

17 When Bayes misclassifies the centre of the distribution... P 1 = N(0, 1), P 2 = N(1, 1) π 1 = 0.7, π 2 = 0.3 posterior x

18 Hey, Bayes, be fair to x 2! lognormal distribution from N(0, 1) 0.6 f(x 1 ) density f(x 2 ) x x x quantile 0.05 = x 1, f (x 1 ) = 0.53 quantile 0.95 = x 2, f (x 2 ) = 0.02

19 Basics in decision theory Distributions P i defined on R d, with densities f i and prior probabilities π i, i = 1,..., K. A classifier divides the space R d into K disjoint parts A i, i = 1,..., K, K i=1 A i = R d such that any x R d is assigned to P i iff x A i. Cost function c ij (x) = { c i (x) if j i, 0 if j = i. Total cost min K i=1 j i A j c i (x)f i (x)π i dx. Optimal classifier class(x) = arg max c i (x)f i (x)π i. i

20 Bayes classifier and its 2 new competitors class(x) = arg max c i (x)f i (x)π i. i 1. Bayes classifier: c i (x) = 1 class B (x) = arg max f i (x)π i i 2. Depth-weighted classifier: c i (x) = D(x; P i ) class D (x) = arg max D(x; P i )f i (x)π i i 3. Rank-weighted classifier: c i (x) = F i (D(x; P i )), where F i is CDF of D(X ; P i ), X P i class R (x) = arg max F i (D(x; P i ))f i (x)π i i

21 Toy example P 1 = Unif [0, 100], π 1 = 0.5, P 2 = Unif [50, 250], π 2 = 0.5. Classification on the overlapping part of supports: Bayes classifier: Depth-weighted classifier: class B (x) = 1 for x [50, 100] class D (x) = { 1 for x [50, 90), 2 for x (90, 100].

23 The first question (or two) To what extent do the new (depth-weighted) classifiers differ from the Bayes classifier? How large might the corresponding difference in the average misclassification rate be?

24 Difference in the case of elliptical symmetry Let us assume: (P1): f i (x) = k i g(m i (x)), where g is a strictly decreasing function, k i > 0 are constants, and M i (x) = ( (x α i ) B 1 i (x α i ) ) 1 2 with B i positive definite, α i R d, denotes the generalized distance of the point x from the center of the distribution P i. (P2) D(x) is an affine invariant depth. Then, given (P1), D i (x) = D(x; P i ) is a fixed decreasing function of generalized distance, that can be expressed as D i (x) = h(m i (x)), where h is a strictly decreasing function. Denote G(x) = k1g(m1(x)) k 2g(M 2(x)) H(x) = h(m1(x)) h(m 2(x)) π = π2 π 1 Then as likelihood ratio, as depth ratio, as inverse prior ratio The Bayes classifier assigns x to P 1 if G(x) > π, The depth-weighted classifier assigns x to P 1 if H(x)G(x) > π.

25 Difference in the case of elliptical symmetry x: M1 (x)>m2 (x) 0 H(x)G(x) G(x) π k_1 k2 x: M1 (x)<m2 (x) 0 k_1 k2 G(x) H(x)G(x) π The classifiers differ when π is between G (x) and H(x)G (x). For the fixed π, the region where the classifiers differ is RD(π) = x Rd : H(x)G (x) < π < G (x) or G (x) < π < H(x)G (x). Let (P1) and (P2) hold for P1 and P2. Then P(classB (X ) 6= classd (X )) = 0 π1 k1 = π2 k2.

26 Difference in the case of elliptical symmetry example P 1 = N( 1, 1), P 2 = N(1, 1) probability of different classification log(π 1 π 2 )

27 Next question To what extent does the performance of the new classifiers depend on the choice of depth function? Which depth function results in the smallest (biggest) differences from the Bayes classifier?

28 Difference between the depth-based classifiers using halfspace and projection depths P 1 = Unif [0, 100], π 1 = 0.5, P 2 = Unif [50, 250], π 2 = 0.5. D(x)f(x)π for the halfspace depth D(x)f(x)π for the projection depth D(x)f(x)π D(x)f(x)π x x

29 Rank-weighted classifier Depth-weighted classifier: c i (x) = D(x; P i ) class D (x) = arg max D(x; P i )f i (x)π i i Rank-weighted classifier: c i (x) = F i (D(x; P i )), where F i is CDF of D(X ; P i ), X P i class R (x) = arg max F i (D(x; P i ))f i (x)π i i Let (P1) and (P2) hold for P 1 and P 2 and any two depth functions D( ) and D ( ). Then class R (x) is the same for both D( ) and D ( ) with probability one.

30 Robustness of the newly proposed classifiers density 0 z 2z 3z z0 3z 3z+z0 4z x (1 α)π 1 for P 1, απ 1 for the contamination of P 1 π 2 for the non-contaminated P 2. Assume z 0 = z, π 2 < απ 1.

31 Robustness of the newly proposed classifiers density 0 z 2z 3z z0 3z 3z+z0 4z x (1 α)π 1 for P 1, απ 1 for the contamination of P 1 π 2 for the non-contaminated P 2. Assume z 0 = z, π 2 < απ 1. f 2 (x)π 2 < f 1 (x)π 1 for all x (2z, 4z). Misclass. rate for group 2 is 1. D 2 (x)f 2 (x)π 2 > D 1 (x)f 1 (x)π 1 for x > 2z + 2π 2 z Misclass. rate for group 2 is π 2

32 Simulation study settings Group 1 Group 2 Ex. Distribution Parameters Distribution Parameters 1 Normal 0, Σ 0 Normal 1, Σ 0 2 Normal 0, Σ 0 Normal 1, 4Σ 0 3 Cauchy 0, Σ 0 Cauchy 1, Σ 0 4 Cauchy 0, Σ 0 Cauchy 1, 4Σ 0 5 Bivar. exponen. 1, 1 Shifted bivar. expon. (+1) 1, 1 6 Bivar. exponen. 1, 1/2 Shifted bivar. expon. (+1) 1/2, 1 7 Normal 0, I Bivar. exponential 1, 1 8 Skewed normal ( 1 1 Where Σ 0 = 1 4 ). ( 12 ), ( ), ( 2 5 ) Skewed normal ( 0 1 ), ( ) ( , 15 )

33 Percentage of points classified differently than by the Bayes classifier 30 Example 1 Example 2 Example 3 Example 4 Example 5 Example 6 Example 7 Example 8 Different classification, % pi1: 0.1 pi1: 0.3 pi1: 0.5 pi1: 0.7 pi1: 0.9 classifier DW RW 0 halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial halfspaceprojection spatial depth

34 Increase of AMR versus percentage of points classified differently than by the Bayes classifier halfspace depth projection depth spatial depth difference in AMR (%) difference in AMR (%) difference in AMR (%) Percentage of points classified differently Percentage of points classified differently Percentage of points classified differently depth optimal class. rank optimal class.

35 Conclusion We suggest to weight misclassification cost according to the centrality of a misclassified point measured either by its depth w.r.t. the distribution from which it comes (depth-weighted classifier) or by its rank w.r.t. the distribution from which it comes (rank-weighted classifier). In particular cases, the new classifiers does not differ from the Bayes classifier. Simulation study showed that increase in AMR is much smaller than percentage of points classified differently. Both classifiers depend on the depth function which they are using. In particular cases rank-weighted classifier does not depend on the used depth function. Presence of the depth term in the depth-based classifiers may substantially increase robustness of the procedure.

36 Ondrej Vencalek Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacky University in Olomouc, 17. listopadu 12, Olomouc, Czech Republic

37 Happy end

Statistical Depth Function

Statistical Depth Function Machine Learning Journal Club, Gatsby Unit June 27, 2016 Outline L-statistics, order statistics, ranking. Instead of moments: median, dispersion, scale, skewness,... New visualization tools: depth contours,