Robust model based clustering: methods and applications
Francesco Dotto
Università di Roma La Sapienza
Rome, 9 February



2 Outline of the presentation
1. General motivation
2. Robust clustering algorithm based on trimming and reweighting
3. Robust clustering algorithm based on trimming and geometric constraints
4. Robust clusterwise fuzzy regression model based on trimming
Items 2 and 4: joint work with Alessio Farcomeni, Luis Angel Garcia Escudero and Agustin Mayo Iscar. Item 3: joint work with Alessio Farcomeni.

3 General motivation
The presence of outlying observations may lead to unsatisfactory results, for example:
1. Heterogeneous groups may be artificially joined together
2. Spurious clusters containing only outlying observations may be detected
3. The estimation of the cluster parameters may be inconsistent
To deal with these issues we propose some robust methods based on trimming.

4 Arising problems
Two main problems affect the existing robust methods (detailed review in [Farcomeni and Greco, 2015] and [Ritter, 2015]):
1. Tuning the parameter that sets the proportion of trimmed observations
2. Choosing a proper constraint on the scatter matrices (required to avoid the effect of spurious maximizers)

5 rtclust algorithm
For issue 1 we propose the rtclust procedure [Dotto et al., 2015b].
[Figure: scatter plots with panels "True assignments", "Initial" and "Final".]
We start from the output of a robust method (e.g. tclust [García-Escudero et al., 2008]) run with a very high trimming level and then apply reweighting.

6 rtclust approach: sketch of the algorithm
The reweighting proceeds, for pre-fixed $\alpha_1 > \alpha_2 > \dots > \alpha_L$ and for each $l = 1, \dots, L$, as follows:
1. Sort the Mahalanobis distances $d_{(1)} \le \dots \le d_{(n)}$ and take the sets
   $A = \{x_i : d_i \le d_{([n(1-\alpha_l)])}\}$ and $B = \{x_i : d_i \le \chi^2_{p,\alpha_L}\}$   (1)
2. Fix $A \cap B = \{H_1, \dots, H_K\}$, with
   $H_j = \{x_i \in A \cap B : d_{\Sigma_j^l}(x_i, m_j^l) = \min_{q=1,\dots,K} d_{\Sigma_q^l}(x_i, m_q^l)\}$   (2)
3. Update the cluster proportions and the current contamination level
4. Update the cluster centers and scatter matrices
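Steps 1 and 2 of the reweighting can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `reweight_step` is hypothetical, `d_i` is taken to be the squared Mahalanobis distance of observation i to its closest cluster, the set operation in step 2 is read as an intersection, and the chi-square threshold is passed in by the caller rather than computed.

```python
import math

def reweight_step(dist, alpha_l, chi2_threshold):
    """One pass of steps 1-2 (hypothetical helper).

    dist[i][j] is the squared Mahalanobis distance of observation i
    to cluster j under the current centers and scatter matrices.
    Returns, for each cluster, the indices of the observations kept.
    """
    n = len(dist)
    k = len(dist[0])
    # Assumption: d_i is the distance to the closest cluster.
    d = [min(row) for row in dist]
    # Order statistic d_([n(1 - alpha_l)]) defining the set A.
    keep = max(1, math.floor(n * (1 - alpha_l)))
    cutoff = sorted(d)[keep - 1]
    clusters = [[] for _ in range(k)]
    for i, row in enumerate(dist):
        in_a = d[i] <= cutoff
        in_b = d[i] <= chi2_threshold
        if in_a and in_b:
            # H_j: assign to the cluster with the smallest distance.
            clusters[row.index(min(row))].append(i)
    return clusters

# Toy distances for 4 observations and 2 clusters; 5.99 is roughly
# the 0.95 quantile of a chi-square with 2 degrees of freedom.
toy = [[0.5, 9.0], [8.0, 1.0], [30.0, 25.0], [2.0, 7.0]]
print(reweight_step(toy, alpha_l=0.25, chi2_threshold=5.99))
```

Observations left out of every returned list play the role of the trimmed ones; steps 3 and 4 would then recompute proportions, centers and scatter matrices from the kept observations.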

7 rtclust approach, 5% contaminated dataset: estimated location parameter
[Figure 1: MSE of $\hat{\mu}$ in three panels (p = 2, p = 4, p = 6) for rtclust, H&R, iterated H&R and tclust; x-axis labels rtclust33, rtclust20, HR33, HR20, HR_it33, HR_it20, tclust33, tclust20, tclust10, tclust05.]
Figure 1: Model's performance for $\mu$ estimation when p = 2, 4, 6. Whenever the value exceeds the scale of the plot it is flagged with a marker.

8 Simulation results, 5% contaminated dataset: estimated scale parameter
[Figure 2: MSE of $\hat{\Sigma}$ in three panels (p = 2, p = 4, p = 6) for rtclust, H&R, iterated H&R and tclust; x-axis labels rtclust33, rtclust20, HR33, HR20, HR_it33, HR_it20, tclust33, tclust20, tclust10, tclust05.]
Figure 2: Model's performance for $\Sigma$ estimation when p = 2, 4, 6. Whenever the value exceeds the scale of the plot it is flagged with a marker.

9 Simulation results, 5% contaminated dataset: estimated contamination level
[Figure 3: estimated $\hat{\alpha}$ in three panels (p = 2, p = 4, p = 6) for rtclust, H&R, iterated H&R and tclust; x-axis labels rtclust33, rtclust20, HR33, HR20, HR_it33, HR_it20, tclust33, tclust20, tclust10, tclust05.]
Figure 3: Model's performance for $\alpha$ estimation when p = 2, 4, 6. Whenever the value exceeds the scale of the plot it is flagged with a marker.

10 Constraining the obtained solution: target function
Generally speaking, a robust clustering procedure aims to maximize the objective function given by
$$\prod_{j=1}^{K} \prod_{i \in R_j} f(x_i; \mu_j, \Sigma_j) \prod_{i \notin R} g_{\psi_i}(x_i) \qquad (3)$$
where:
1. $f(\cdot)$ stands for the multivariate normal density
2. $g_{\psi_i}(\cdot)$ is the contaminating density, with mild probabilistic assumptions on it
3. $K$ is the number of groups
4. $R = \bigcup_{j=1}^{K} R_j$ is the set of the clean observations and is such that $\#R = [n(1-\alpha)]$

11 Constraining the obtained solution: why?
The objective function in Equation (3) is unbounded; for that reason the effect of spurious maximizers must be controlled. Spurious maximizers may be defined as parameter points having zero standard deviation for some components, and they can be generated by any small number of sample points grouped sufficiently close together. Such points make the objective function tend to infinity, so a spurious solution is returned as the output of the algorithm.
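The unboundedness can be seen numerically with a small one-dimensional sketch. It assumes a simplified version of the objective (no trimming, no contaminating density, univariate normal components), and the helper names are hypothetical. One cluster is fitted to a single point; as its standard deviation shrinks, the classification log-likelihood grows without limit.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a univariate normal with mean mu and sd sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def classification_loglik(groups, params):
    """Sum of log densities, each point evaluated in its own cluster."""
    return sum(normal_logpdf(x, mu, sigma)
               for xs, (mu, sigma) in zip(groups, params)
               for x in xs)

main_group = [-1.2, -0.4, 0.1, 0.7, 1.3]   # a regular cluster
spurious_group = [5.0]                      # a single isolated point

for sigma in (1e-1, 1e-3, 1e-6):
    value = classification_loglik(
        [main_group, spurious_group],
        [(0.1, 1.0), (5.0, sigma)],  # second cluster collapses on 5.0
    )
    print(sigma, value)
```

Each reduction of the standard deviation raises the objective, which is why a constraint on the scatter parameters is needed before maximizing.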

12 What are spurious maximizers? A graphical representation
[Figure 4: two panels, "Constrained variances" and "Unconstrained variances", with axes x and y.]
Figure 4: Effect of the constraint in the maximization of the objective function.

13 Constraining the obtained solution: some feasible solutions
Many solutions can be used to avoid extreme cases like those in Figure 4 (slide 12):
1. Constraining the ratio between the highest and the lowest eigenvalue of the scatter matrices:
   $$\frac{M_n}{m_n} = \frac{\max_{j=1,\dots,K} \max_{h=1,\dots,p} \lambda_h(\Sigma_j)}{\min_{j=1,\dots,K} \min_{h=1,\dots,p} \lambda_h(\Sigma_j)} \le c, \quad \text{with } c \in \mathbb{R} \qquad (4)$$
   Advantages: interpretability, easy to implement.
   Disadvantages: when the constraint is imposed, the scale invariance of the obtained estimators is lost.
2. Inserting constraints on the estimated cluster shapes
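Checking the eigenvalue-ratio constraint for a candidate solution is straightforward. The sketch below is restricted to 2x2 symmetric scatter matrices so that the eigenvalues have a closed form and no external library is needed; the function names are hypothetical.

```python
import math

def eigenvalues_2x2(sigma):
    """Eigenvalues of a symmetric matrix [[a, b], [b, c]], largest first."""
    (a, b), (_, c) = sigma
    mean = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    return mean + half_gap, mean - half_gap

def eigenvalue_ratio(scatter_matrices):
    """Largest eigenvalue over smallest eigenvalue, across all clusters."""
    eigs = [ev for sigma in scatter_matrices for ev in eigenvalues_2x2(sigma)]
    return max(eigs) / min(eigs)

def satisfies_constraint(scatter_matrices, c):
    """True when the ratio M_n / m_n does not exceed c."""
    return eigenvalue_ratio(scatter_matrices) <= c

scatters = [[[4.0, 0.0], [0.0, 1.0]],
            [[2.0, 1.0], [1.0, 2.0]]]
print(eigenvalue_ratio(scatters))
print(satisfies_constraint(scatters, c=5.0))
```

A value of c close to 1 forces nearly spherical clusters of similar size, while a large c leaves the scatter matrices almost unconstrained.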

14 How to obtain constrained models?
Let us consider the eigenvalue decomposition given by [Celeux and Govaert, 1995]:
$$\Sigma_k = \lambda_k D_k A_k D_k^T \quad \text{for } k = 1, 2, \dots, K \qquad (5)$$
where:
1. $\lambda_k = |\Sigma_k|^{1/d}$ is the volume of the $k$-th cluster ($d$ being the dimension of the data)
2. $A_k$ is a diagonal matrix holding the eigenvalues of $\Sigma_k$ scaled so that $|A_k| = 1$; it determines the shape of each cluster
3. $D_k$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma_k$; it determines the direction of each cluster
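The decomposition can be made concrete for a 2x2 scatter matrix, where eigenvalues and eigenvectors have closed forms. This is an illustrative sketch with hypothetical function names, not code from the presentation; the shape matrix is stored as the list of its diagonal entries and is normalized to have determinant 1.

```python
import math

def decompose_2x2(sigma):
    """Volume, shape and orientation of a symmetric positive definite
    matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = sigma
    mean = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    eigvals = [mean + half_gap, mean - half_gap]
    volume = math.sqrt(a * c - b * b)          # |Sigma|^(1/d) with d = 2
    shape = [ev / volume for ev in eigvals]    # diagonal of A, product 1
    # Unit eigenvector for the largest eigenvalue.
    if b == 0:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = math.hypot(b, eigvals[0] - a)
        v = (b / norm, (eigvals[0] - a) / norm)
    # Columns of D: v and the unit vector perpendicular to it.
    orientation = [[v[0], -v[1]], [v[1], v[0]]]
    return volume, shape, orientation

def rebuild_2x2(volume, shape, orientation):
    """Recompose Sigma = volume * D * diag(shape) * D^T."""
    return [[volume * sum(orientation[i][h] * shape[h] * orientation[j][h]
                          for h in range(2))
             for j in range(2)]
            for i in range(2)]

vol, shp, ori = decompose_2x2([[2.0, 1.0], [1.0, 2.0]])
print(vol, shp, ori)
print(rebuild_2x2(vol, shp, ori))
```

Holding one of the three pieces equal across clusters while letting the others vary is what generates the family of constrained models.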

15 Different parametrizations, different models: the mclust proposal
The proposal (work in progress) is to provide a robust version of all the models listed in Table 1.

Model name   Parametrization            ER             Invariance
EII          $\lambda I$                Not required   Isometric transformations
VII          $\lambda_k I$              Not required   Isometric transformations
EEI          $\lambda A$                Not required   Scaling
VEI          $\lambda_k A$              Not required   Scaling
EVI          $\lambda A_k$              Required       Translation
VVI          $\lambda_k A_k$            Required       Translation
EEE          $\lambda D A D^T$          Not required   Linear transformations
EEV          $\lambda D_k A D_k^T$      Not required   Linear transformations
VEV          $\lambda_k D_k A D_k^T$    Not required   Linear transformations
VVV          $\lambda_k D_k A_k D_k^T$  Required       Translation

Table 1: List of possible models obtained by imposing shapes on the detected clusters

16 Proposal: motivation and state of the art
Motivations:
1. Obtaining invariance properties of the estimators
2. Use of geometric constraints, giving an output the researcher can interpret
State of the art:
1. Good performance in simulations with respect to existing proposals
2. A suitable method to choose the proper model is being evaluated

17 Fuzzy and regression clustering: merging the two approaches
In [Dotto et al., 2015a] (in review for ADAC, many improvements required... crossed fingers!) we proposed a robust clusterwise regression model based on fuzziness:
1. Trimming allows us to reach robustness
2. Fuzziness allows us to deal with the uncertainty around the assignment of bridge points
Generally speaking, we aim to maximize the following objective function:
$$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \log\left( p_j \, f(y_i; x_i' b_j + b_j^0, s_j^2) \right) \qquad (6)$$
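Objective (6) can be evaluated directly once memberships and regression parameters are given. The sketch below is an illustration only: the function names are hypothetical, `f` is taken to be the univariate normal density with mean `x_i' b_j + b_j^0` and variance `s_j^2`, and, as an assumption not stated on the slide, an observation whose memberships are all zero is treated as trimmed and contributes nothing.

```python
import math

def normal_logpdf(y, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

def fuzzy_regression_objective(y, x, u, p, b, b0, s2, m):
    """Evaluate sum_i sum_j u_ij^m * log(p_j * f(y_i; x_i'b_j + b0_j, s2_j)).

    y: responses, x: list of covariate vectors, u: n-by-k memberships,
    p: cluster weights, b: slope vectors, b0: intercepts, s2: variances,
    m: exponent applied to the memberships.
    """
    total = 0.0
    for i in range(len(y)):
        for j in range(len(p)):
            if u[i][j] == 0:
                continue  # zero membership (e.g. trimmed): no contribution
            fitted = sum(xk * bk for xk, bk in zip(x[i], b[j])) + b0[j]
            total += u[i][j] ** m * (math.log(p[j])
                                     + normal_logpdf(y[i], fitted, s2[j]))
    return total

y = [1.0, 4.1, 9.0]
x = [[1.0], [1.0], [2.0]]
u = [[1.0, 0.0], [0.3, 0.7], [0.0, 0.0]]   # last row: trimmed observation
value = fuzzy_regression_objective(
    y, x, u, p=[0.5, 0.5], b=[[1.0], [4.0]], b0=[0.0, 0.0],
    s2=[1.0, 1.0], m=2,
)
print(value)
```

With memberships restricted to 0 and 1 the same function reduces to a crisp classification log-likelihood, which shows how the fuzzy model extends the hard-assignment one.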

18 Linear clustering
Clustering methods are generally based on estimating k groups around suitably defined centroids
⇓
Each unit is assigned to the group that minimizes a function of its distance from the centroid.
Linear clustering methods are used to search for k structures around an explanatory variable
⇓
The assignment of a unit to a group is based on the minimization of the regression error.
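The assignment rule for linear clustering can be sketched as follows, assuming a single explanatory variable, already fitted lines, and squared residuals as the regression error; the function name is hypothetical.

```python
def assign_to_lines(points, lines):
    """Assign each (x, y) point to the line with the smallest squared residual.

    lines is a list of (slope, intercept) pairs, one per group.
    """
    labels = []
    for x, y in points:
        residuals = [(y - (slope * x + intercept)) ** 2
                     for slope, intercept in lines]
        labels.append(residuals.index(min(residuals)))
    return labels

points = [(0.0, 0.1), (1.0, 1.2), (0.0, 5.0), (1.0, 3.9)]
lines = [(1.0, 0.0), (-1.0, 5.0)]   # y = x and y = 5 - x
print(assign_to_lines(points, lines))
```

Replacing the squared residual with a distance from a centroid gives back the usual centroid-based assignment, so the two approaches differ only in the discrepancy being minimized.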

19 Fuzzy clustering
For a given dataset $x = (x_1, \dots, x_n)$ with $x_i \in \mathbb{R}^p$, a unit can be assigned to a cluster $1 \le j \le k$ following two approaches:

Crisp assignments
$$u_{ij} \in \{0, 1\} \qquad (7)$$
where $u_{ij} = 1$ if $x_i \in j$ and $u_{ij} = 0$ if $x_i \notin j$. Each observation belongs to only one cluster.

Fuzzy assignments
$$u_{ij} \in [0, 1] \qquad (8)$$
where $u_{ij} = 1$ if $x_i$ is fully assigned to $j$ and $u_{ij} = 0$ if $x_i$ is not assigned to $j$. Intermediate assignments are taken into account.

20 Sketch of the simulation study: MSE of the estimated $\beta$
[Figure: eight panels, (a) to (h); x-axis labels read "creg EM A creg FTCR" in each row of panels.]

21 References I
Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28.
Dotto, F., Farcomeni, A., García-Escudero, L., and Mayo-Iscar, A. (2015a). A fuzzy approach to robust clusterwise regression. In review for Advances in Data Analysis and Classification.
Dotto, F., Farcomeni, A., García-Escudero, L., and Mayo-Iscar, A. (2015b). The rtclust procedure for robust clustering. In 10th Scientific Meeting of the CLassification and Data Analysis Group of the Italian Statistical Society. Book of Abstracts. CUEC.

22 References II
Farcomeni, A. and Greco, L. (2015). Robust Methods for Data Reduction. Chapman & Hall/CRC Press.
García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Ann Stat, 36.
Ritter, G. (2015). Robust Cluster Analysis and Variable Selection. Chapman & Hall/CRC Press.
