A clustering approach to detect multiple outliers in linear functional relationship model for circular data

Journal of Applied Statistics ISSN: 0266-4763 (Print)

To cite this article: Nurkhairany Amyra Mokhtar, Yong Zulina Zubairi & Abdul Ghapor Hussin (2017): A clustering approach to detect multiple outliers in linear functional relationship model for circular data, Journal of Applied Statistics, DOI: 10.1080/02664763.2017.1342779

Published online: 29 Jun 2017.

Nurkhairany Amyra Mokhtar (a), Yong Zulina Zubairi (b) and Abdul Ghapor Hussin (a)

(a) Faculty of Defence Sciences and Technology, National Defence University of Malaysia, Kuala Lumpur, Malaysia; (b) Centre for Foundation Studies in Science, University of Malaya, Kuala Lumpur, Malaysia

ABSTRACT

Outlier detection is used extensively in data analysis to detect anomalous observations. It has important applications in fraud detection and robust analysis, among others. In this paper, we propose a method for detecting multiple outliers in the linear functional relationship model for circular variables. We apply hierarchical clustering to the residuals of the Caires and Wyatt model and use a tree diagram to illustrate the detection of outliers graphically. A Monte Carlo simulation study is carried out to verify the accuracy of the proposed method. Low probabilities of masking and swamping indicate the validity of the proposed approach. Illustrations on two sets of real data show its practical applicability.

ARTICLE HISTORY: Received 8 June 2016; Accepted 29 May 2017

KEYWORDS: Linear functional relationship model; clustering; outliers detection; wind data; circular variables

1. Introduction

An outlying observation, or outlier, is one that appears to deviate markedly from the other members of the sample in which it occurs. An outlier may result from a gross deviation from the prescribed experimental procedure or from an error in calculating or recording the numerical value. If reasons are found for aberrant observations, then one should act accordingly and perhaps also scrutinise the other observations [9]. Published works on the identification of outliers are plentiful for linear data; some examples are Grubbs [9], Rahman et al. [19], Sebert et al. [23] and Adnan et al. [3].
However, for circular data the approach is not straightforward because the data wrap around the circle. For the circular regression model, the Difference Mean Circular Error statistic (DMCE) was proposed to detect outliers [1]. The method has been shown to be applicable to the Down and Mardia [7] circular regression model [20]. The COVRATIO statistic has also been proposed to identify outliers in a general linear regression model [2]. Recently, a clustering method for the Down and Mardia circular-circular regression model has been proposed [22]. It is worthwhile to note that the aforementioned methods are for regression models, which assume an error term for the response variable only. When both variables are observed with error, we can describe the relationship using the linear functional relationship model.

CONTACT: Yong Zulina Zubairi, yzulina@um.edu.my, Centre for Foundation Studies in Science, University of Malaya, Kuala Lumpur 50603, Malaysia

© 2017 Informa UK Limited, trading as Taylor & Francis Group

The functional relationship model is part of the general class of error-in-variables models (EIVM), in which the underlying variables are deterministic or fixed [10]. Other models in the EIVM class are the structural relationship model and the ultrastructural relationship model. In the structural relationship model the variables are random, whereas the ultrastructural relationship model is a synthesis of the linear functional and structural relationship models.

With regard to outlier detection in the linear functional relationship model, Hussin et al. [12] used the COVRATIO statistic based on a row deletion approach for the Caires and Wyatt model [5]. Later, in 2014, another statistic, the Functional Difference Mean Circular Error statistic (FDMCE), was proposed for the Caires and Wyatt model [24]. However, these two methods detect outliers based on statistical measures only. Thus, the motivation of our study is to propose an alternative graphical approach, using a dendrogram, to detect outliers in the linear functional relationship model.

Section 2 describes the materials and methods used in this study: circular data, the linear functional relationship model, the single linkage hierarchical agglomerative clustering technique, the power of performance measures and a simulation study of the proposed method. Section 3 presents the results of the simulation study and the application to real data sets, and Section 4 summarises the proposed method.

2. Materials and methods

2.1. Circular data

Data that occur around a circle are known as circular data and are usually measured in degrees in the interval [0°, 360°) or in radians in the interval [0, 2π). Circular data can be of the vectorial or the axial type. Vectorial data have directed segments, where both angle and direction are associated with a point. Axial data refer to the angular position of random lines in which neither end can be identified as the starting point.
Examples of axial and circular data can be found in ecological and environmental studies, such as the direction of the earth's magnetic pole [4], wind direction [16] and angles of knee flexion measured to assess the recovery of orthopaedic patients [13].

The most popular circular distribution is the Von Mises distribution, with probability density function

$$g(\theta; \mu, \kappa) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos(\theta - \mu)}, \qquad (1)$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind and order zero, which can be defined by

$$I_0(\kappa) = \frac{1}{2\pi} \int_0^{2\pi} e^{\kappa \cos\theta}\, d\theta, \qquad (2)$$

where $\mu$ is the mean direction and $\kappa$ is the concentration parameter. The concentration parameter $\kappa$ influences the Von Mises distribution $VM(\mu, \kappa)$ inversely to the way $\sigma^2$ influences the Normal distribution $N(\mu, \sigma^2)$. Thus, a concentrated Von Mises distribution has a large concentration parameter, and a dispersed Von Mises distribution has a small concentration parameter [5].
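To make the density in (1) and the integral definition in (2) concrete, the following is a minimal Python sketch. The function names `bessel_i0` and `von_mises_pdf` and the midpoint-rule step count are illustrative choices, not part of the paper.

```python
import math

TWO_PI = 2.0 * math.pi


def bessel_i0(kappa, steps=2000):
    """Approximate I_0(kappa) from definition (2) with the midpoint rule.

    The step count is an arbitrary illustrative choice.
    """
    width = TWO_PI / steps
    total = sum(math.exp(kappa * math.cos((j + 0.5) * width)) for j in range(steps))
    return total * width / TWO_PI


def von_mises_pdf(theta, mu, kappa):
    """Density (1) of the Von Mises distribution VM(mu, kappa)."""
    return math.exp(kappa * math.cos(theta - mu)) / (TWO_PI * bessel_i0(kappa))


if __name__ == "__main__":
    # A larger kappa gives a higher peak at the mean direction.
    for kappa in (0.5, 5.0, 20.0):
        print(kappa, von_mises_pdf(math.pi / 4, math.pi / 4, kappa))
```

With $\kappa = 0$ the density reduces to the circular uniform value $1/(2\pi)$, which is a convenient sanity check on any implementation.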

2.2. Linear functional relationship model for circular variables

As mentioned earlier, the functional relationship model is part of the general class of EIVM, in which the underlying variables are deterministic or fixed [18]. In the ordinary linear regression model, for any pair of observations $(x_i, y_i)$, $i = 1, 2, \ldots, n$, it is assumed that the x values are observed without error and only the y variable is observed with error; in this case, x is known as the explanatory variable and y as the response variable. In EIVM, however, both x and y are observed with error. In addition, in EIVM there is no distinction between explanatory and response variables.

It is worthwhile to note that data that fit the functional relationship model are different from functional data. Functional data deal with the theory of data that are in the form of functions, images and shapes, or more general objects [6]. They are no longer a set of discrete time point values but a continuous curve, termed functional data. The statistical methodology for analysing functional data is called functional data analysis [21].

This paper focuses on circular data that have a linear functional relationship. The functional relationship model for circular data was first introduced in 1997 and is known as the unreplicated linear functional relationship model [11]. When the slope parameter $\beta$ of that model equals 1, the model is known as the Caires and Wyatt model, in connection with the desired symmetry of the functional relationship model [5]. The functional relationship model for this study can be expressed as

$$Y = X + \alpha \pmod{2\pi}, \qquad (3)$$

with $x_i = X_i + \delta_i$ and $y_i = Y_i + \varepsilon_i$, where $i = 1, 2, \ldots, n$, for some rotation parameter $\alpha$. The random errors $\delta_i$ and $\varepsilon_i$ are assumed to be independently distributed with Von Mises distributions, $\delta_i \sim VM(0, \kappa)$ and $\varepsilon_i \sim VM(0, \nu)$, respectively.
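As an illustration of model (3), the sketch below generates observed pairs $(x_i, y_i)$ with Von Mises errors using Python's standard `random.vonmisesvariate`. The function names, the seed argument and the default distribution of the true X values are assumptions made for this example. The helper `mean_direction` is only a rough sanity check on the rotation, not the maximum likelihood procedure of [17].

```python
import math
import random

TWO_PI = 2.0 * math.pi


def simulate_caires_wyatt(n, alpha, kappa, nu, mu_x=math.pi / 4, kappa_x=3.0, seed=None):
    """Draw n observed pairs (x_i, y_i) from model (3).

    mu_x and kappa_x describe how the true X values are generated; they are
    illustrative defaults, not a requirement of the model.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        true_x = rng.vonmisesvariate(mu_x, kappa_x)
        true_y = (true_x + alpha) % TWO_PI          # Y = X + alpha (mod 2*pi)
        delta = rng.vonmisesvariate(0.0, kappa)     # error on x, VM(0, kappa)
        eps = rng.vonmisesvariate(0.0, nu)          # error on y, VM(0, nu)
        pairs.append(((true_x + delta) % TWO_PI, (true_y + eps) % TWO_PI))
    return pairs


def mean_direction(angles):
    """Circular mean of a list of angles, returned in [0, 2*pi)."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % TWO_PI


if __name__ == "__main__":
    data = simulate_caires_wyatt(100, math.pi / 4, kappa=10.0, nu=10.0, seed=0)
    # Rough check only: the mean direction of y - x should sit near alpha.
    print(mean_direction([(y - x) % TWO_PI for x, y in data]))
```

Because both errors are centred at zero, the differences $y_i - x_i$ cluster around $\alpha$, which is what the final print statement inspects.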
The parameter estimates may be obtained by maximum likelihood estimation, which involves an iterative technique [17].

2.3. The single linkage of hierarchical agglomerative clustering technique in detecting outliers

Outlier detection is important in many data analyses to ensure the robustness of the estimation. Clustering algorithms are often used to detect outliers in linear data. One common method is hierarchical agglomerative clustering, where we start by defining each data point to be a cluster and then combine existing clusters using linkages. In the single linkage method, groups are formed from the individual objects by merging nearest neighbours, where "nearest neighbour" connotes the smallest distance or the largest similarity [14].

Initially, we find the minimum distance in $D = \{d_{ik}\}$ and merge the corresponding objects, say U and V, to get the cluster (UV). The next step is to find the distance between (UV) and any other cluster W, which is

$$d_{(UV)W} = \min\{d_{UW}, d_{VW}\}. \qquad (4)$$

The quantity $d_{UW}$ is the distance between the nearest neighbours of clusters U and W, and $d_{VW}$ is the distance between the nearest neighbours of clusters V and W. The results of single linkage clustering can be displayed graphically in the form of a tree diagram, or dendrogram. The branches in the tree represent clusters, and they merge at nodes whose positions along a distance (or similarity) axis indicate the level at which the fusions occur.

In this paper, a clustering-based procedure is developed for the predicted and the residual values of the data to detect outliers in the Caires and Wyatt linear functional relationship model $Y = X + \alpha \pmod{2\pi}$ for the range [0, 2π) radians. In view of the wrapped-around nature of angles for circular data, we define the measure of similarity as

$$d_{ij} = \sum_{k=1}^{p} \left(\pi - \bigl|\pi - |\theta_{ik} - \theta_{jk}|\bigr|\right) \bmod 2\pi, \qquad (5)$$

where $d_{ij}$ is the distance between observations i and j, p is the number of variables and $\theta_{ik}$ is the value of the kth variable for the ith observation. Note that the mod 2π operation is applied to $d_{ij}$ because the data here are measured over the range (0, 2π].

Estimating the number of target clusters and deciding the optimal level of a dendrogram can be quite challenging [15]. A fixed probability interval can be calculated around the mean direction $\mu$, and the circular standard deviation of the Von Mises distribution can be interpreted by the method proposed in [8]. An interval of the form $(\mu \pm \theta_P)$ contains a specified percentage of the distribution. For example, the values $\theta_{68.3}$ and $\theta_{95.4}$ are such that $\mu \pm \theta_{68.3}$ and $\mu \pm \theta_{95.4}$ are intervals containing exactly 68.3% and 95.4% of the distribution, respectively [8]. Some multipliers $\xi_P$ given by Fisher for a selection of possible P-values are:

$$P\% = 75\%, \quad \xi_{75} = 1.12, \quad (\kappa \ge 0.35,\ \rho \ge 0.17), \qquad (6)$$

$$P\% = 90\%, \quad \xi_{90} = 1.69, \quad (\kappa \ge 0.65,\ \rho \ge 0.31), \qquad (7)$$

$$P\% = 95\%, \quad \xi_{95} = 2.06, \quad (\kappa \ge 0.8,\ \rho \ge 0.37). \qquad (8)$$

In this study, we consider a 95% confidence interval.
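The merging rule (4) and a wrap-around distance of the kind used in (5) can be sketched in plain Python as follows. This is an illustrative implementation under assumptions: the distance is taken as the sum over variables of $\pi - |\pi - |\theta_{ik} - \theta_{jk}||$, and the names `circular_distance` and `single_linkage` are hypothetical. It is not the authors' S-Plus code.

```python
import math

TWO_PI = 2.0 * math.pi


def circular_distance(a, b):
    """Sum over variables of pi - |pi - |a_k - b_k||, taken mod 2*pi."""
    total = sum(math.pi - abs(math.pi - abs(ak - bk)) for ak, bk in zip(a, b))
    return total % TWO_PI


def single_linkage(points):
    """Return the merges as (height, members) pairs, in the order they occur."""
    clusters = {i: [i] for i in range(len(points))}
    dist = {}
    for i in clusters:
        for j in clusters:
            if i < j:
                dist[(i, j)] = circular_distance(points[i], points[j])
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        (u, v), height = min(dist.items(), key=lambda item: item[1])
        merged = sorted(clusters[u] + clusters[v])
        merges.append((height, merged))
        # Rule (4): distance from the new cluster (UV) to W is min(d_UW, d_VW).
        for w in clusters:
            if w not in (u, v):
                d_uw = dist.pop((min(u, w), max(u, w)))
                d_vw = dist.pop((min(v, w), max(v, w)))
                dist[(w, next_id)] = min(d_uw, d_vw)
        del dist[(u, v)]
        del clusters[u], clusters[v]
        clusters[next_id] = merged
        next_id += 1
    return merges


if __name__ == "__main__":
    # Three angles near 0 (one of them just below 2*pi) and one isolated angle.
    for height, members in single_linkage([[0.1], [0.2], [6.2], [3.0]]):
        print(round(height, 4), members)
```

In the small example, the angle 6.2 joins the group near zero at a low height because the distance wraps around the circle, while the isolated angle 3.0 joins last at a much larger height. That late, high merge is the pattern a dendrogram cut is meant to pick up.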
For this choice, the stopping rule (9) for the dendrogram is given in terms of $\bar{h}$ and $s_h$, where $\bar{h}$ is the average of the heights for all $N - 1$ clusters, $s_h = \sqrt{-2 \log R_h}$ is the circular standard deviation of the heights and $R_h$ is the mean resultant length of the heights, $R_h = \sqrt{C^2 + S^2}/n$ [22], with C and S the sums of the cosines and sines of the heights. This means that a cluster group that exceeds the stopping rule given in (9) is considered to consist of outliers.

2.4. Power of performance

The performance of a detection method can be measured by two misclassification errors, namely masking and swamping. Masking is the inability of a detection method to correctly classify a true outlier; that is, the outlier is falsely detected as an inlier [23]. Swamping, on the other hand, occurs when a detection method classifies an inlier as an outlier. The power of performance of the clustering technique is evaluated using three measures: the probability of success, the probability of masking error and the probability of swamping error.
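The quantities entering the stopping rule can be computed as in the sketch below. The circular standard deviation follows the definitions $s_h = \sqrt{-2\log R_h}$ and $R_h = \sqrt{C^2 + S^2}/n$. The way they are combined into a cut height is an assumption: the code uses $\bar{h} + c\, s_h$ with a user-supplied multiplier `c`, since the exact form of (9) is not legible in this copy. The function names are hypothetical.

```python
import math


def circular_sd(heights):
    """Circular standard deviation sqrt(-2 log R) of the merge heights."""
    n = len(heights)
    c_sum = sum(math.cos(h) for h in heights)
    s_sum = sum(math.sin(h) for h in heights)
    r = min(math.sqrt(c_sum * c_sum + s_sum * s_sum) / n, 1.0)  # guard rounding above 1
    if r == 0.0:
        return math.inf
    return math.sqrt(max(-2.0 * math.log(r), 0.0))


def cut_height(heights, c):
    """ASSUMED form of the cut height: mean height plus c times the circular sd."""
    h_bar = sum(heights) / len(heights)
    return h_bar + c * circular_sd(heights)


def merges_above_cut(heights, c):
    """Indices of merges whose height exceeds the assumed cut height."""
    cut = cut_height(heights, c)
    return [i for i, h in enumerate(heights) if h > cut]


if __name__ == "__main__":
    heights = [0.10, 0.12, 0.11, 0.09, 2.80]
    # c = 2.06 is the 95% multiplier quoted from Fisher; using it here is an assumption.
    print(cut_height(heights, 2.06), merges_above_cut(heights, 2.06))
```

Observations that only join the main cluster through a merge above the cut would then be the candidates for outliers.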

The probability of success is given by

$$pout = \frac{\text{success}}{s}, \qquad (10)$$

where success is the number of data sets in which all of the planted outliers are successfully identified and s is the number of simulations. The probability of masking error is

$$pmask = \frac{\text{failure}}{(out)\, s}, \qquad (11)$$

where failure is the number of outliers in all data sets that are detected as inliers. The probability of swamping error is

$$pswamp = \frac{\text{false}}{(n - out)\, s}, \qquad (12)$$

where false is the number of inliers in all data sets that are detected as outliers and out is the number of planted outliers.

2.5. Simulation study

A simulation study was carried out using the S-Plus statistical software to study the power of performance of the clustering method. The number of simulations, s, is set to 5000 for each category. The values of X are generated from the Von Mises distribution $VM(\pi/4, 3)$, and the rotation parameter is $\alpha = \pi/4$. The concentration parameters of the error terms considered are $\kappa = 5, 10, 15$ and 20. For each value of $\kappa$, the sample sizes are n = 30, 50, 100 and 130. We also fix the number of outliers at three, at three random positions $d_1$, $d_2$ and $d_3$. The three observations of the Y variable at positions $d_t$, $t = 1, 2, 3$, are turned into outliers by the contamination given in (13) [2]:

$$Y^{*}_{d_t} = Y_{d_t} + \omega\pi \pmod{2\pi}, \qquad (13)$$

where $Y^{*}_{d_t}$ is the value after contamination and $\omega$ is the degree of contamination, in the range $0 \le \omega \le 1$. This study considers only the case in which the Y variable is contaminated, not the X variable.

The parameters of the generated data are estimated for the Caires and Wyatt model. Then the predicted values $\hat{y}$ and the residuals $\hat{e} = \hat{y} - y$ are obtained and clustered by the single linkage hierarchical clustering method described earlier.

3. Results and discussions

3.1. Simulation results

Table 1 shows the probability of success (pout) of the clustering technique.
For fixed n, the probability of success increases as the concentration parameter and the level of contamination increase. The highest concentration parameter and the highest level of contamination give the highest probability of success, with a value very close to 1.
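The contamination step (13) and the three measures (10) to (12) can be written compactly as below. This is a hedged sketch: the function names and the representation of each simulation run as a pair (detected indices, planted indices) are assumptions made for illustration, and the detection step itself is left out.

```python
import math

TWO_PI = 2.0 * math.pi


def contaminate(y, positions, omega):
    """Apply (13): shift y at the planted positions by omega*pi, mod 2*pi."""
    shifted = list(y)
    for d in positions:
        shifted[d] = (shifted[d] + omega * math.pi) % TWO_PI
    return shifted


def performance(runs, n):
    """Compute pout, pmask and pswamp from a list of simulation runs.

    Each run is a pair (detected, planted) of index collections; every run is
    assumed to have the same number of planted outliers and sample size n.
    """
    s = len(runs)
    out = len(set(runs[0][1]))
    success = failure = false = 0
    for detected, planted in runs:
        detected, planted = set(detected), set(planted)
        if planted <= detected:
            success += 1                       # all planted outliers found
        failure += len(planted - detected)     # outliers taken for inliers
        false += len(detected - planted)       # inliers taken for outliers
    return {
        "pout": success / s,
        "pmask": failure / (out * s),
        "pswamp": false / ((n - out) * s),
    }


if __name__ == "__main__":
    runs = [({1, 2, 3}, {1, 2, 3}), ({1, 2}, {1, 2, 3}), ({1, 2, 3, 7}, {1, 2, 3})]
    print(performance(runs, n=10))
```

Note that a run in which all planted outliers are found still counts as a success even if an inlier is flagged as well; that extra flag is charged to the swamping measure only.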

Table 1. Probability of success of the clustering technique (columns: n, ω, κ = 5, κ = 10, κ = 15, κ = 20).

To compare the probability of success for different values of κ, we plot the results for a particular n, say n = 100, as given in Figure 1. We note that the larger the value of κ, the faster the probability of success approaches 1 as the level of contamination increases.

As mentioned earlier, another measure of performance is the masking error. Masking skews the mean and covariance estimates, so that the distance between the outliers and the mean is small. Table 2 shows the probability of masking of the clustering technique. In this simulation study, we note that the probability of masking error decreases when both the level of contamination and the concentration parameter increase. The highest concentration parameter and the highest level of contamination give the lowest probability of masking error, with a value very near 0.

To illustrate the effect of the level of contamination for different values of κ, we plot the results for n = 100 in Figure 2. The larger the value of κ, the faster the probability of masking decreases as the level of contamination increases.

Another measure is the swamping effect. Swamping occurs when outlying observations skew the mean and covariance estimates towards themselves and away from the non-outlying observations, making the distance from the non-outlying observations to the mean large. This makes them look like outliers. Table 3 shows the probability of swamping of the clustering technique. The probability of swamping error decreases when the level of contamination and the concentration parameter increase. We note that the highest probability of masking error is not more than […]. Also, for n = 100, the larger the value of κ, the faster the probability of swamping decreases as the level of contamination increases.
Figure 3 is plotted to illustrate the trend of the swamping error of this clustering method.

Figure 1. Plot of probability of success (pout) versus the level of contamination for n = 100 (curves for κ = 5, 10, 15 and 20).

Table 2. Probability of masking of the clustering technique (columns: n, ω, κ = 5, κ = 10, κ = 15, κ = 20).

Figure 2. Plot of probability of masking (pmask) versus the level of contamination for n = 100 (curves for κ = 5, 10, 15 and 20).

Table 3. Probability of swamping of the clustering technique (columns: n, ω, κ = 5, κ = 10, κ = 15, κ = 20).

Figure 3. Plot of probability of swamping (pswamp) versus the level of contamination for n = 100 (curves for κ = 5, 10, 15 and 20).

3.2. Application to real data

Previous researchers in circular statistics, such as Hussin et al. [12] and Shamsudheen [24], have used wind direction data taken from the Humberside Coast, UK, developed by the UK Rutherford and Appleton Laboratories, to illustrate the presence of outliers. The sample size is 129. Variable x is measured by an HF radar system, which uses pulse radar and operates at a frequency of […] MHz. Variable y is measured by an anchored wave buoy. These authors established that observations 38 and 111 are the outliers of the data set, and it has been established that the data can be modelled using the Caires and Wyatt model [17]. Thus, we illustrate the presence of the outliers in the same data set using the new clustering method for the Caires and Wyatt linear functional relationship model.

Figure 4 shows the tree diagram, or dendrogram, of the wind direction data of the Humberside Coast, with the mean of the heights, $\bar{h}$, equal to […] and the standard deviation of the heights, $s_h$, equal to […]. The cut height of the dendrogram based on Equation (9) is […]. There are three groups of observations: Group 1 contains most of the observations, Group 2 contains observation 111 and Group 3 contains observation 38. Therefore, from this tree diagram, we conclude that Group 1 contains the inliers and that observations 111 and 38 are outliers, with 95% confidence.

Figure 4. The plot of tree diagram with a cut-off at the height of […] for wind direction data of Humberside Coast, UK.

Another real data set used to illustrate the applicability of this method is wind direction data collected at Bayan Lepas Airport, Penang, Malaysia, located at 16.3 m above ground level, latitude […] N and longitude […] E, with n = 62. The variable x in this data set is the wind direction at a pressure of 850 hPa and a height of 5000 m, while the variable y is the wind direction at a pressure of 1000 hPa and a height of 300 m. Figure 5 shows the dendrogram of the data, where the cut height is […]. The outliers detected are observations 47, 12 and 57, with 95% confidence.

Figure 5. The plot of tree diagram with a cut-off at the height of […] for wind direction data of Bayan Lepas, Malaysia.

4. Summary

In this paper, we consider the single linkage hierarchical clustering method for the detection of multiple outliers in circular data in the range [0, 2π) radians. The dendrogram produced has a cut-off height computed from $\bar{h}$ and $s_h$, where $\bar{h}$ is the average of the heights for all $N - 1$ clusters, $s_h = \sqrt{-2 \log R_h}$ is the circular standard deviation of the heights and $R_h$ is the mean resultant length of the heights. Potential outliers are classified at the 95% confidence level when the data exceed the stopping rule. Results from the simulation study confirm the validity of the hierarchical clustering technique, with low probabilities of swamping and masking. The method is illustrated using real wind direction data from the Humberside Coast, UK, and Bayan Lepas Airport, Malaysia. Moreover, our method correctly identifies the outliers established by other studies.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

We would like to thank the National Defence University of Malaysia and the University of Malaya (research grant BKS […]) for supporting this work.

References

[1] A.H. Abuzaid, A.G. Hussin, and I.B. Mohamed, Detection of outliers in simple circular regression models using the mean circular error statistic, J. Stat. Comput. Simul. 83 (2013).
[2] A.H. Abuzaid, I.B. Mohamed, A.G. Hussin, and A. Rambli, COVRATIO statistic for simple circular regression model, Chiang Mai J. Sci. 38 (2011).
[3] R. Adnan, M.N. Mohamad, and H. Setan, Multiple outliers detection procedures in linear regression, Matematika 19 (2003).
[4] E. Batschelet, Circular Statistics in Biology, Academic Press, London.
[5] S. Caires and L.R. Wyatt, A linear functional relationship model for circular data with an application to the assessment of ocean wave measurement, J. Agric. Biol. Environ. Stat. 8 (2003).
[6] J.C. Davis, Statistics and Data Analysis in Geology, 3rd ed., Wiley India.
[7] T.D. Down and K.V. Mardia, Circular regression, Biometrika 89 (2002).
[8] N.I. Fisher, Problems with the current definitions of the standard deviation of wind direction, J. Clim. Appl. Meteorol. 26 (1987).
[9] F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1969).
[10] S.F. Hassan, A.G. Hussin, and Y.Z. Zubairi, Estimation of functional relationship model for circular variables and its application in measurement problem, Chiang Mai J. Sci. 37 (2010).
[11] A.G. Hussin, Pseudo-replication in functional relationship with environmental application, Ph.D. thesis, University of Sheffield, England.
[12] A.G. Hussin, A. Abuzaid, F. Zulkifli, and I. Mohamed, Asymptotic covariance and detection of influential observation in a linear functional relationship model for circular data with application to the measurements of wind directions, Sci. Asia 36 (2010).
[13] S.R. Jammalamadaka and A. Sengupta, Topics in Circular Statistics, World Scientific Publishing.
[14] R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis, Pearson, New Jersey.
[15] Y. Jung, H. Park, D. Du, and B.L. Drake, A decision criterion for the optimal number of clusters in hierarchical clustering, J. Global Optim. 25 (2003).
[16] K.V. Mardia and P.E. Jupp, Directional Statistics, John Wiley & Sons, West Sussex.
[17] N.A. Mokhtar, Y.Z. Zubairi, and A.G. Hussin, A simple linear functional relationship model for circular variables and its application, Proceedings of the 9th International Conference on Renewable Energy Sources (RES 15), Kuala Lumpur, Malaysia, 2015.
[18] N.A. Mokhtar, Y.Z. Zubairi, and A.G. Hussin, Parameter estimation of simultaneous functional relationship model for circular variables assuming equal error variances, Pakistan J. Statist. 31 (2015).
[19] S.M.A.K. Rahman, M.M. Sathik, and K.S. Kannan, Multiple linear regression models in outlier detection, Int. J. Res. Comput. Sci. 2 (2012).
[20] A. Rambli, I. Mohamed, and A.H.M. Abuzaid, Identification of influential observations in circular regression model, Proceedings of the Regional Conference on Statistical Sciences (RCSS 10), 2010.
[21] J.O. Ramsay and B.W. Silverman, Functional Data Analysis, Springer, New York, 1997.
[22] S.Z. Satari, Parameter estimation and outlier detection for some types of circular model, Ph.D. thesis, University of Malaya, Malaysia.
[23] D.M. Sebert, D.C. Montgomery, and D.A. Rollier, A clustering algorithm for identifying multiple outliers in linear regression, Comput. Statist. Data Anal. 27 (1998).
[24] M.I. Shamsudheen, Bootstrapping and outlier detection problems in linear functional relationship model for circular data, Master thesis, Universiti Pertahanan Nasional Malaysia, Malaysia, 2014.
