Research Article Outlier Removal Approach as a Continuous Process in Basic K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 7(4), 2014
Maxwell Scientific Publication Corp.
Submitted: April 2013; Accepted: April 2013; Published: January 2014

Dauda Usman and Ismail Bin Mohamad, Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, UTM Johor Bahru, Johor Darul Ta'azim, Malaysia

Abstract: Clustering techniques are used to put similar data items into the same group. K-means clustering is a commonly used clustering approach that starts from randomly selected initial centroids. However, the existing method does not consider data preprocessing, which is an important task before clustering different databases. This study proposes a new approach to the k-means clustering algorithm. Experimental analysis shows that the proposed method performs well on an infectious disease data set when compared with the conventional k-means clustering method.

Keywords: Infectious diseases, k-means clustering, principal component analysis, principal components, standardization

INTRODUCTION

Data analysis techniques are necessary for studying the continually growing volumes of large data sets. In this context, cluster analysis (Hastie et al., 2001) tries to uncover structure in the data by dividing data items into disjoint classes, so that items belonging to the same cluster are similar while items belonging to different clusters tend to differ. Among the best known and most effective clustering techniques is the K-means technique (Hartigan and Wong, 1979), which uses prototypes (centroids) to represent clusters by minimizing the error sum of squares. (A detailed account of K-means and related techniques is given in Jain and Dubes (1988).)

The computational cost of the traditional K-means algorithm is extremely large, especially for huge data sets. Moreover, the number of distance computations rises greatly as the dimensionality of the data increases. When the dimensionality increases, usually only a few dimensions are relevant to specific clusters, and data on the irrelevant dimensions may generate a great deal of noise and conceal the true clusters that could otherwise be observed. Furthermore, as dimensionality rises, data normally become extremely sparse: data elements positioned along separate dimensions may be regarded as almost equally distant, and the distance measure, which is essential for cluster exploration, becomes useless. Therefore, feature reduction, or dimensionality reduction, is the central data-preprocessing step in cluster analysis of data sets with a huge number of features. High-dimensional data are often transformed into lower-dimensional data through Principal Component Analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), whereby coherent patterns can be detected more easily. This type of unsupervised dimension reduction is commonly employed in tremendously broad areas, including meteorology, image processing, genomic analysis and information retrieval. It is also well known that PCA can be used to project data into a lower-dimensional subspace and that K-means can then be applied in the subspace (Zha et al., 2001). In other instances, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian and K-means is then employed (Ng et al., 2001).
A very important reason for PCA-based dimension reduction is that it retains the dimensions carrying the main variances. This is equivalent to finding the optimal low-rank approximation of the data (in the L2 norm) using the SVD (Eckart and Young, 1936). However, the dimension-reduction property on its own is inadequate to explain the potency of PCA. In this study, we look at the link between these two frequently used approaches and a data standardization process. We show that principal component analysis and standardization are basically the continuous solution for the cluster membership indicators of the K-means clustering technique, i.e., the PCA dimension reduction automatically performs data clustering in line with the K-means objective function. This gives an essential justification for PCA-based data reduction. The result also suggests effective ways to address the K-means clustering problem. The K-means technique employs K prototypes, the centroids of the clusters, to characterize the data; these are determined by minimizing the error sum of squares.

K-means clustering algorithm: A conventional procedure for k-means clustering is straightforward. To start, we decide the number of groups K and assume a centroid, or center, for each of these groups. Any randomly chosen items, or simply the first K items in the series, can serve as the initial centroids. The K-means technique then performs the three steps listed here until convergence. Iterate until stable (i.e., no item moves between groups):

- Determine the centroid coordinates
- Determine the distance of every item to the centroids
- Assign each item to the group with the minimum distance

Principal component analysis: Mathematically, PCA is an orthogonal linear transformation of the data to a new coordinate system such that the largest variance of any projection of the data lies on the first coordinate (known as the first principal component), the second largest on the second coordinate, and so on. It transforms a number of possibly correlated variables into a small number of uncorrelated variables called principal components. PCA is a statistical technique for determining the key variables in a high-dimensional data set that account for differences in the observations, and it is very useful for analysis and visualization with very little loss of information.

Principal components: Principal components can be determined by the eigenvalue decomposition of a data set's correlation or covariance matrix, or by the SVD of the data matrix, normally after mean-centering the data for every feature. The covariance matrix is preferred when the features have comparable variances; the correlation matrix is the better choice when the features are of different types or scales. Likewise, the SVD method is employed for numerical precision.
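The three-step iteration described above can be written down directly. The following minimal NumPy sketch illustrates that basic loop; it is not the authors' implementation, and the array X, the cluster count k and the random initialization are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: choose K initial centroids, then repeatedly
    (1) recompute centroid coordinates, (2) compute item-to-centroid
    distances and (3) reassign items to the nearest centroid,
    stopping when no item changes its group."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # converged: zero items moved group
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):    # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss

# Illustrative run on random data (not the infectious-disease data set).
X = np.random.default_rng(1).normal(size=(20, 7))
labels, centroids, wcss = kmeans(X, k=2)
```

With k = 2 this mirrors the two-cluster setting used in the experiments reported later.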
LITERATURE REVIEW

Many efforts have been made by researchers to enhance the performance and efficiency of the traditional k-means algorithm. Principal Component Analysis, as used by Valarmathie et al. (2009) and Yan et al. (2006), is an unsupervised feature reduction technique for projecting high-dimensional data into a new, lower-dimensional representation that explains as much of the variance in the data as possible with minimum reconstruction error. Chris and Xiaofeng (2006) proved that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering and also proved that the subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms. The result signifies that unsupervised dimension reduction is closely related to unsupervised learning. For dimension reduction, the result gives new insights into the observed usefulness of PCA-based data reduction, beyond the traditional noise-reduction justification. Mapping data points into a higher-dimensional space by means of kernels indicates that the solution for kernel K-means is provided by kernel PCA. For learning, the results suggest effective techniques for K-means clustering. In Ding and He (2004), PCA is used to reduce the dimensionality of the data set and the k-means algorithm is then applied in the PCA subspace. Executing PCA is the same as carrying out Singular Value Decomposition (SVD) on the covariance matrix of the data. Karthikeyani and Thangavel (2009) employ the SVD technique to determine arbitrarily oriented subspaces with very good clustering. Karthikeyani and Thangavel (2009) also extended the K-means clustering algorithm by applying global normalization before performing the clustering on distributed data sets, without necessarily downloading all the data to a single site. The performance of the proposed normalization-based distributed K-means clustering algorithm was compared against the distributed K-means clustering algorithm and the normalization-based centralized K-means clustering algorithm. The quality of clustering was also compared across three normalization procedures, min-max, z-score and decimal scaling, for the proposed distributed clustering algorithm. The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure. Alshalabi et al. (2006) designed an experiment to test the effect of different normalization methods on accuracy and simplicity. The experimental results suggested choosing z-score normalization as the method that gives much better accuracy.

Removal of the weaker principal components: The transformation of the data set onto the new principal component axes yields the same number of PCs as the number of initial features. For many data sets, however, the first few PCs account for most of the variance, so the remaining ones can be eliminated with minimal loss of information.

MATERIALS AND METHODS

Let Y = {X_1, X_2, ..., X_n} denote the d-dimensional raw data set. Then the data matrix is an n x d matrix given by:

$$[X_1, X_2, \ldots, X_n]^{\top} = \begin{pmatrix} a_{11} & \cdots & a_{1d} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nd} \end{pmatrix} \qquad (1)$$

The z-score is a form of standardization used for transforming normal variates to standard score form. Given a set of raw data Y, the z-score standardization formula is defined as:

$$Z(x_{ij}) = \frac{x_{ij} - \bar{x}_j}{\sigma_j} \qquad (2)$$

where $\bar{x}_j$ and $\sigma_j$ are the sample mean and standard deviation of the j-th attribute, respectively. The transformed variable has a mean of 0 and a variance of 1; the location and scale information of the original variable is lost (Jain and Dubes, 1988). One important restriction of the z-score standardization is that it must be applied as a global standardization and not as a within-cluster standardization (Milligan and Cooper, 1988).

Principal component analysis: Let $v = (v_1, v_2, \ldots, v_d)^{\top}$ be a vector of d random variables, where $\top$ denotes transposition. The first step is to find a linear function $a_1^{\top} v$ of the elements of v that maximizes the variance, where $a_1$ is the d-dimensional vector $(a_{11}, a_{12}, \ldots, a_{1d})^{\top}$, so that:

$$a_1^{\top} v = \sum_{i=1}^{d} a_{1i} v_i \qquad (3)$$

After finding $a_1^{\top} v, a_2^{\top} v, \ldots, a_{j-1}^{\top} v$, we look for a linear function $a_j^{\top} v$ that is uncorrelated with $a_1^{\top} v, \ldots, a_{j-1}^{\top} v$ and has maximum variance; after d steps we will have found d such linear functions. The j-th derived variable $a_j^{\top} v$ is the j-th PC. In general, most of the variation in v will be accounted for by the first few PCs. To find the form of the PCs, we need the covariance matrix $\Sigma$ of v. In most realistic cases $\Sigma$ is unknown and is replaced by a sample covariance matrix. For $j = 1, 2, \ldots, d$, it can be shown that the j-th PC is $z_j = a_j^{\top} v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to the j-th largest eigenvalue $\lambda_j$. In the first step, $z_1 = a_1^{\top} v$ can be found by solving the following optimization problem: maximize $\mathrm{var}(a_1^{\top} v)$ subject to $a_1^{\top} a_1 = 1$, where the variance is computed as:

$$\mathrm{var}(a_1^{\top} v) = a_1^{\top} \Sigma a_1$$

To solve this optimization problem, the technique of Lagrange multipliers can be used. Let $\lambda$ be a Lagrange multiplier. We want to maximize:

$$a_1^{\top} \Sigma a_1 - \lambda (a_1^{\top} a_1 - 1) \qquad (4)$$

Differentiating Eq. (4) with respect to $a_1$, we have $\Sigma a_1 - \lambda a_1 = 0$, or:

$$(\Sigma - \lambda I_d)\, a_1 = 0$$

where $I_d$ is the $d \times d$ identity matrix. Thus $\lambda$ is an eigenvalue of $\Sigma$ and $a_1$ is the corresponding eigenvector. Since $a_1^{\top} \Sigma a_1 = a_1^{\top} \lambda a_1 = \lambda$, $a_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$. In fact, it can be shown that the j-th PC is $a_j^{\top} v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to its j-th largest eigenvalue $\lambda_j$ (Jolliffe, 2002).

Singular value decomposition: Let $D = \{x_1, x_2, \ldots, x_n\}$ be a numerical data set in a d-dimensional space. Then D can be represented by an $n \times d$ matrix $X = (x_{ij})_{n \times d}$, where $x_{ij}$ is the j-th component value of $x_i$. Let $\mu = (\mu_1, \ldots, \mu_d)$ be the row vector of column means of X:

$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad j = 1, 2, \ldots, d$$

and let e be a column vector of length n with all elements equal to one. The SVD expresses $X - e\mu$ as:

$$X - e\mu = U S V^{\top} \qquad (5)$$

where U is a column-orthonormal matrix, i.e., $U^{\top} U = I$ is an identity matrix, S is an $n \times d$ diagonal matrix containing the singular values and V is a $d \times d$ unitary matrix, i.e., $V^{H} V = I$, where $V^{H}$ is the conjugate transpose of V. The columns of V are the eigenvectors of the covariance matrix C of X; precisely:

$$C = \frac{1}{n} X^{\top} X - \mu^{\top} \mu = V \Lambda V^{\top} \qquad (6)$$

Since C is a $d \times d$ positive semi-definite matrix, it has d nonnegative eigenvalues and d orthonormal eigenvectors. Without loss of generality, let the eigenvalues of C be ordered in decreasing order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$. Let $\sigma_j$ ($j = 1, 2, \ldots, d$) be the standard deviation of the j-th column of X, i.e.:

$$\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2$$

The trace of C is invariant under rotation, i.e.:

$$\mathrm{tr}(C) = \sum_{j=1}^{d} \sigma_j^2 = \sum_{j=1}^{d} \lambda_j$$

Noting that $e^{\top} X = n\mu$ and $e^{\top} e = n$, from Eq. (5) and (6) we have:

$$V S^{\top} S V^{\top} = V S^{\top} U^{\top} U S V^{\top} = (X - e\mu)^{\top} (X - e\mu) = X^{\top} X - \mu^{\top} e^{\top} X - X^{\top} e \mu + \mu^{\top} e^{\top} e \mu = X^{\top} X - n \mu^{\top} \mu = n V \Lambda V^{\top} \qquad (7)$$

Since V is an orthonormal matrix, it follows from Eq. (7) that the singular values are related to the eigenvalues by:

$$S_j^2 = n \lambda_j, \qquad j = 1, 2, \ldots, d$$

The eigenvectors constitute the PCs of X, and uncorrelated features are obtained by the transformation $Y = (X - e\mu) V$. PCA selects the features with the highest eigenvalues.

K-means clustering: Given a series of observations $(x_1, x_2, \ldots, x_n)$, in which each observation is a d-dimensional real vector, k-means clustering is designed to partition the n observations into k sets ($k \leq n$), $S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the Within-Cluster Sum of Squares (WCSS):

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2 \qquad (8)$$

where $\mu_i$ is the mean of the items within $S_i$.
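Putting the pieces of this section together, the proposed preprocessing chain is: z-score standardization (Eq. (2)), PCA of the mean-centred matrix via the SVD (Eq. (5)), removal of the weaker components whose variance falls below the mean component variance, and K-means on the reduced data under the WCSS objective (Eq. (8)). The sketch below only illustrates that chain under stated assumptions (a random stand-in for the infectious-disease data, k = 2, and scikit-learn's KMeans for the clustering step); it is not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; any basic K-means would do

def zscore(X):
    """Eq. (2): global z-score standardization, column by column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_reduce(X):
    """PCA via SVD of the mean-centred matrix (Eq. (5)); keep only the
    principal components whose variance exceeds the mean component variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    comp_var = S**2 / len(X)            # eigenvalues lambda_j of C, since S_j^2 = n*lambda_j
    keep = comp_var > comp_var.mean()   # drop the weaker principal components
    return Xc @ Vt[keep].T

# Assumed 20 x 7 matrix standing in for the data of Table 1.
raw = np.random.default_rng(0).normal(size=(20, 7))
reduced = pca_reduce(zscore(raw))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(reduced)
print(reduced.shape, km.inertia_)       # inertia_ is the WCSS of Eq. (8)
```

In the experiments reported next, this combination (standardization followed by PCA/SVD reduction) is the variant that keeps all points inside the cluster formation with the smallest error sum of squares.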

RESULTS AND DISCUSSION

The presence of noise in a large amount of data is easily filtered out by the normalization and PCA/SVD preprocessing stages, especially since such a treatment is specifically designed to denoise large numerical values while preserving edges. In this section, we examine and evaluate the following approaches: conventional k-means on the original dataset, k-means on the normalized dataset, k-means on the PCA/SVD dataset and k-means on the normalized plus PCA/SVD dataset, viewed as responses to the objective behind the k-means technique. The quality of a particular clustering is evaluated with the error sum of squares over the intra-cluster range, that is, the distance between the data vectors in a group and the centroid of that group; the smaller the sum of these differences, the better the accuracy of the clustering.

Figure 1 presents the result of the basic K-means algorithm using the original dataset, which has 20 data objects and 7 attributes as shown in Table 1. Two points attached to cluster 1 and four points attached to cluster 2 fall outside the cluster formation. The number of PCs found is in fact the same as the number of initial features. To remove the weaker components from the PC set, we worked out the corresponding variance, its percentage and the cumulative percentage, shown in Tables 2 and 6. We then retained the PCs with variances greater than the mean variance and disregarded the others; the reduced PCs are shown in Tables 3 and 7. Table 2 presents the variances, the percentages of the variances and the cumulative percentages corresponding to the principal components. Figure 2 shows the Pareto plot of the variance percentages against the principal components for the original dataset of 20 data objects and 7 variables. The transformation matrix built from the reduced PCs is applied to the initial dataset to generate a reduced approximate dataset, which is used for the remaining data exploration; the reduced dataset containing 4 attributes is shown in Table 4.

Figure 3 presents the result of the K-means algorithm applying principal component analysis to the original dataset. The reduced dataset contains 20 data objects and 4 attributes as shown in Table 4, and all the points attached to clusters 1 and 2 are within the cluster formation. Figure 4 presents the result of the K-means algorithm using the dataset rescaled with the z-score standardization method, having 20 data objects and 7 attributes as shown in Table 5. All the points attached to clusters 1 and 2 are within the cluster formation. Table 6 presents the variances, the percentages of the variances and the cumulative percentages corresponding to the principal components of the standardized data. The transformation matrix built from the reduced PCs (Table 7) is applied to the standardized dataset to generate a reduced approximate dataset, used for the remaining data exploration; the reduced dataset containing 4 attributes is shown in Table 8. Figure 5 presents the result of the K-means algorithm applying standardization and principal component analysis to the original dataset. The reduced dataset contains 20 data objects and 4 attributes as shown in Table 8, and all the points attached to clusters 1 and 2 are within the cluster formation with the error sum of squares equal to 5.6.

Fig. 1: Basic K-means algorithm
Table 1: The original dataset with 20 data objects and 7 attributes (X1-X7)
Table 2: The variances, percentages of variances and cumulative percentages of variances (PC1-PC7)
Table 3: Reduced PCs with variances greater than the mean variance (PC1-PC4)
Table 4: The reduced data set with 20 data objects and 4 attributes (X1-X4)
Table 5: The standardized dataset with 20 data objects and 7 attributes (X1-X7)

Table 6: The variances, percentages of variances and cumulative percentages of variances for the standardized data (PC1-PC7)
Table 7: Reduced PCs with variances greater than the mean variance (PC1-PC4)
Table 8: The reduced dataset with 20 data objects and 4 attributes (X1-X4)
Fig. 2: Pareto plot of variances and principal components
Fig. 3: K-means with PCA/SVD
Fig. 4: K-means algorithm with standardized dataset
Fig. 5: K-means with rescaled and PCA/SVD datasets
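The per-component variances, their percentages and the cumulative percentages summarized in Tables 2 and 6, and drawn as the Pareto chart in Fig. 2, can be tabulated along the following lines. This is an illustrative sketch on an assumed standardized matrix, not a reproduction of the paper's numbers.

```python
import numpy as np

def variance_table(X):
    """Per-PC variance, percentage of variance and cumulative percentage,
    computed from the SVD of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    _, S, _ = np.linalg.svd(Xc, full_matrices=False)
    var = S**2 / len(X)
    pct = 100.0 * var / var.sum()
    return np.column_stack([var, pct, np.cumsum(pct)])

# Example on an assumed standardized 20 x 7 matrix.
Z = np.random.default_rng(2).normal(size=(20, 7))
for i, (v, p, c) in enumerate(variance_table(Z), start=1):
    print(f"PC{i}: variance={v:.3f}  percent={p:.1f}%  cumulative={c:.1f}%")
```

Components whose variance falls below the mean of the first column are the ones discarded in Tables 3 and 7.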

CONCLUSION

We have proposed a novel hybrid numerical algorithm that draws on the speed and simplicity of k-means procedures. The results of the cluster analysis shown in Fig. 1 to 5, obtained with the basic k-means algorithm on the original data set, the k-means clustering algorithm applying principal component analysis to the original dataset, the k-means clustering algorithm on the standardized data set and the proposed k-means clustering algorithm on the reduced data set, respectively, show the continuity solutions of the k-means clustering technique and guarantee a reduction in clustering time as a result of the smaller number of features. Comparing the results of the analysis obtained by the standard k-means algorithm with those of the proposed k-means algorithm, the sums of squares error are , 43.4, and 5.6, respectively. This also shows the reliability as well as the efficiency of the presented k-means technique.

REFERENCES

Alshalabi, L., Z. Shaaban and B. Kasasbeh, 2006. Data mining: A preprocessing engine. J. Comput. Sci., 2(9).
Chris, D. and H. Xiaofeng, 2006. K-means clustering via principal component analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
Ding, C. and X.X. He, 2004. K-means clustering via principal component analysis. Proceedings of the 21st International Conference on Machine Learning. ACM Press, New York.
Eckart, C. and G. Young, 1936. The approximation of one matrix by another of lower rank. Psychometrika, 1: 211-218.
Hartigan, J. and M. Wong, 1979. A K-means clustering algorithm. Appl. Stat., 28: 100-108.
Hastie, T., R. Tibshirani and J. Friedman, 2001. The Elements of Statistical Learning. Springer-Verlag, New York.
Jain, A. and R. Dubes, 1988. Algorithms for Clustering Data. Prentice Hall, New York.
Jolliffe, I., 2002. Principal Component Analysis. 2nd Edn., Springer Series in Statistics, Springer-Verlag, New York.
Karthikeyani, V.N. and K. Thangavel, 2009. Impact of normalization in distributed k-means clustering. Int. J. Soft Comput., 4(4).
Milligan, G. and M. Cooper, 1988. A study of standardization of variables in cluster analysis. J. Classif., 5: 181-204.
Ng, A., M. Jordan and Y. Weiss, 2001. On spectral clustering: Analysis and an algorithm. Proceedings of Neural Information Processing Systems (NIPS 2001).
Valarmathie, P., M. Srinath and K. Dinakaran, 2009. An increased performance of clustering high dimensional data through dimensionality reduction technique. J. Theor. Appl. Inform. Technol., 3.
Yan, J., B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi and Z. Chen, 2006. Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing. IEEE Trans. Knowl. Data Eng., 18(3).
Zha, H., C. Ding, M. Gu, X. He and H. Simon, 2001. Spectral relaxation for K-means clustering. Neural Inf. Process. Syst., 14.
