Performance of Different Algorithms on Clustering Molecular Dynamics Trajectories

Chenchen Song

Abstract

Different types of clustering algorithms are applied to molecular dynamics trajectories to gain insight into the possible conformations of molecules. The algorithms covered include k-means, the multivariate Gaussian method, single-linkage, centroid-linkage, average-linkage, and complete-linkage. Root-mean-square deviation (RMSD) is used as the metric. The performance of the algorithms is analyzed and compared based on the Davies-Bouldin index (DBI) and the pseudo-F statistic (psf).

1. Introduction

Molecular dynamics simulation methods produce trajectories of molecular configuration snapshots as a function of time. The time scale of a chemical process is ns, while the time scale of the internal degrees of freedom of a molecule is fs, so a sufficient simulation trajectory needs to contain at least O(10^6) configurations, the ratio of the two time scales. For such large amounts of data, machine learning algorithms become helpful for extracting useful information from the data sets. One type of information we hope to get is the conformational substates of a molecule, which is a clustering problem. Using clustering algorithms to help analyze molecular dynamics trajectories is not a new idea and dates back to 1993. Since then, a number of papers applying different types of clustering algorithms to MD have been published. Thus, in this project, I focus on comparing the performance of different clustering algorithms.

2. Method

Details of the methods have been discussed in the milestone report. Here we only give a brief review.

(1) Similarity Metric

Instead of the ordinary Euclidean norm, RMSD is used as the metric: it first aligns the two molecules as closely as possible and then calculates the Euclidean distance. Using RMSD eliminates the effect of the translational and rotational motion of the molecule.

(2) Algorithm

k-means and the multivariate Gaussian method have been introduced in class. The linkage methods differ in how the distance between clusters is defined. Single (edge)-linkage uses the shortest inter-cluster point-to-point distance.
Centroid-linkage uses the distance between cluster centroids. Average-linkage uses the average distance between individual points of the two clusters. Complete-linkage uses the maximal point-to-point distance. Because of their different definitions of the critical distance, we will see later that they can sometimes behave very differently.

Single-linkage, complete-linkage, and average-linkage only need to calculate a metric matrix of size N^2 at the very beginning. The other methods need to update the positions of the centroids and the distances from points to centroids during each iteration. Because our metric is not a simple Euclidean distance, the latter methods are more time-consuming.

(3) Performance Metric

DBI is defined as

\[
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{d_i + d_j}{d_{i,j}} \tag{1.1}
\]

where $k$ is the number of clusters, $d_{i,j}$ is the distance between the centroids of clusters $i$ and $j$, and $d_i$ is the average distance between the points in cluster $i$ and the centroid of that cluster. psf is defined as

\[
\mathrm{psf} = \frac{SS_B / (k - 1)}{SS_W / (N - k)}, \qquad
SS_B = \sum_{i=1}^{k} n_i \lVert m_i - \bar{m} \rVert^2, \qquad
SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2 \tag{1.2}
\]

where $m_i$ is the centroid of cluster $c_i$, $n_i$ is the number of points in that cluster, $N$ is the total number of points, and $\bar{m}$ is the overall mean of the data.

Usually, a lower DBI and a higher psf reflect compact and well-separated clusters. But one should be careful when using these indices. For example, DBI is affected by the cluster count, so DBI values should only be compared when the numbers of clusters are similar.

3. Results and Analysis

Molecular dynamics trajectories are generated with the TeraChem package. RMSD calculation and molecular alignment are performed with the VMD package. If a method only requires the N^2 metric matrix, as discussed in the previous section, the clustering is performed with the MATLAB statistics toolbox. Otherwise, the clustering is performed with Cluster 3.0, where we have added our metric to the clustering library. DBI and psf are calculated with MATLAB.

3.1 Clustering points in a 2D plane from a uniform distribution

First, the different clustering methods are applied to a hundred points in a 2D plane sampled from a uniform distribution. The points don't have any internal
structures, so the clustering results only reflect the properties of the different algorithms.

[Figure panels for Figures 1 and 2: Original, Linkage-complete, Linkage-single, Linkage-average, Linkage-centroid, k-means]

Figure 1. Clustering results for points in a 2D plane drawn from a uniform distribution.

From Fig. 1, the following properties can be observed.
(1) Most of the methods tend to partition the points naturally and equally into four blocks. This is especially true for k-means.
(2) Single-linkage (or edge-linkage) classifies almost all the points into the same cluster. This might be because the critical distance of single-linkage is defined by the nearest points between two clusters, so single-linkage may be very sensitive to cases where points are close to each other and no clear border exists.
(3) The multivariate Gaussian method is the only one that produces clusters with very different shapes. This might be because it clusters based on probabilistic assumptions, while all the other methods are based on geometric structure.

3.2 Clustering points in a 2D plane from overlapped Gaussian distributions

In the second step, the clustering methods are applied to points sampled from three independent Gaussian distributions. The means and covariances of the 2D Gaussians are tuned so that the three distributions overlap. The reason to test on overlapping points is that in MD simulations the trajectories are generated sequentially, so adjacent configurations are usually very similar.

Figure 2. Clustering results for points in a 2D plane drawn from three equally sized, overlapping Gaussian distributions.

Compared to the original figure, the following properties can be noticed.
(1) The success of k-means may again be due to its intrinsic tendency to produce clusters of the same size and shape.
(2) The multivariate Gaussian method can give a pretty good result if the underlying probability distribution is very close to Gaussian.
(3) Centroid-linkage doesn't work well, perhaps because the centroids are ill-defined.
(4) Single-linkage again seems to fail in this situation, where the points have no clear borders.

3.3 Clustering artificial MD data: four equally sized clusters

Figure 3.
Illustrations of typical cyclohexane conformations.

Cyclohexane is used as a test model. The four clusters are chair, twist-boat, boat, and half-chair. To generate each cluster, taking the chair as an example, we start from a
chair configuration, set the time step to 0.01 fs, keep the temperature very low, and propagate for only 100 steps. Repeating this procedure several times guarantees a cluster of typical chair configurations. Because the different clusters are generated independently and artificially, there is a very clear border between them. The figure shows the expected idealized behavior, including minima in DBI and maxima in psf when the number of clusters equals the optimal value of 4.

Figure 4. DBI and psf for clustering artificial MD trajectories of cyclohexane. The optimal clustering is four equally sized clusters. The x-axis ranges from 3 to 8. [Panels: DBI, psf; curves: L-single, L-centroid, L-complete, L-average]

By checking the assignments, it is found that for this test, with equal cluster sizes and clear borders, most of the algorithms perform very well except the multivariate Gaussian method. This might be because a Gaussian distribution is not a good assumption for how the configurations are distributed around the centroid within each cluster.

3.4 Clustering artificial MD data: five differently sized clusters

In real MD simulations, cluster sizes can be very different, because the lower the energy of a conformation, the higher the probability that it appears during the simulation. To mimic this property, in this test we build an artificial MD data set by combining 2 planar structures, 15 half-chair structures, 30 boat structures, 50 twist-boat structures, and 100 chairs. This order is also consistent with the energy order. This set is much more difficult to cluster, as it has both very small clusters with small variance and relatively large clusters with large variance.

Figure 5. DBI and psf for clustering artificial MD trajectories of cyclohexane. The optimal clustering is five differently sized clusters. The x-axis ranges from 3 to 8. [Panels: DBI, psf; curves: L-single, L-centroid, L-complete, L-average]

Table 1. Size of clusters for different algorithms with different numbers of clusters.
Method                  #clusters   Cluster sizes
k-means                 4           46, 47, 50, 54
                        5           13, 41, 46, 47, 50
                        6           10, 15, 20, 25, 45, 82
L-single                5           2, 15, 30, 50, 100
                        6           2, 15, (2, 28), 50, 100
L-average               5           2, 15, 30, 50, 100
                        6           2, 15, (12, 18), 50, 100
L-centroid              5           2, 15, 30, 50, 100
                        6           2, 15, (12, 18), 50, 100
L-complete              5           2, 15, 30, 50, 100
                        6           2, 15, (5, 25), 50, 100
Multivariate Gaussian   4           34, 66, 108, 192
                        5           21, 32, 47, 100, 200
                        6           43, 49, 51, 57, 100, 100

It can be noticed that when configurations are not uniformly separated, the metrics are less consistent and do not necessarily reach their optimal values at the optimal number of clusters.
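To make the two indices of Eqs. (1.1) and (1.2) concrete, the following is a minimal sketch of how they can be evaluated for a given assignment. It is not the MATLAB code used in this work: it assumes plain Euclidean distance on 2D points instead of RMSD, and the function names and the toy data are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """DBI = (1/k) * sum_i max_{j != i} (d_i + d_j) / d_ij, as in Eq. (1.1)."""
    cents = [centroid(c) for c in clusters]
    # d_i: average distance from the points of cluster i to its centroid
    spread = [sum(math.dist(p, m) for p in c) / len(c)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

def pseudo_f(clusters):
    """psf = (SS_B / (k - 1)) / (SS_W / (N - k)), as in Eq. (1.2)."""
    all_points = [p for c in clusters for p in c]
    overall = centroid(all_points)
    cents = [centroid(c) for c in clusters]
    ss_b = sum(len(c) * math.dist(m, overall) ** 2
               for c, m in zip(clusters, cents))
    ss_w = sum(math.dist(p, m) ** 2
               for c, m in zip(clusters, cents) for p in c)
    k, n = len(clusters), len(all_points)
    return (ss_b / (k - 1)) / (ss_w / (n - k))

# Toy data (assumption): two well-separated squares of four points each,
# partitioned once correctly and once by mixing the two squares.
good = [[(0, 0), (0, 1), (1, 0), (1, 1)],
        [(10, 10), (10, 11), (11, 10), (11, 11)]]
bad = [[(0, 0), (0, 1), (10, 10), (10, 11)],
       [(1, 0), (1, 1), (11, 10), (11, 11)]]

for name, part in (("good", good), ("bad", bad)):
    print(name, f"DBI={davies_bouldin(part):.3f}", f"psf={pseudo_f(part):.3f}")
```

For the correct partition the sketch gives a DBI of about 0.1 and a psf of about 600, while the mixed partition gives a much larger DBI and a psf well below 1, which is the sense in which a lower DBI and a higher psf indicate compact, well-separated clusters.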
[Figure panels: Linkage-centroid, Linkage-complete]

From the table and the figure, the following properties can be observed.
(1) For this test, where clear borders exist but cluster sizes are very different, all the linkage methods are able to recover the correct assignment at the optimal cluster number of 5. When the cluster number increases, the different methods split clusters in different ways because of their different distance definitions.
(2) k-means exhibits a strong tendency to cluster points into equal sizes, and thus doesn't perform well for this test.
(3) The multivariate Gaussian method also gives a pretty bad result, possibly for the same reason as in the previous tests.

3.5 Clustering real MD data

Unlike the artificial trajectories, in which configurations are clearly separated, configurations from adjacent steps in real MD simulations are very similar, which makes clustering more difficult. Cyclohexane is again used as the test example. Multiple trajectories are generated starting from the half-chair structure. After clustering each trajectory individually, two distinct typical types of behavior are found, as shown below.

Figure 6. Two distinct types of behavior and clustering results when clustering real MD trajectories of cyclohexane starting from the half-chair conformation. (0) Chair (2) Upward twist-boat (3) Boat (4) Downward twist-boat

Figure 7. Centroids for each cluster.

To explain this behavior, the energy profile of continuously changing cyclohexane is checked.

Figure 8. Cyclohexane energy profile.

It can be seen from the figure that, starting from the half-chair, the molecule can either go to the left, into the chair region (if the flat part flips downwards), or go to the right, into the boat region (if the flat part flips upwards). Once it steps to the left, it is separated from the other part by a high energy barrier. The same holds for stepping into the right region.

All methods give almost the same clustering when the cluster number is three. If we set the cluster number to two, the methods give distinct results.
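How the linkage definitions of Section 2 can lead to different partitions of the same precomputed distance matrix may be illustrated with a minimal sketch of agglomerative clustering. This is not the MATLAB or Cluster 3.0 implementation used in this work; the function names are hypothetical, and the toy distance matrix is built from six points on a line rather than from RMSD values.

```python
def linkage_distance(dist, a, b, method):
    """Distance between clusters a and b, given as lists of point indices."""
    pair = [dist[i][j] for i in a for j in b]
    if method == "single":      # shortest point-to-point distance
        return min(pair)
    if method == "complete":    # maximal point-to-point distance
        return max(pair)
    if method == "average":     # average point-to-point distance
        return sum(pair) / len(pair)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(dist, n_clusters, method):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (linkage_distance(dist, clusters[a], clusters[b], method), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Toy data (assumption): points on a line with slowly growing gaps,
# so that no clear border exists between neighbours.
x = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]
dist = [[abs(p - q) for q in x] for p in x]   # full N x N metric matrix

for method in ("single", "complete", "average"):
    print(method, agglomerate(dist, 2, method))
# single   [[0, 1, 2, 3, 4], [5]]
# complete [[0, 1, 2, 3], [4, 5]]
# average  [[0, 1, 2, 3], [4, 5]]
```

In this toy case single-linkage chains neighbouring points together and leaves one point on its own, while complete- and average-linkage give a more balanced split. Only the metric matrix is needed as input, which is why these three methods avoid recomputing centroids.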
[Figure panels: RMSD, Linkage-single, Linkage-average]

Figure 8. Different clustering results from different algorithms when the number of clusters is 2.

It can be seen from the figure that:
(1) [...] and the linkage-complete method merge the upward and downward twist-boat. These two methods put more emphasis on the difference between boat and twist-boat.
(2) Linkage-centroid and linkage-average merge the boat and the upward twist-boat. They put more weight on the difference between reversed configurations.
(3) Linkage-single doesn't perform very well for this test, perhaps because it cannot handle situations where a clear border is absent.

3.6 Clustering protein trajectories

Finally, the clustering method is applied to protein trajectories. The complete-linkage method is used, because it performs well in the previous tests and only requires an N^2 metric matrix, without the need to recompute centroids during each iteration. The optimal number of clusters is picked by the psf index. Considering the available computing power, we apply a pulling force to the protein so that the structure changes more rapidly. Only the coordinates of the backbone atoms (carbon, nitrogen, oxygen) are passed to the clustering program. Results are shown below.

Figure 9. Clustering results for the protein. [x-axis: Time]

The three clusters clearly show the transformation from folded, to half-unfolded, to completely unfolded under the pulling force we apply to the system.

Figure 10. Centroids for the three clusters.

Summary

In this project, we use several tests to compare and analyze the performance of different clustering methods under various conditions. The following properties can be summarized from our observations:
(1) k-means tends to produce blocky clusters of similar size. Thus, when cluster sizes are similar, k-means performs very well, but it usually fails for clusters with distinct sizes.
(2) Single-linkage is very sensitive to closely spaced points. As a result, it may be fragile to the presence or absence of a single point.
(3) The multivariate Gaussian method doesn't work well for most of the tests, perhaps because the assumption of a Gaussian distribution is not good for our problem.
(4) Centroid-, average-, and complete-linkage give quite consistently good results throughout the tests. The latter two don't require updating centroids during each iteration, and thus may be good candidates for clustering molecular dynamics trajectories.

References

1. Karpen, M. E.; Tobias, D. J.; Brooks, C. L. Biochemistry 1993, 32, 412-420.
2. Kabsch, W. Acta Crystallogr. 1976, 32, 922-923.
3. Ramanathan, A.; Yoo, J. O.; Langmead, C. J. J. Chem. Theory Comput. 2011, 7, 778-789.
4. Shao, J.; et al. J. Chem. Theory Comput. 2007, 3, 2312-2334.