Performance of Different Algorithms on Clustering Molecular Dynamics Trajectories

Chenchen Song

Abstract

Different types of clustering algorithms are applied to molecular dynamics trajectories to gain insight into the possible conformations of molecules. The algorithms covered include k-means, multivariate Gaussian, single-linkage, centroid-linkage, average-linkage, and complete-linkage. The root-mean-square deviation (RMSD) is used as the metric. The performance of the algorithms is analyzed and compared based on the Davies-Bouldin index (DBI) and the pseudo-F statistic (pSF).

1. Introduction

Molecular dynamics (MD) simulation methods produce trajectories of molecular configuration snapshots as a function of time. The time scale of a chemical process is nanoseconds (ns), while the time scale of the internal degrees of freedom of a molecule is femtoseconds (fs), so a sufficient simulation trajectory needs to contain at least O(10^6) configurations, the ratio of the two time scales. For such large amounts of data, machine learning algorithms become helpful for extracting useful information from the data sets. One type of information we hope to obtain is the conformational substates of a molecule, which is a clustering problem.

Using clustering algorithms to help analyze molecular dynamics trajectories is not a new idea and dates back to 1993. Since then, a number of papers applying different types of clustering algorithms to MD have been published. In this project, I therefore focus on comparing the performance of different clustering algorithms.

2. Method

Details of the methods have been discussed carefully in the milestone report. Here we only give a brief review.

(1) Similarity Metric

Instead of the ordinary Euclidean norm, RMSD is used as the metric. RMSD first aligns two molecules as closely as possible and then calculates the Euclidean distance. By using RMSD, we eliminate the effect of the translational and rotational motion of the molecule.

(2) Algorithm

K-means and the multivariate Gaussian method have been introduced in class. The linkage methods differ in how the distance between clusters is defined. Single-linkage (edge-linkage) uses the shortest inter-cluster point-to-point distance.
Centroid-linkage uses the distance between cluster centroids. Average-linkage uses the average distance between individual points of the two clusters. Complete-linkage uses the maximal point-to-point distance. Because their definitions of the critical distance differ, we will see later that they can sometimes behave very differently.

Single-linkage, complete-linkage, and average-linkage only need a metric matrix of size N^2, calculated once at the very beginning. The other methods need to update the positions of the centroids and the distances from the points to the centroids during each iteration. Because our metric is not a simple Euclidean distance, these latter methods are more time-consuming.

(3) Performance Metric

DBI is defined as

$$
DB = \frac{1}{k} \sum_{i=1}^{k} D_i, \qquad D_i = \max_{j \neq i} D_{i,j}, \qquad D_{i,j} = \frac{d_i + d_j}{d_{i,j}} \tag{1.1}
$$

where k is the number of clusters, d_{i,j} is the distance between the centroids of clusters i and j, and d_i is the average distance between the points in cluster i and the centroid of that cluster.

pSF is defined as

$$
pSF = \frac{SS_B / (k - 1)}{SS_W / (N - k)}, \qquad SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2, \qquad SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2 \tag{1.2}
$$

where N is the number of data points, n_i is the number of points in cluster c_i, m_i is the centroid of cluster c_i, and m is the overall mean of the data.

Usually, a lower DBI and a higher pSF reflect compact and well-separated clusters. But one should be careful when using these indices. For example, DBI is affected by the cluster count, so DBI values should only be compared when the numbers of clusters are similar.

3. Results and Analysis

Molecular dynamics trajectories are generated by the TeraChem package. RMSD calculation and molecular alignment are performed using the VMD package. If a method only requires the N^2 metric matrix, as discussed in the previous section, the clustering is performed with the MATLAB statistics toolbox. Otherwise, the clustering is performed with Cluster 3.0, where we have added our metric to the clustering library. DBI and pSF are calculated in MATLAB.

3.1 Clustering points in 2D-plane from uniform distribution

First, the different clustering methods are applied to a hundred points in the 2D plane sampled from a uniform distribution. The points do not have any internal structure, so the clustering results reflect only the properties of the different algorithms.

Figure 1. Clustering results for points in 2D-plane drawn from uniform distribution (panels: Original, Link-complete, Link-single, Link-average, Link-centroid, k-means).

From Fig. 1, the following properties can be observed.
(1) Most of the methods tend to partition the points naturally and equally into four blocks. This is especially true for k-means.
(2) Single-linkage (or edge-linkage) classifies almost all the points into the same cluster. This might be because the critical distance of single-linkage is defined by the nearest points of two clusters, so single-linkage may be very sensitive to cases where points are close to each other and no clear border exists.
(3) The multivariate Gaussian method is the only one that produces clusters with very different shapes. This might be because it clusters based on probabilistic assumptions, while all the other methods are based on geometric structure.

3.2 Clustering points on 2D-plane from overlapped Gaussian distribution

In the second step, the clustering methods are applied to points sampled from three independent Gaussian distributions. The means and covariances of the 2D Gaussians are tuned so that the three distributions overlap. The reason for testing on overlapping points is that in MD simulations the trajectories are generated sequentially, so adjacent configurations are usually very similar.

Figure 2. Clustering results for points in 2D-plane drawn from three equally sized overlapped Gaussian distributions.

Compared to the original figure, the following properties can be noticed.
(1) The success of k-means may again be due to its intrinsic tendency to produce clusters of the same size and shape.
(2) The multivariate Gaussian method can give a fairly good result if the underlying probability distribution is very close to Gaussian.
(3) Centroid-linkage does not work well, perhaps because the centroids are ill-defined.
(4) Single-linkage again seems to fail in this situation, where the points have no clear borders.

3.3 Clustering artificial MD data: Four equally sized clusters

Cyclohexane is used as a test model. The four clusters are chair, twisted boat, boat, and half-chair.

Figure 3. Illustrations of typical cyclohexane conformations.

To generate each cluster, taking chair as an example, we start from a chair configuration, set the time step to 0.01 fs, set the temperature very low, and run only 100 steps. Repeating this procedure several times guarantees a cluster of typical chair configurations. Because the different clusters are generated independently and artificially, there is a very clear border between them. The figure shows the expected idealized behavior, including minima in DBI and maxima in pSF when the number of clusters equals the optimal value of 4.

Figure 4. DBI and pSF for clustering artificial MD trajectories of cyclohexane (curves: L-single, L-average, L-centroid, L-complete). Optimal clusters are four equally sized clusters. X-axis ranges from 3 to 8.

By checking the assignments, it is found that for this test, with equal cluster sizes and clear borders, most of the algorithms perform very well except the multivariate Gaussian method. This might be because a Gaussian distribution is not a good assumption for how configurations are distributed around the centroid within each cluster.

3.4 Clustering artificial MD data: Five differently sized clusters

In real MD simulations, the sizes of clusters can be very different, because the lower the energy of a conformation, the higher the probability that it appears during the simulation. To mimic this property, in this test we build an artificial MD data set by combining 2 planar structures, 15 half-chair structures, 30 boat structures, 50 twist-boat structures, and 100 chair structures. This order is consistent with the energy order. This set is much more difficult to cluster, as it has both very small clusters with small variance and relatively large clusters with large variance.

Figure 5. DBI and pSF for clustering artificial MD trajectories of cyclohexane (curves: L-single, L-average, L-centroid, L-complete). Optimal clusters are five distinctly sized clusters. X-axis ranges from 3 to 8.

Table 1. Sizes of clusters for different algorithms with different numbers of clusters.
Method                  #clusters  Cluster sizes
K-means                 4          46, 47, 50, 54
K-means                 5          13, 41, 46, 47, 50
K-means                 6          10, 15, 20, 25, 45, 82
L-single                5          2, 15, 30, 50, 100
L-single                6          2, 15, (2, 28), 50, 100
L-average               5          2, 15, 30, 50, 100
L-average               6          2, 15, (12, 18), 50, 100
L-centroid              5          2, 15, 30, 50, 100
L-centroid              6          2, 15, (12, 18), 50, 100
L-complete              5          2, 15, 30, 50, 100
L-complete              6          2, 15, (5, 25), 50, 100
Multivariate Gaussian   4          34, 66, 108, 192
Multivariate Gaussian   5          21, 32, 47, 100, 200
Multivariate Gaussian   6          43, 49, 51, 57, 100, 100

It can be noticed that when the configurations are not uniformly separated, the metrics are less consistent and do not necessarily reach their optimal values at the optimal number of clusters.
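As a concrete illustration of the two indices reported in these tests, the following Python sketch computes DBI and pSF for a given partition. It is only a sketch under stated assumptions: plain Euclidean distance on coordinate tuples stands in for RMSD, and the function names (centroid, dist, davies_bouldin, pseudo_f) are hypothetical and not taken from the MATLAB scripts used in this work.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dist(a, b):
    """Plain Euclidean distance (assumed stand-in for the RMSD metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin(clusters):
    """DBI: mean over clusters i of max over j != i of (d_i + d_j) / d_ij."""
    cents = [centroid(c) for c in clusters]
    # d_i: average distance from the members of cluster i to its centroid
    spread = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (spread[i] + spread[j]) / dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


def pseudo_f(clusters):
    """pSF: (SS_B / (k - 1)) / (SS_W / (N - k)); undefined when SS_W is zero."""
    k = len(clusters)
    all_points = [p for c in clusters for p in c]
    n_total = len(all_points)
    overall = centroid(all_points)
    cents = [centroid(c) for c in clusters]
    ss_b = sum(len(c) * dist(m, overall) ** 2 for c, m in zip(clusters, cents))
    ss_w = sum(dist(p, m) ** 2 for c, m in zip(clusters, cents) for p in c)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))


# Two partitions of the same four points: one compact, one stretched.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(davies_bouldin(good), pseudo_f(good))
print(davies_bouldin(bad), pseudo_f(bad))
```

For the compact partition the sketch gives a lower DBI and a higher pSF than for the stretched one, which is the direction of the comparison used throughout this section.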

From the table and figure, the following properties can be observed.
(1) For this test, where clear borders exist but the cluster sizes are very different, all the linkage methods recover the correct assignment at the optimal cluster number of 5. When the cluster number increases, the methods split clusters in different ways because of their different distance definitions.
(2) K-means exhibits a strong tendency to cluster points into groups of equal size, and thus does not perform well in this test.
(3) The multivariate Gaussian method also gives a rather poor result, possibly for the same reason as in the previous test.

3.5 Clustering real MD data

Unlike the artificial trajectories, in which the configurations are clearly separated, configurations from adjacent steps in real MD simulations are very similar, which makes clustering more difficult. Cyclohexane is again used as the test example. Multiple trajectories are generated starting from the half-chair structure. After clustering each trajectory individually, two distinct typical types of behavior are found, shown below.

Figure 6. Two distinct types of behavior and clustering results when clustering real MD trajectories of cyclohexane starting from the half-chair conformation. (0) Chair (2) Upward twist-boat (3) Boat (4) Downward twist-boat

Figure 7. Centroids for each cluster.

To explain this behavior, the energy profile of a continuously changing cyclohexane is examined.

Figure 8. Cyclohexane energy profile.

It can be seen from the figure that, starting from the half-chair, the molecule can go either to the left, into the chair region (if the flat part flips downwards), or to the right, into the boat region (if the flat part flips upwards). Once it steps to the left, it is separated from the other region by a high energy barrier. The same holds for stepping into the right region.

All methods give almost the same clustering when the cluster number is three. If the cluster number is set to two, the methods give distinct results.
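The linkage methods compared in these tests differ only in how the distance between two clusters is computed from point-to-point distances. The following Python sketch is one way to illustrate this, assuming a precomputed symmetric distance matrix (RMSD in this work; absolute differences of made-up 1D coordinates in the example). The functions cluster_distance and agglomerate are hypothetical names for a naive agglomerative procedure, not the MATLAB or Cluster 3.0 routines used here. Centroid-linkage is left out because it needs coordinates rather than the distance matrix alone.

```python
def cluster_distance(dmat, a, b, linkage):
    """Distance between clusters a and b, given as lists of point indices."""
    pair = [dmat[i][j] for i in a for j in b]
    if linkage == "single":
        return min(pair)  # shortest inter-cluster point-to-point distance
    if linkage == "complete":
        return max(pair)  # maximal point-to-point distance
    if linkage == "average":
        return sum(pair) / len(pair)  # mean over all inter-cluster pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dmat, n_clusters, linkage):
    """Repeatedly merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dmat))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(dmat, clusters[x], clusters[y], linkage)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Made-up 1D chain with slowly growing gaps and no clear border.
coords = [0, 10, 21, 33, 46, 60]
dmat = [[abs(a - b) for b in coords] for a in coords]
for method in ("single", "complete", "average"):
    print(method, agglomerate(dmat, 2, method))
```

On this chain, single-linkage absorbs points one at a time and leaves only the last point on its own, while complete- and average-linkage cut the chain into two more balanced groups. This is the same sensitivity of single-linkage to closely spaced points without a clear border that is noted in the tests above.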

Figure 8. Different clustering results from different algorithms when the number of clusters is 2 (panels: Link-single, Link-average, Link-centroid, Link-complete; axis label: RMSD).

It can be seen from the figure that:
(1) [...] and the complete-linkage method merge the upward and downward twist-boat. These two methods put more emphasis on the difference between boat and twist-boat.
(2) Link-centroid and link-average merge the boat and the upward twist-boat. These methods place more weight on the difference between reversed configurations.
(3) Link-single does not perform very well in this test, perhaps because it cannot handle situations where a clear border is absent.

3.6 Clustering protein trajectories

Finally, clustering is applied to protein trajectories. The complete-linkage method is used, because it performs well in the previous tests and only requires an N^2 metric matrix, without the need to compute updated centroids during each iteration. The optimal number of clusters is selected by the pSF index. Given the available computing power, we apply a pulling force to the protein so that the structure changes more rapidly. Only the coordinates of the backbone atoms (carbon, nitrogen, oxygen) are passed to the clustering program. Results are shown below.

Figure 9. Clustering results for the protein (axis label: Time). The three clusters clearly show the transformation from folded, through half-unfolded, to completely unfolded under the pulling force we apply to the system.

Figure 10. Centroids for the three clusters.

Summary

In this project, we use several tests to compare and analyze the performance of different clustering methods under various conditions. The following properties can be summarized from our observations:
(1) K-means tends to produce blocky clusters of similar size. Thus, when the cluster sizes are similar, k-means gives very good performance, but it usually fails for clusters with distinct sizes.
(2) Single-linkage is very sensitive to closely spaced points. As a result, it may be fragile with respect to the presence or absence of a single point.
(3) Multivariate Gaussian does not work well for most of the tests, perhaps because the assumption of a Gaussian distribution is not a good one for our problem.
(4) Centroid-, average-, and complete-linkage give quite consistently good results throughout the tests. The latter two do not require updating centroids during each iteration, and thus may be good candidates for clustering molecular dynamics trajectories.

References

1. Karpen, M. E.; Tobias, D. J.; Brooks, C. L. Biochemistry 1993, 32, 412-420.
2. Kabsch, W. Acta Crystallogr. 1976, 32, 922-923.
3. Ramanathan, A.; Yoo, J. O.; Langmead, C. J. J. Chem. Theory Comput. 2011, 7, 778-789.
4. Shao, E. et al. J. Chem. Theory Comput. 2007, 3, 2312-2334.