Similarity Matrix Generation

In order to compare the proteins of the phylogenomic matrix, we needed a similarity measure. Hamming distances between phylogenetic profiles require the use of thresholds for presence/absence quantification and have been shown to underperform compared to other metrics [1]. Pearson correlations between raw bit score vectors are an improvement in that they allow degrees of sensitivity; however, they are arbitrarily sensitive to the presence of large values in a given column, a phenomenon that happens with some frequency given the long right tail of the bit score distribution [2]. Mutual information [3] is an attractive alternative for naturally discrete data sets, but it is less desirable for continuous data because it requires the estimation of a joint density for each pairwise comparison and is quite sensitive to histogramming criteria [4]. Robust correlation methods such as the Maronna or Quadrant Correlation algorithms [5] are theoretically attractive, but they are computationally intensive and too slow to handle large data sets.

Thus, as a compromise between robustness and simplicity, we used Spearman's rank correlation (with tie correction) rather than the raw Pearson correlation to compute similarities between rows. Using ranks rather than raw scores has several advantages. First, ranks are a natural representation for the ordered data returned from a BLASTP [2] result. Second, ranks are invariant under monotonically increasing transformations such as positive scaling or translation. Finally, rank correlations are a proven and well-characterized way to extract robust similarity information from data sets where the use of Pearson correlation may be questionable. We thus computed rank correlations between all pairs of profiles and retained the top 50 similarity values for each profile as our similarity matrix.
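The rank-correlation step can be illustrated with a minimal sketch. It assumes each phylogenetic profile is a plain list of bit scores and treats a constant profile as having correlation 0; the function names (average_ranks, spearman, top_k_similarities) are hypothetical and are not taken from the original analysis. Tie correction is done here in the usual way, by assigning tied values their average rank and taking the Pearson correlation of the ranks.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Tie-corrected Spearman correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def top_k_similarities(profiles, k=50):
    """For each profile, keep the k most similar other profiles as (index, rho)."""
    ranked = [average_ranks(p) for p in profiles]
    result = []
    for i, ri in enumerate(ranked):
        sims = [(j, pearson(ri, rj)) for j, rj in enumerate(ranked) if j != i]
        sims.sort(key=lambda t: t[1], reverse=True)
        result.append(sims[:k])
    return result
```

Because only ranks enter the computation, replacing a profile by any strictly increasing function of itself (for example, its logarithm) leaves every similarity unchanged.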

Map Construction

The purpose of 2-dimensional ordination is to enable human visualization of high-dimensional data. Given a similarity matrix computed from a given data matrix, we seek to put proteins with similar profiles near each other on a 2-dimensional map. As each protein in the original phylogenomic matrix has more than 200 dimensions, a projection into 2 dimensions reduces the dimensionality by more than 2 orders of magnitude. This data reduction is possible for two reasons. First, the original phylogenomic data matrix is sparse: the majority of entries are zero. Second, heuristically limiting the algorithm to use the top 50 similarity scores allows it to preserve neighborhoods (local accuracy) at the expense of global integrity. In the context of phylogenomic mapping, this means that the 2-dimensional placement is optimized to show groups of proteins which are likely to interact on the basis of coinheritance.

In brief, the dimension reduction problem was framed as an optimization problem: find the representation of a high-dimensional set in a low-dimensional space which keeps similar pairs of proteins as close together as possible. The phylogenomic matrix was used to generate a square matrix of similarities (in our case, rank correlations). This similarity matrix was used to produce a quadratic form, whose minimization, subject to an appropriate constraint, was the multidimensional scaling step of the ordination algorithm. The result of this step was an initial placement of proteins into 2 dimensions. A force-directed placement algorithm was then applied to refine this initial placement and obtain a final set of (x,y) coordinates in the plane for each protein. We used VxInsight (Viswave LLC) to perform both ordination and visualization [6]. Fully interactive phylogenomic maps for more than 200 bacteria are available online at http://myxococcus.syr.edu/phylo.
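The text does not state the exact quadratic form or constraint used inside VxInsight. As an illustration only, one standard formulation of this kind of problem (the graph-Laplacian form used in spectral embedding) is sketched below. It assumes a symmetrized, non-negative similarity matrix $S = (s_{ij})$ over $n$ proteins and an $n \times 2$ coordinate matrix $X$ whose rows $x_i$ are the map positions; all symbols are introduced here for the sketch and do not come from the original.

```latex
\begin{aligned}
\min_{X \in \mathbb{R}^{n \times 2}} \quad
  & \sum_{i,j} s_{ij}\,\lVert x_i - x_j \rVert^2
    \;=\; 2\,\operatorname{tr}\!\left(X^{\top} L X\right),
  \qquad L = D - S, \quad D_{ii} = \sum_{j} s_{ij}, \\
\text{subject to} \quad
  & X^{\top} X = I_2, \qquad X^{\top} \mathbf{1} = 0 .
\end{aligned}
```

Under these assumptions, and for a connected similarity graph, the minimizer is given by the two eigenvectors of $L$ with the smallest nonzero eigenvalues. The constraints rule out the trivial solution in which every protein is placed at the same point. Pairs with large $s_{ij}$ are penalized most for being far apart, which matches the stated goal of keeping similar proteins close together.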

Noise Filtering

Our goal was to use phylogenomic information to distinguish between protein clusters with different evolutionary histories. However, despite their disparate pasts, all proteins in a given genome have one thing in common: they have a trivial hit to at least one ORF within their own genome. Including the fact that every protein in M. xanthus has a hit in M. xanthus adds unremarkable and computationally undesirable similarity information. This is because the spectral projection at the core of the ordination step is analogous to a principal components representation of the data matrix. If included, the strong yet uninteresting signal from self-hit columns would dominate the first of our two coordinate eigenvectors and degrade our ability to interpret differences in location on the phylogenomic map as differences in evolutionary history. Furthermore, including self-hit columns is undesirable from an annotation standpoint, as it involves mildly circular inference: a draft genome would be used to annotate itself, rather than using other genomes (of presumably known quality) as independent data sources. For these reasons we removed the self-hit column from the data matrix and excluded from the analysis those proteins which only had BLASTP hits to M. xanthus sequences.

In order to ensure that the proteins on our map had nontrivial phylogenetic profiles, we further restricted our data set to only those proteins with hits in at least five different sequenced genomes. Without this step, we encountered a problem different from the spurious agglomeration that arose from the inclusion of M. xanthus self-hits: the inclusion of proteins with very sparse row entries caused spurious differentiation. That is, the presence of proteins with only one or two hits across 200 genomes resulted in a profusion of mini-mountains with only one species hit across much of our landscape. This reduced the dynamic range for discrimination between other protein groups.
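The two filtering rules can be sketched as follows. This assumes the data matrix is held as a dictionary from protein identifier to a list of bit scores, one per genome, with 0 meaning "no hit"; the function name filter_profiles and its parameters are hypothetical. Proteins whose only hits are in the self genome are dropped automatically here, because they have no hits left once the self-hit column is removed.

```python
def filter_profiles(matrix, genomes, self_genome, min_genomes=5):
    """Drop the self-hit column, then keep proteins hit in >= min_genomes genomes.

    matrix:  dict mapping protein id -> list of bit scores (0 means no hit),
             one score per entry of `genomes`.
    Returns (kept_genomes, filtered_matrix).
    """
    self_col = genomes.index(self_genome)
    kept_genomes = genomes[:self_col] + genomes[self_col + 1:]
    filtered = {}
    for protein, scores in matrix.items():
        # Remove the score against the protein's own genome.
        row = scores[:self_col] + scores[self_col + 1:]
        # Count the remaining genomes in which the protein has any hit.
        n_hits = sum(1 for s in row if s > 0)
        if n_hits >= min_genomes:
            filtered[protein] = row
    return kept_genomes, filtered
```

The default of five genomes follows the threshold stated in the text; the toy genome names used to exercise the function would be placeholders.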

Paralogy Correction

Proteins with similar primary sequences tend to have similar BLAST neighbors, and hence, similar phylogenetic profiles. We were interested in those proteins which were frequently coinherited, yet had different primary sequences and BLAST neighbors, as these were the potentially novel linkages that could not be discerned from an analysis of BLAST alone. For each pair of proteins on the map, we extracted the corresponding sets of GenInfo Identification (GI) numbers returned by a run of BLASTP against a local database of sequenced microbial genomes. We then computed the Jaccard coefficient between these two sets, A and B, which is defined as $|A \cap B| / |A \cup B|$. We considered two proteins to have an association predictable from BLAST alone if they had a Jaccard coefficient greater than 0.5. Intuitively, a high Jaccard coefficient means that a manual comparison of the two BLASTP-generated GI lists would reveal much overlap and hint at functional similarity. When this network was superimposed on the phylogenomic map, we could detect at a glance whether the colocation of a given pair of proteins was due to sequence similarity or whether it was due to nontrivial similarity in coinheritance patterns. A novel phylogenomic linkage between two proteins indicates that they are present in the same sets of organisms despite having different primary sequences, and is indicated on the map by proteins that are nearby but unconnected (see Figure 2c).

Mountain Discretization

To facilitate map analysis, we used an Expectation Maximization algorithm based on Gaussians with spherical covariance matrices [7] to discretize the 2-dimensional coordinates of each map into individual mountains. To initialize the number of basis Gaussians, the gap statistic of Tibshirani et al. [8] was applied to provide a robust estimate of the number of mountains present in the reduced 2-dimensional map space.
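The Jaccard comparison described under Paralogy Correction can be sketched as below. It assumes the BLASTP results have already been reduced to one set of GI numbers per protein; the names jaccard and blast_predictable_pairs are hypothetical, and returning 0 for two empty sets is a choice made for this sketch rather than something stated in the text.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections of GI numbers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty hit lists are treated as unrelated
    return len(a & b) / len(union)

def blast_predictable_pairs(gi_sets, threshold=0.5):
    """Return protein pairs whose GI sets have a Jaccard coefficient > threshold.

    gi_sets: dict mapping protein id -> set of GI numbers from its BLASTP run.
    """
    pairs = []
    for p, q in combinations(sorted(gi_sets), 2):
        if jaccard(gi_sets[p], gi_sets[q]) > threshold:
            pairs.append((p, q))
    return pairs
```

Pairs returned by blast_predictable_pairs correspond to the edges superimposed on the map; proteins that sit close together without such an edge are the candidates for novel linkages.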

References

1. Enault, F., Suhre, K., Abergel, C., Poirot, O. & Claverie, J.M. Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics 19 Suppl 1, i105-i107 (2003).
2. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
3. Date, S.V. & Marcotte, E.M. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol. 21, 1055-1062 (2003).
4. Paninski, L. Estimation of entropy and mutual information. Neural Computation 15, 1191-1253 (2003).
5. Chilson, J., Ng, R., Wagner, A. & Zamar, R. Parallel computation of high dimensional robust correlation and covariance matrices. in Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 533-538 (ACM Press, Seattle, WA, USA; 2004).
6. Davidson, G.S., Wylie, B.N. & Boyack, K. Cluster stability and the use of noise in interpretation of clustering. in Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS'01), 23-30 (IEEE Computer Society, 2001).
7. McLachlan, G.J. & Krishnan, T. The EM Algorithm and Extensions. (John Wiley & Sons, New York, 1997).
8. Tibshirani, R., Walther, G. & Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. Roy. Stat. Soc. B 63, 411-423 (2001).