A Novel Method for Similarity Analysis of Protein Sequences

Similar documents
Novel 2D Graphic Representation of Protein Sequence and Its Application

Algorithm for Coding DNA Sequences into Spectrum-like and Zigzag Representations

Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences

Research Article Novel Numerical Characterization of Protein Sequences Based on Individual Amino Acid and Its Application

On spectral numerical characterizations of DNA sequences based on Hamori curve representation

On Description of Biological Sequences by Spectral Properties of Line Distance Matrices

Sequence Alignment Techniques and Their Uses

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Chapter 7: Rapid alignment methods: FASTA and BLAST

An Introduction to Sequence Similarity ( Homology ) Searching

Reducing storage requirements for biological sequence comparison

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

EECS730: Introduction to Bioinformatics

Pairwise sequence alignment and pair hidden Markov models

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis

A profile-based protein sequence alignment algorithm for a domain clustering database

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Stage-structured Populations

Week 10: Homology Modelling (II) - HHpred

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

Computational Genomics and Molecular Biology, Fall

Biologically significant sequence alignments using Boltzmann probabilities

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

BMD645. Integration of Omics

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Investigating Evolutionary Relationships between Species through the Light of Graph Theory based on the Multiplet Structure of the Genetic Code

Small RNA in rice genome

Efficient Remote Homology Detection with Secondary Structure

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Describe how proteins and nucleic acids (DNA and RNA) are related to each other.

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

In-Depth Assessment of Local Sequence Alignment

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon

Sequence analysis and comparison

Data Mining in Bioinformatics HMM

Hidden Markov Models

Gibbs Sampling Methods for Multiple Sequence Alignment

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Local Alignment Statistics

Chemistry in Biology. Section 1. Atoms, Elements, and Compounds

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Genomics and bioinformatics summary. Finding genes -- computer searches

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

BIOC 460 General Chemistry Review: Chemical Equilibrium, Ionization of H 2 O, ph, pk a

Chapter 2 The Chemistry of Biology. Dr. Ramos BIO 370

Using Bioinformatics to Study Evolutionary Relationships Instructions

Lecture # 3, 4 Selecting a Catalyst (Non-Kinetic Parameters), Review of Enzyme Kinetics, Selectivity, ph and Temperature Effects

Materials Science Forum Online: ISSN: , Vols , pp doi: /

B DAYS BIOCHEMISTRY UNIT GUIDE Due 9/13/16 Monday Tuesday Wednesday Thursday Friday 8/22 - A 8/23 - B

Exercise 5. Sequence Profiles & BLAST

Single alignment: Substitution Matrix. 16 march 2017

The Distribution of Short Word Match Counts Between Markovian Sequences

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Multiple Sequence Alignment

DO NOT WRITE ON THIS TEST Topic 3- Cells and Transport

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

The diagram below represents levels of organization within a cell of a multicellular organism.

Properties of the stress tensor

Page 1. Name: UNIT: PHOTOSYNTHESIS AND RESPIRATION TOPIC: PHOTOSYNTHESIS

What is the expectation maximization algorithm?

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines

2.1 The Nature of Matter

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CAP 5510 Lecture 3 Protein Structures

Introduction to Bioinformatics

ProtoNet 4.0: A hierarchical classification of one million protein sequences

Chemical and Physical Properties of Organic Molecules

Sequence Database Search Techniques I: Blast and PatternHunter tools

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

Hidden Markov Models

Biological Systems: Open Access

A Novel Prediction Method of Protein Structural Classes Based on Protein Super-Secondary Structure

Bioinformatics Chapter 1. Introduction

Equilibrium constant

Algorithms in Bioinformatics

Analysis and Biological Significance of Bivalent Ligand Binding Reactions. Duane W. Sears

المرحلة الثانية- قسم علوم الحياة الجامعة المستنصرية

Linear Equations and Matrix

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

On Spaced Seeds for Similarity Search

An Artificial Neural Network Classifier for the Prediction of Protein Structural Classes

number Done by Corrected by Doctor Nafeth Abu Tarboush

O 3 O 4 O 5. q 3. q 4. Transition

Buffer solutions المحاليل المنظمة

Sequence Analysis '17 -- lecture 7

Homework - I. Hand draw the structure of an eukaryotic cell with all organelles clearly drawn and describe the function of each organelle.

Stephen Scott.

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

SUPPLEMENTARY INFORMATION

Computational Biology

CSCE555 Bioinformatics. Protein Function Annotation

Transcription:

5th International Conference on Advanced Design and Manufacturing Engineering (ICADME 2015) A Novel Method for Similarity Analysis of Protein Sequences Longlong Liu 1, a, Tingting Zhao 1,b and Maojuan Liu 1,c * 1 School of Mathematical Sciences, Ocean University of China, Qingdao 266100, P.R. China a xinxijishu@ouc.edu.cn, b sxzhaott@126.com, c 18706480795@163.com Keywords: protein sequence, similarity analysis, feature vector, cluster analysis, mitochondria NADH dehydrogenase Abstract. A new method for similarity analysis of protein sequences is presented in this paper. On the basis of positions, proportion difference and various physicochemical properties of 20 kinds of amino acid in different protein sequences, representative information was extracted from protein sequence and converted into a numeric vector, thus further similarities of protein sequences were analyzed by studying the similarities between vectors. To facilitate the comparison between protein sequences of different length, every protein sequence is first mapped to a fixed-length vector, of which the vector information is relative position of amino acids. Then percentage of 20 kinds of amino acids in the sequence and 3 physicochemical properties are combined to constitute physicochemical information vector. Finally, a one-dimensional feature vector with 80 elements(feature vector) representing a protein sequence is synthesized. The shortest distance method was applied for cluster analysis on feature vectors so as to analyze similarities in protein sequences. In the numerical experiment part of the article, similarity analysis was conducted for 9 different species of the mitochondrial NADH dehydrogenase. The result of numerical experiment is consistent with the biological fact, which validates the effectiveness of model to a certain extent. Introduction In the field of biology, there are a variety of mathematical methods such as PSI-BLAST[1], Hidden Markov Model (HMM)[2], etc. used to analyze the protein sequences and thus to search the classification structure and function. In recent years, graphical methods for protein analysis have been proposed[3]. One class of methods is to represent 20 amino acids that compose protein by alphabet Ω={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}, so that any protein can be regarded as a sequence composed of 20 letters. Randic[4], P. He[5] and Yao[6] proposed methods for similarity analysis of protein sequences based on 20 kinds of amino acids which considered either the location information or the physicochemical properties of protein sequence.nandy[7], Randic[8] and Raychaudhury[9] have put forward a new method, namely compressed matrix invariant method to investigate the mapping relationship between biological sequences. However, due to the nature of the complexity of the protein, such methods to reveal the nature of the difference between protein sequences are not enough. A new model for study on similarity of protein sequences was proposed in this paper based on position information of 20 amino acids in the protein sequence combined with their physicochemical properties. Methods Construction of amino acid position information vector Mathematically speaking, a protein sequence may be regarded as a string of 20 amino acids on the alphabet Ω={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}. Thus, a protein sequence of 20 G = g, of amino acids in this order that can be mapped (marked as f ) into an adjoin matrix ( ij ) 20 which the rows represent 20 amino acids(the order is same to Ω), the columns represent the positions of amino acids in the sequence, N represents the length of protein sequence (number of amino acid residues). The element g ij of matrix G is defined as follow: (1) If the ith alphabet of Ω appears in the jth position of the sequence, then N 2015. The authors - Published by Atlantis Press 2216

(2) If the ith alphabet of Ω disappears in the jth position of the sequence, then gij g ij = 0 For example, in a given protein sequence YCFESRNDPAKDPVILWLNGGPGCSSLTGA, the obtained adjoin matrix G according to the above mapping f is: 0 0 L 30 0 2 0 G L = M M O M 1 0 L 0 20 30 According to the definition of adjoin matrix, it is sparse matrix, while such matrix representation of protein sequence is unique. Therefore, the conversion successfully maps a protein sequence into a matrix. On the contrary, the matrix can be converted into the original protein sequence. In order to extract the sequence information contained in adjoin matrix, the Euclidean distance between each two rows of adjoin matrix is obtained, so that we can get distance matrix D= ( d ij) 20 20. Each element d ij of D refers to the Euclidean distance between the i type and j type of amino acids, namely N ( ) 2 d = g g, i, j = 1,2,...,20 ij ik jk k = 1 Where N refers to the length of protein sequence. As can be seen, D is a real symmetric matrix. The elements of matrix D are normalized, then the elements of each row are averaged, 20 d ij j = 1 α α i =, 1,2,...,20 20 i = i.e; where i represents the average of distances of the amino acid i with 20 α = kinds of amino acids, so a one-dimensional column vector with 20 elements ( α ) 1, α2,..., α20 can be obtained, which is the position information vector of protein sequence. Construction of the physicochemical information vector Physicochemical properties of amino acids are the major basis of protein to fold into spatial structure. Since the value pk α of amino acid refers to the relative difficulty for amino acid to release + its dissolvable proton. The two functional groups α COOH and α NH3 are weak acidic groups that release protons via ionization in aqueous solution, while the capacity of releasing and receiving protons is the fundamental chemical property of protein which is also involved in the catalytic activity of the enzyme and the structure composition of the protein. pi is the isoelectric point of protein which can reflect the general information of protein composed of amino acid. Hence, the 3 physicochemical properties are selected for mathematical description of the protein sequence. The corresponding pk α (COOH), pk α ( NH + 3 ) and pi (25 C)values are respectively given in reference [5]. However, these physicochemical values are same in different protein sequences under same conditions, which mean the differences comparison between protein sequences cannot only rely on these indicators. Thus, the proportions of 20 amino acids in protein sequence as well as the above 3 physicochemical indicators should be taken into consideration. First of all, make a statistics for the percentage of 20 amino acids in protein sequence, namely: ni β i =, i = 1,2, L,20 (2) N Where n refers to the present times of the amino acid i in the protein sequence, N indicates i the length of the protein sequence. Therefore, the percentage vector β ( β, β,..., β ) 2217 = j (1) = of 20 kinds

of amino acids is obtained. Then respectively multiply the three physicochemical indicators of 20 β y, y, y, where y, j= 1,2,3 kinds of amino acids with corresponding percentages, i.e. ( ) i i1 i2 i3 refers to the physicochemical indicator of i amino acid. Thus a one-dimensional vector with 60 elementsγ = ( γ1, γ2,..., γ60) composed of 3 physicochemical indicators of 20 amino acids (namely physicochemical information vector) is obtained. The physicochemical information vector is normalized later. Synthesis of the feature vectors of protein sequences First of all, transpose the one-dimensional position information vector with 20 elements α α = αα,,..., α, then combine α with the one-dimensional obtained in 2.1 and get ( ) physicochemical information vector with 60 elementsγ = ( γ1, γ2,..., γ60) and get a one-dimensional feature vector with 80 elements P = ( αγ, ) = ( α1,... α20, γ1,..., γ60) = ( p1, p2, L, p80), namely the 20 amino acids location information and physicochemical characteristics vector in protein sequence. Here we have completed the conversion from a protein sequence to the value vector. A one-dimensional feature vector with 80 elements can be obtained for protein sequence with any length via above methods. Only by converting protein sequences of different length without loss of information values into fixed-length vector can further comparative analysis be conducted. Algorithm Specific algorithm is summarized as follow: a. To map the protein sequence containing 20 kinds of amino acid residues into a unique adjoin matrix G 20 Ν. Calculate the Euclidean distances between rows of adjoin matrix and obtain the distance matrix D 20 20 and then normalized the distance matrix D 20 20. Calculate the average of row of D 20 20 and obtain a one-dimensional column vector with 20 elements α ( α, α,..., α ) =, i.e. the position information vector. b. Conduct statistics for the percentage of 20 amino acids. Respectively multiply the three physicochemical indicators of 20 kinds of amino acids with corresponding percentages, a one-dimensional physicochemical information vector with 60 γ = ( γ, γ,..., γ ) elements 1 2 60 is obtained. c. Transpose the vector ( ) α = α, α,..., α obtained in step a and combine P= ( p, p, L, p ). ij α with γ, then, get a one-dimensional feature vector with 80 elements 1 2 80 d. Calculate the Euclidean distances between every 2 different protein sequences based on the feature vectors of protein sequences. e. With the shortest Euclidean distance for hierarchical clustering, further analyze protein sequences similarity. Result The 9 different species of mitochondrial NADH dehydrogenase (ND5) sequences were selected from NCBI website (data from paper [10]). The proposed method was applied to the similarity analysis of 9 kinds of ND5 protein, which means first obtain a one-dimensional feature vectors with 80 elements of 9 species of protein sequences. In order to reduce the impact of numerical difference on numerical experiment, the feature vector data were normalized. Eq.(3) is adopted to calculate the Euclidean distance of pairwise species, so that a 9 9real symmetric matrix was obtained as shown in Table 1. 2218

Table 1 Euclidean distances of nine species ND5 protein sequences corresponding to the feature vectors Species Hum Gorilla P.Chimp C.Chimp F.W B.Whal Rat(7) Mou Oposs an(1) (2) anzee(3) anzee(4) hale(5) e(6) se(8) um(9) Human(1) 0.000 Gorilla(2) 0.186 0.000 P.Chimpanzee(3) 0.174 0.225 0.000 C.Chimpanzee(4) 0.138 0.209 0.121 0.000 F. Whale(5) 0.280 0.286 0.280 0.317 0.000 B.Whale(6) 0.296 0.327 0.291 0.330 0.097 0.000 Rat(7) 0.635 0.637 0.577 0.635 0.546 0.543 0 Mouse(8) 0.708 0.717 0.672 0.718 0.604 0.595 0.238 0 Opossum(9) 0.942 0.913 0.889 0.949 0.792 0.808 0.464 0.509 0 The shortest-distance method was used in this paper for hierarchical cluster analysis. The result was provided in Tab2. D 1 after the first merger based on the shortest distance Table 2 The inter-class distance ( ) 1 2 3 4 5,6 7 8 9 1 0 2 0.186 0 3 0.174 0.225 0 4 0.138 0.209 0.121 0 5,6 0.280 0.286 0.280 0.317 0 7 0.635 0.637 0.577 0.635 0.543 0 8 0.708 0.717 0.672 0.718 0.595 0.238 0 9 0.942 0.913 0.889 0.949 0.792 0.464 0.509 0 we can see that the ND5 protein sequence of Fin Whale(5) and Blue Whale(6) are the most similar. ND5 protein sequence in Pigmy Chimpanzee(3),Common Chimpanzee(4),Human(1) and Gorilla(2) are very similar, while Table.3 further demonstrates the ND5 protein {Common Chimpanzee(4),Human(1)} are more similar, while ND5 protein in Rat(7) and Mouse(8) are more similar. On the other hand, we found that the difference of ND5 protein between specie (9) and other species are greatest. In addition, the results obtained from numerical experiments are consistent with the fact of biological evolution[11,12], which also demonstrates the effectiveness of this method to some extent. Conclusion First, a one-dimensional feature vector with 80 elements was built in this paper, including spatial information and physicochemical information of protein sequence; then, the analysis of sequences similarity was conducted via hierarchical clustering for feature vectors of protein sequences; finally, in the numerical experiment part, similarity analysis was carried out for nine species of mitochondrial NADH dehydrogenase. The numerical results coincide with the fact that biological evolution. Compared with the previous methods, the proposed method in this paper has a small amount of calculation but with high accuracy, which is an effective new method. Acknowledgment The authors would like to thank all of the researchers who made publicly available the data used in this study. The authors would like to thank the University Basic Research Foundation 2219

(No:201362031) and the National Natural Science Foundation of China (No: 61303145) for the support to this work. References [1] Altschul SF, Madden TL, Schafer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: A new generation of protein data, Nucleic Acids Research, 25(17) (1997) 3389-3402. [2] Krogh A, Brown M, Mian I, Sjolander K, Haussler D. Hidden markov models in computational biology: Applications to protein modeling, Journal of Molecular Biology, 235 (1994) 1501-1531. [3] Randic M. 2-D Graphical represeatation of proteins based on virtual genetic code, SARQSAR Environ Res, 15 (2004) 147-157. [4] Randic M, J. Zupan, A.T. Balaban, D. Vikic-Topic, D. Plavsic, Graphical representation of proteins, Chem. Rev. 111 (2011) 790 862. [5] P.A. He, J.Z. Wei, Y.H. Yao, Z.X. Tie, A novel graphical representation of proteins and its application, Physica A,2012, 391:93 99. [6] Maggiora GM, Shanmugasundaram V. Molecular similarity measures, Methods Mol Biol. 672 (2011) 39-100. [7] Nandy A, Basak SC, Simple numerical descriptor for quantifying effect of toxic substances on DNA sequences, J Chen InfComputSci. 40 (2000) 915. [8] Randic M.,Mehulic, K.,Vukicevic, D.,Graphical representation of proteins as four-color maps and their numerical characterization, J.Mol.Graph.Model. 27 (2009) 637 641. [9] Rayehaudhury C, Nandy A. Index scheme and similarity measures for macromolecular sequences, J ChemInfComputSci, 39 (1999) 243. [10] Maggiora GM, Shanmugasundaram V. Molecular similarity measures, Methods Mol Biol. 672 (2011) 39-100. [11] Yao YH, Dai Q, Li C etc. Analysis of similarity/dissimilarity of protein sequences, Proteins, 73(4) (2008) 864-871. [12] Shiyuan Wang, FengchunTian, YuQiu, XiaoLiu, Bilateral similarity function: A novel and universal method for similarity analysis of biological sequences, Journal of Theoretical Biology 265 (2010) 194 201. 2220