A profile-based protein sequence alignment algorithm for a domain clustering database

Lin Xu (1,2), Fa Zhang (1) and Zhiyong Liu (3)
(1) Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
(2) Graduate School of Chinese Academy of Sciences
(3) National Natural Science Foundation of China

Abstract- Aiming at the two main shortcomings of homology modeling, we have designed and established a domain clustering database. Searching this database is fundamental to its use. However, current alignment algorithms are mainly based on sequence and ignore the structural conservation within domains. This paper proposes a profile-based alignment that incorporates structure information into the profile, exploiting the characteristics of our domain database. We designed an experiment within the database. The results show that both the quality and the sensitivity of our scheme are better than those of the pure Smith-Waterman algorithm and of sequence-based profile algorithms. We believe that this work can help to improve protein structure prediction.

I. INTRODUCTION

Sequence alignment is a fundamental tool in computational biology and bioinformatics. It yields a great deal of useful information, such as which genes have the same function, which RNAs belong to the same class and which proteins have the same structural topology. Moreover, in protein structure prediction, obtaining the alignment between a protein sequence of unknown structure (the query) and its homologs of known structure (the templates) is the most fundamental step of the modeling process, and the quality of the alignment strongly affects the prediction result.

Generally speaking, there are three categories of methods for creating an alignment: single-sequence-based methods, multiple sequence alignment, and profile-based methods.

Single-sequence-based methods use standard dynamic programming to generate the alignment, for example the Needleman-Wunsch algorithm [1] and the Smith-Waterman algorithm [2]. Since these methods use only sequence information, alignment quality drops sharply when the sequence identity is below 30%.

Multiple sequence alignment aligns three or more sequences. Since simultaneous alignment of several sequences is an NP-hard problem, most methods are heuristic, such as ClustalW [3], DIALIGN [4] and T-COFFEE [5]. However, alignment quality and computational cost are two critical problems for this kind of method.

Profile-based methods have advanced rapidly since the development of the PSI-BLAST program by Altschul et al. [6]. These methods improve alignment quality by using a profile to describe the characteristics of a set of similar sequences, and by aligning a sequence or a profile with another profile. Because the profile accurately records the most relevant information from the multiple sequence alignment, the quality of these methods is better than that of the others. Several groups have published profile-based alignment methods, such as PSI-BLAST [6] and HMMER [7]. Most profile-based methods use the standard Smith-Waterman local alignment method, but they vary significantly in a number of important respects, such as scoring functions, gap penalties, weighting schemes and whether a secondary-structure substitution matrix is added.

Although these methods use different information and different methodologies, an accurate alignment remains a major challenge, especially when the sequence similarity falls into the twilight zone (<=25% sequence identity). This largely results from the difficulty of obtaining a scoring scheme that correctly accounts not only for mutations but, more importantly, for the insertions and deletions that occur during evolution. It is generally accepted that a structural alignment based only on the three-dimensional coordinates accurately represents the corresponding residues as well as the boundaries and sites of any gaps.

/06/$ IEEE

As stated above, sequence alignment is a critical factor in protein structure prediction. Existing modeling techniques still use sequence alignment to select structural templates. However, these techniques suffer from several shortcomings that limit the practical applicability of comparative modeling:

1. Since the number of known structures is much smaller than the number of sequences, the lack of templates is a big problem.
2. The query-template alignment quality drops sharply when the sequence identity falls below 30%.

To address the first problem, we designed and constructed a domain-based template database in our previous study. For each template in this database, we superimposed the corresponding structures and provided a multiple structure alignment based only on the three-dimensional information of the structures involved. The details are described in [8].

To address the second problem, we present a method to extract a profile from the multiple structure alignment of each template in our database. We also develop a query-template alignment method that uses this profile. Preliminary experimental results show that our profile-based alignment method significantly improves the accuracy of structural template selection.

The paper is organized as follows. The next section briefly reviews the construction of the domain-based template database. The subsequent section describes the profile-based alignment using our domain-based templates. We then present and discuss experimental results on some datasets. Finally, we give the main conclusions of the paper and discuss future work.

II. CONSTRUCTION OF THE DOMAIN-BASED TEMPLATE DATABASE

Since proteins evolve with their structural and functional domains as independent units, proteins and their structures can largely be described as combinations of conserved protein domains. This motivates us to construct a domain-based template database, which increases the likelihood of finding widely applicable structure templates.

We first searched for all InterPro [9] domains in the PDB [10] using the iprscan program [11], and mapped the corresponding protein structures in the PDB. All PDB protein sequences in this project were parsed directly from the structural records reorganized by the MSD database. From protein family and superfamily databases such as Pfam [12], SCOP [13], SMART [14] and TIGRFAMs [15], we then used HMMER [7] to obtain the consensus sequence for each InterPro domain. Next, we partitioned the structural correspondences from the PDB for each InterPro domain and constructed a primary domain cluster. For each domain cluster, we compared the sequences of all domains involved with the relevant consensus sequence, then refined the cluster by removing every structure whose sequence identity or structure similarity was below a pre-defined threshold. Since all the domains in each cluster are conserved in both sequence and structure, we adopt the conserved structures as the template for the relevant domain cluster. The domain-based template database can be accessed through its website.

[Fig. 1. Two typical structural ensembles of conserved domain clusters (panels A and B).]

Two typical structure ensembles are illustrated in Fig. 1. Ensemble A shows the structure ensemble of domain cluster IPR00008, which includes 8 individual structures. The RMSD values of all the structures involved are less than 1 Å. Ensemble B shows the structure ensemble of the PDZ domain cluster (IPR001478), which includes 4 structures. The RMSD values of all the structures involved are less than 3 Å.
To highlight the conserved structural regions in each domain cluster, we superimposed these conserved structures using the Dali [16] or CE [17] algorithm. Based on the conserved structural ensemble of each domain cluster, we also generated a multiple structural alignment for each cluster purely from the backbone coordinates of the residues. Since this structural alignment is independent of sequence similarity, it provides more sensitive and position-specific signatures than a sequence alignment. The details of the database construction are described in [8].

III. PROFILE-BASED ALIGNMENT USING THE DOMAIN-BASED TEMPLATE DATABASE

An accurate and complete query-template alignment is critical in comparative modeling. Most current modeling techniques generate the query-template alignment from sequence information alone and ignore structure information. Pairwise alignment algorithms such as Smith-Waterman, FASTA [18] and BLAST [19] cannot capture the full joint information content of a group of sequences, even when the consensus sequence of the multiple alignment is used as the query.

Since Gribskov first introduced the idea of using profiles to search databases [20], sequence alignment profiles have been shown to be very powerful for creating accurate sequence alignments. In recent years, many strategies, including the use of structural information and surface accessibility, have been proposed to determine the profile or the position-specific gap penalties.

Since our template database provides a multiple structural alignment and a superimposed structure ensemble for each domain cluster (template), we believe that more reliable and sensitive profiles can be built by using both the multiple structure alignment and the relevant structure information. In order to build a profile for each domain, we analyzed the sequence and structure characteristics of the database statistically, on the assumption that most domains in the database are conserved in both sequence and structure.
A profile for each domain cluster can then be built from the multiple structure alignment and the relevant structure information. To validate our profile-based alignment, the query sequence is searched against the database using the position-specific scoring matrix of each domain cluster.

A. Conservation Statistics from Sequence and Structure Information in Domain Database

In order to build the profile of each template from its sequence and structure information, we first analyzed statistically the relationship between the residues and the structural information in the template database. For each amino acid position in each template, we classified the coordinates and the residue type based on the superimposed structure ensemble. Here, the spatial location of the alpha carbon atom represents each amino acid. We extracted the alpha carbon atoms from each structure, aligned these positions and clustered them by calculating the distance between every two positions. We used a hierarchical clustering algorithm with a distance cutoff of 1 Å.

Statistics for several domain clusters in the database are listed in Table I. In the table, column 1 is the domain cluster ID, column 2 is the percentage of coordinate clusters that contain only one amino acid type, and columns 3 to 7 are the percentages of coordinate clusters that contain two to six residue types, respectively. The table gives the distribution of the number of amino acid types per coordinate cluster. It shows that the amino acids in one coordinate cluster are most likely to be of a single residue type when a cutoff of 1 Å is used to cluster the positions.

TABLE I. SOME STATISTICS RESULTS FROM THE DATABASE
[Table values not recoverable. Columns: domain cluster ID; percentage of coordinate clusters containing one, two, three, four, five and six residue types.]

Fig. 2 shows the same analysis for the whole database. Over all domain clusters, 95.66% of the coordinate clusters contain one residue type, 3.63% contain two residue types, 0.49% contain three residue types, and only 0.23% contain more than three residue types. Therefore, the overwhelming majority of the amino acids at one position belong to the same residue type. This conservation across sequence and structure information helps us build a more accurate profile.

[Fig. 2. The distribution of the number of residue types in one coordinate cluster.]

B. Building Profile from Sequence and Structure Information

Based on the sequence and structure information, we build a profile for each domain cluster. The profile is defined as a position-specific scoring matrix M(p,a) composed of 21 columns and m rows (m = length of the alignment). The first 20 columns of each row specify the scores of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position.

At position p of an alignment A of N structures, AA(a) is the class of amino acid type a, SS(i) is the i-th class of alpha carbon coordinates obtained by the clustering described in the previous section, and W(p,a) is the weight for the appearance of amino acid a at position p.

For the sequence information, the weight of each amino acid type is determined as follows. Suppose that there are n(a) items in residue class AA(a); then the weight of class AA(a) is

$$W_1(p,a) = \frac{n(a)}{N}$$

For the structure information, the weight of each class is determined as follows. Suppose that there are n(s_i) items in class SS(i); then the weight is

$$W_2(p,s_i) = \frac{n(s_i)}{N}$$

W(p,a) is then calculated from W_1 and W_2:

$$W(p,a) = \Big[ W_1(p,a) \cdot \sum_{\text{all } SS(i)} n(a,i) \cdot W_2(p,s_i) \Big] \cdot \sigma$$

Here n(a,i) is the number of occurrences of amino acid a in class SS(i), and σ is a normalization factor which ensures that

$$\sum_{a \in \{\text{amino acid types}\}} W(p,a) = 1$$

The position-specific scoring matrix M(p,a) is then given by

$$M(p,a) = \sum_{b=1}^{20} W(p,b) \cdot Y(a,b)$$

where Y(a,b) is a scoring matrix such as BLOSUM62.
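The construction of one profile row can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, single-linkage clustering stands in for the unspecified hierarchical clustering, n(a,i) is read as the number of occurrences of amino acid a in coordinate class SS(i), and a toy match/mismatch function stands in for Y(a,b), where a real run would load BLOSUM62. The gap-penalty column of the profile is not covered here.

```python
from collections import Counter
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_substitution(a, b):
    """Stand-in for Y(a,b); an assumption, not BLOSUM62."""
    return 4 if a == b else -1

def cluster_coords(coords, cutoff):
    """Single-linkage clustering (assumed linkage): points within
    `cutoff` of each other, directly or through a chain, share a label."""
    parent = list(range(len(coords)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if dist(coords[i], coords[j]) <= cutoff:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(coords))]

def profile_weights(residues, coords, cutoff=1.0):
    """W(p,a) for one alignment position, from the N residues and
    their C-alpha coordinates in the superimposed ensemble."""
    n_total = len(residues)
    ss = cluster_coords(coords, cutoff)
    w1 = {a: c / n_total for a, c in Counter(residues).items()}  # W1(p,a)
    w2 = {s: c / n_total for s, c in Counter(ss).items()}        # W2(p,s_i)
    n_ai = Counter(zip(residues, ss))                            # n(a,i)
    raw = {a: w1[a] * sum(n_ai[(a, s)] * w2[s] for s in w2) for a in w1}
    sigma = 1.0 / sum(raw.values())                              # normalization
    return {a: value * sigma for a, value in raw.items()}

def pssm_row(weights, subst=toy_substitution):
    """M(p,a) = sum over b of W(p,b) * Y(a,b), for all 20 amino acids."""
    return {a: sum(w * subst(a, b) for b, w in weights.items())
            for a in AMINO_ACIDS}

# One position observed in four superimposed structures (made-up data).
weights = profile_weights(
    ["L", "L", "L", "I"],
    [(0, 0, 0), (0.2, 0, 0), (0.1, 0.1, 0), (3, 0, 0)],
    cutoff=1.0,
)
row = pssm_row(weights)
```

In this made-up position the three leucines fall into one coordinate class and the isoleucine into another, so leucine receives a larger weight than its raw frequency of 0.75, which is the effect the combined weighting is meant to produce.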
The position-dependent penalties for insertions and deletions in the profile can be set to a high value to prevent insertions at positions where no gaps occur, and to a low value to allow insertions in regions where insertions are observed in the alignment. The penalty gap(l) applied for creating a gap while matching the profile to the query is given by

$$gap(l) = gap \cdot [\,gap\_open + gap\_ext \cdot l\,]$$

in which gap is the penalty given in the last column of the profile, l is the number of residue positions in the gap, and gap_open and gap_ext are the penalties for gap opening and gap extension, respectively.

C. Profile-Based Alignment

Since our profile accurately records both sequence and structure properties, we can use the Smith-Waterman local alignment algorithm with the profile of each domain cluster to find the domain to which the query sequence most likely belongs. The major difference between our profile-based alignment and both plain dynamic programming and other profile-based alignment algorithms lies in the scoring scheme. Our profile-based alignment uses not only the sequence information derived from the domain cluster, but also the structure information extracted from the superimposed structure ensembles. In plain dynamic programming, by contrast, the score is based on comparing the amino acids at corresponding positions of two sequences, and other profile-based alignment algorithms mostly use sequence information derived from family sequences.

IV. RESULTS AND DISCUSSIONS

To evaluate the performance of the alignment scheme described in this paper, we tested it on the whole database. In each domain cluster, the reference sequence is the one whose structural distance to the other members is the smallest. We selected the sequence structurally most remote from the reference as the benchmark. The benchmark sequence was then searched against the whole database with our profile-based alignment algorithm.

Using the statistics obtained from the database, we classified the domain clusters into four types: sequence and structure conserved, structure conserved, sequence conserved, and mixed. Conservation is defined as the number of amino acid types, or of 3D coordinate clusters, being less than half of the total number at each position in the alignment. A sequence and structure conserved cluster is one whose amino acid types and 3D coordinates are both conserved. A structure conserved cluster is one in which only the 3D coordinates are conserved. In a sequence conserved cluster, only the amino acid types satisfy the conservation condition. A mixed cluster consists of both parts with conserved amino acids and parts with conserved 3D coordinates. The number of domain clusters of each type is listed in Table II.

TABLE II. THE NUMBER OF DOMAIN CLUSTERS IN EACH TYPE
Class type                            Number of domain clusters
sequential and structural conserved   1051
structural conserved                  28
sequential conserved                  784
mixed type                            974

We picked some domain clusters of each type to evaluate the scoring scheme of our alignment algorithm. In each domain cluster, we selected a query and aligned it to the whole database.

A. Sequence and Structure Conserved Domain

Domain clusters of this type are the most conserved ones, and our scoring scheme can reflect these conserved features. To evaluate the alignment significance, we compared our profile-based alignment with pure Smith-Waterman sequence alignment and with a sequence-based profile alignment algorithm. The score matrix in the Smith-Waterman algorithm is BLOSUM62. The gap opening and gap extension penalties are 2 and 2, respectively. The sequence-based profile alignment is an ordinary profile-based alignment: it builds the profile by counting the occurrences of each amino acid type and using the count as the weight of that type.

Using these 3 alignment methods, we compared the queries (1,051 datasets in total) with the entire database. Table III shows the number of hits and false results for each method. The table shows that the profile-based methods improve the alignment significance. With the consensus sequences aligned by the Smith-Waterman algorithm, we obtained only 788 hits (75%) for this cluster type, which leaves a high false rate (25%). The sequence-based profile brings sequence information into the profile and the scoring scheme and raises the hit rate to 88%, but 122 false hits remain. Our profile contains not only the sequence information but also the structure information, and raises the hit rate to 91%.

TABLE III. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                             hits        false
Smith-Waterman                     788 (75%)   263 (25%)
Sequence-based profile alignment   929 (88%)   122 (12%)
Combined profile alignment         952 (91%)   99 (9%)

[Fig. 2. (A) A segment of the multiple structure alignment in the query's cluster. (B) The relevant structure superposition. (C) The alignment between the query and another cluster by the Smith-Waterman algorithm.]

[Fig. 3. Distribution of alignment scores for comparing a query from IPR00029 with the whole database. Axes: Score, Number of Entries.]
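The scoring scheme compared in this section, that is, Smith-Waterman local alignment of a query against a profile with the position-specific gap penalty gap(l) = gap * [gap_open + gap_ext * l] of Section III, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the gap weight of the profile row where a gap ends is applied to the whole gap, the general-gap recurrence is used for clarity instead of a faster affine-gap formulation, and the profile and penalty values are illustrative only.

```python
def gap_penalty(gap_weight, length, gap_open, gap_ext):
    """gap(l) = gap * (gap_open + gap_ext * l); `gap_weight` plays the
    role of the value stored in the last column of the profile."""
    return gap_weight * (gap_open + gap_ext * length)

def profile_local_score(profile, gap_weights, query, gap_open, gap_ext):
    """Best local alignment score of `query` against a profile.

    profile[i] maps an amino acid to its score M(i, a);
    gap_weights[i] is the gap weight of profile row i.
    """
    m, n = len(profile), len(query)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        gw = gap_weights[i - 1]
        for j in range(1, n + 1):
            # Match or mismatch of profile row i with query residue j.
            score = H[i - 1][j - 1] + profile[i - 1].get(query[j - 1], 0.0)
            # Gap that skips k profile rows.
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] - gap_penalty(gw, k, gap_open, gap_ext))
            # Gap that skips k query residues.
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] - gap_penalty(gw, k, gap_open, gap_ext))
            H[i][j] = max(0.0, score)  # local alignment: never below zero
            best = max(best, H[i][j])
    return best

# Toy three-row profile favouring A, C, D (illustrative scores only).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
toy_profile = [{aa: (4.0 if aa == c else -1.0) for aa in ALPHABET} for c in "ACD"]

exact = profile_local_score(toy_profile, [1, 1, 1], "GACDG", gap_open=2, gap_ext=1)
cheap_gap = profile_local_score(toy_profile, [1, 1, 1], "AD", gap_open=2, gap_ext=1)
costly_gap = profile_local_score(toy_profile, [1, 10, 1], "AD", gap_open=2, gap_ext=1)
```

With a uniform gap weight the query "AD" is aligned across the skipped middle row, while a high gap weight on that row makes the gap too expensive and the best score falls back to a single match. Ranking a query against a database then amounts to computing this score for the profile of every domain cluster and taking the highest one.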

Fig. 2 and Fig. 3 give another example showing that our method is more sensitive than the other two methods. Here, we chose a query, labeled 2dln_248_276, from the domain cluster IPR00029. Fig. 2A and 2B show a segment of the multiple structure alignment and the relevant structure superposition in this cluster. The domains are very similar at the structure level but differ somewhat at the sequence level. We also note that a domain in another cluster has a segment that is identical in sequence to the query, as shown in Fig. 2C. As a result, both Smith-Waterman and the sequence-based profile alignment assigned the query to that other cluster. Our profile-based method, however, can distinguish the query from other clusters. Fig. 3 shows the alignment scores obtained by comparing the query with the whole database. In this figure, the highest score is 60, which is the alignment score between the query and the profile of cluster IPR00029, the right domain cluster. Using our profile therefore significantly improves the alignment sensitivity.

B. Structure Conserved Domain

Domains of this type are conserved only in structure. The structures in one domain cluster share the same topology, but the amino acid type at some positions is variable. In biology, amino acids can mutate while structure and function stay the same. This phenomenon is difficult to handle with sequence alignment schemes such as local, global or sequence-based profile alignment.

TABLE IV. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                             hits       false
Smith-Waterman                     20 (71%)   8 (29%)
Sequence-based profile alignment   23 (82%)   5 (18%)
Combined profile alignment         27 (96%)   1 (4%)

For this cluster type, we tested the 3 alignment methods; the hits and false results are shown in Table IV. Although there are only 28 domain clusters of this type, the results still show that our profile-based method improves the hit rate. Because the profile reflects the characteristics of a family, the sequence-based profile method improves the hit rate a little. The combined profile method improves the results further by using the structure information.

We chose a query, labeled blba_75, from the domain cluster IPR00064. Fig. 4A shows the structure superposition in this domain cluster. Since the amino acid type at some positions is variable, both the Smith-Waterman and the sequence-based profile alignment methods gave the highest score to the consensus of cluster IPR0024. Although some segments of the alignment matched well, as shown in Fig. 4B, the result was wrong. Our profile-based alignment method, however, gives the highest score to the right domain cluster, as shown in Fig. 5. The highest score is 9, which is the alignment score between the query and the profile of cluster IPR00064.

[Fig. 4. (A) The structure superposition in the query's cluster. (B) The alignment between the query and the consensus of cluster IPR0024.]

[Fig. 5. The distribution of alignment scores for comparing the query from cluster IPR00064 with the whole database. Axes: Score, Number of Entries.]

C. Sequential Conserved Domain

TABLE V. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                             hits        false
Smith-Waterman                     665 (85%)   119 (15%)
Sequence-based profile alignment   735 (94%)   49 (6%)
Combined profile alignment         745 (95%)   39 (5%)

There are 784 domain clusters of this type in our database. Table V compares the results of the 3 alignment methods. Here we selected a query from domain cluster IPR00356. Fig. 6A shows that there are some variable regions in these domains, and Fig. 6B shows that the sequences are more conserved. Fig. 6C shows that the highest alignment score is 57, between the query and the profile from cluster IPR00356, whereas the other two methods assigned the query to a different cluster. Our scheme therefore again improves the alignment results and sensitivity.

[Fig. 6. (A) The structure superposition and (B) the multiple structure alignment in the query's domain cluster. (C) The distribution of alignment scores for comparing the query from cluster IPR00356 with the whole database. Axes: Score, Number of Entries.]

D. Mixed Type

Table VI compares the results of the three methods on the mixed-type datasets. Because domain clusters of this kind contain some regions that are variable at the sequence level, the Smith-Waterman algorithm behaves much worse than the others. Our profile-based method improves the accuracy by combining the sequence and structure information, although the hit rate improves by only one percent over the sequence-based profile alignment. We randomly selected a query from a cluster of this type. Fig. 7 shows the distribution of alignment scores obtained by comparing the query with the whole database. The highest score indicates the cluster to which the query belongs. This figure shows again that our method is more sensitive than the other 2 methods.

TABLE VI. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
[Table values not recoverable. Rows: Smith-Waterman; Sequence-based profile alignment; Combined profile alignment. Columns: hits, false.]

[Fig. 7. The distribution of alignment scores for comparing the query with the whole database. Axes: Score, Number of Entries.]

V. CONCLUSION AND FUTURE WORKS

In this paper we proposed a profile-based alignment algorithm for use with our domain-based template database.
The statistical analysis shows that most of the domain clusters in our database are conserved at both the structural and the sequence level, so each element of our profile combines the structural clustering information with the sequence information. With this profile, we developed a profile-based query-template alignment method. To validate whether our method is more accurate and more sensitive than other query-template alignment methods, we divided our database into four types based on sequence and structure conservation, and ran experiments on each type. The results for each type show that our profile accurately describes the features of the domain cluster, and that our profile-based method aligns the query to the right template with a low false rate. They also show that our method is more sensitive than the other query-template alignment methods.

As described above, our final goal is protein structure prediction. How to use our domain-based template database and our profile-based query-template alignment method to improve protein structure prediction will be investigated in our future work.

ACKNOWLEDGMENT

This work was supported by a project and a key project of the National Natural Science Foundation of China.

REFERENCES

[1] Needleman S, Wunsch C, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 1970, vol. 48.
[2] Smith T, Waterman M, Identification of common molecular subsequences, J Mol Biol, 1981, vol. 147.
[3] J. Thompson, D. Higgins, and T. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res, 1994, vol. 22.
[4] Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou and Burkhard Morgenstern, Fast and sensitive multiple alignment of large genomic sequences, Bioinformatics, 2003, vol. 4.
[5] C. Notredame, D. Higgins, J. Heringa, T-Coffee: A novel method for multiple sequence alignments, J Mol Biol, 2000, vol. 302.
[6] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., et al., Gapped BLAST and PSI-BLAST: A new generation of database programs, Nucleic Acids Res, 1997, vol. 25.
[7] SR Eddy, Profile hidden Markov models, Bioinformatics, 1998, vol. 14.
[8] Fa Zhang, Jingchun Chen, Zhiyong Liu and Bo Yuan, The construction of Structural Templates for the Modeling of Conserved Protein Domains, International Conference on Bioinformatics and its Applications (ICBA 04), Fort Lauderdale, Florida, USA.
[9] Mulder, N.J., et al., InterPro, progress and status in 2005, Nucleic Acids Res, (Database issue).
[10] Berman, H.M., et al., The Protein Data Bank, Acta Crystallogr D Biol Crystallogr, (Pt 6 No 1).
[11] Zdobnov, E.M. and R. Apweiler, InterProScan--an integration platform for the signature-recognition methods in InterPro, Bioinformatics, 2001, vol. 17(9).
[12] Bateman, A., et al., The Pfam protein families database, Nucleic Acids Res, (Database issue), p. D138-41.
[13] Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, (4).
[14] Letunic, I., et al., SMART 4.0: towards genomic data integration, Nucleic Acids Res, (Database issue).
[15] Haft, D.H., J.D. Selengut, and O. White, The TIGRFAMs database of protein families, Nucleic Acids Res, 2003, vol. 31(1).
[16] Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices, J Mol Biol.
[17] Shindyalov IN, Bourne PE, Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Engineering, 1998, vol. 11(9).
[18] Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol, 1990, vol. 183.
[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Basic Local Alignment Search Tool, J. Mol. Biol.
[20] Gribskov, M., McLachlan, A.D., and Eisenberg, D., Profile analysis: Detection of distantly related proteins, Proc. Natl. Acad. Sci., 1987, vol. 84.


More information

Probalign: Multiple sequence alignment using partition function posterior probabilities

Probalign: Multiple sequence alignment using partition function posterior probabilities Sequence Analysis Probalign: Multiple sequence alignment using partition function posterior probabilities Usman Roshan 1* and Dennis R. Livesay 2 1 Department of Computer Science, New Jersey Institute

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS Aslı Filiz 1, Eser Aygün 2, Özlem Keskin 3 and Zehra Cataltepe 2 1 Informatics Institute and 2 Computer Engineering Department,

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Functional diversity within protein superfamilies

Functional diversity within protein superfamilies Functional diversity within protein superfamilies James Casbon and Mansoor Saqi * Bioinformatics Group, The Genome Centre, Barts and The London, Queen Mary s School of Medicine and Dentistry, Charterhouse

More information

The use of evolutionary information in protein alignments and homology identification.

The use of evolutionary information in protein alignments and homology identification. The use of evolutionary information in protein alignments and homology identification. TOMAS OHLSON Stockholm Bioinformatics Center Department of Biochemistry and Biophysics Stockholm University Sweden

More information

The typical end scenario for those who try to predict protein

The typical end scenario for those who try to predict protein A method for evaluating the structural quality of protein models by using higher-order pairs scoring Gregory E. Sims and Sung-Hou Kim Berkeley Structural Genomics Center, Lawrence Berkeley National Laboratory,

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

proteins Refinement by shifting secondary structure elements improves sequence alignments

proteins Refinement by shifting secondary structure elements improves sequence alignments proteins STRUCTURE O FUNCTION O BIOINFORMATICS Refinement by shifting secondary structure elements improves sequence alignments Jing Tong, 1,2 Jimin Pei, 3 Zbyszek Otwinowski, 1,2 and Nick V. Grishin 1,2,3

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Prof. Dr. M. A. Mottalib, Md. Rahat Hossain Department of Computer Science and Information Technology

More information

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Subfamily HMMS in Functional Genomics D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Pacific Symposium on Biocomputing 10:322-333(2005) SUBFAMILY HMMS IN FUNCTIONAL GENOMICS DUNCAN

More information

Protein structure analysis. Risto Laakso 10th January 2005

Protein structure analysis. Risto Laakso 10th January 2005 Protein structure analysis Risto Laakso risto.laakso@hut.fi 10th January 2005 1 1 Summary Various methods of protein structure analysis were examined. Two proteins, 1HLB (Sea cucumber hemoglobin) and 1HLM

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

The molecular functions of a protein can be inferred from

The molecular functions of a protein can be inferred from Global mapping of the protein structure space and application in structure-based inference of protein function Jingtong Hou*, Se-Ran Jun, Chao Zhang, and Sung-Hou Kim* Department of Chemistry and *Graduate

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

ProtoNet 4.0: A hierarchical classification of one million protein sequences

ProtoNet 4.0: A hierarchical classification of one million protein sequences ProtoNet 4.0: A hierarchical classification of one million protein sequences Noam Kaplan 1*, Ori Sasson 2, Uri Inbar 2, Moriah Friedlich 2, Menachem Fromer 2, Hillel Fleischer 2, Elon Portugaly 2, Nathan

More information

Generalized Affine Gap Costs for Protein Sequence Alignment

Generalized Affine Gap Costs for Protein Sequence Alignment PROTEINS: Structure, Function, and Genetics 32:88 96 (1998) Generalized Affine Gap Costs for Protein Sequence Alignment Stephen F. Altschul* National Center for Biotechnology Information, National Library

More information

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species Paulo Bandiera-Paiva 1 and Marcelo R.S. Briones 2 1 Departmento de Informática em Saúde

More information

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties Lecture 1, 31/10/2001: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties 1 Computational sequence-analysis The major goal of computational

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment Ofer Gill and Bud Mishra Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street,

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Unsupervised Learning in Spectral Genome Analysis

Unsupervised Learning in Spectral Genome Analysis Unsupervised Learning in Spectral Genome Analysis Lutz Hamel 1, Neha Nahar 1, Maria S. Poptsova 2, Olga Zhaxybayeva 3, J. Peter Gogarten 2 1 Department of Computer Sciences and Statistics, University of

More information

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 582 589 July 2008

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

More information

The sequences of naturally occurring proteins are defined by

The sequences of naturally occurring proteins are defined by Protein topology and stability define the space of allowed sequences Patrice Koehl* and Michael Levitt Department of Structural Biology, Fairchild Building, D109, Stanford University, Stanford, CA 94305

More information

Integrating multi-attribute similarity networks for robust representation of the protein space

Integrating multi-attribute similarity networks for robust representation of the protein space BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 13 2006, pages 1585 1592 doi:10.1093/bioinformatics/btl130 Structural bioinformatics Integrating multi-attribute similarity networks for robust representation

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

ALL LECTURES IN SB Introduction

ALL LECTURES IN SB Introduction 1. Introduction 2. Molecular Architecture I 3. Molecular Architecture II 4. Molecular Simulation I 5. Molecular Simulation II 6. Bioinformatics I 7. Bioinformatics II 8. Prediction I 9. Prediction II ALL

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

Example questions. Z:\summer_10_teaching\bioinfo\Beispiel_frage_bioinformatik.doc [1 / 5]

Example questions. Z:\summer_10_teaching\bioinfo\Beispiel_frage_bioinformatik.doc [1 / 5] Example questions for Bioinformatics, first semester half Sommersemester 00 ote The schriftliche Klausur wurde auf deutsch geschrieben The questions will be based on material from the Übungen and the Lectures.

More information

Topics in Computational Biology and Genomics

Topics in Computational Biology and Genomics Topics in Computational Biology and Genomics {MCB, PMB, BioE}{c146, c246} University of California, Berkeley Spring 2005 Instruction and discussion of topics in genomics and computational biology. Working

More information

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 2 Amino Acid Structures from Klug & Cummings

More information

Administration. ndrew Torda April /04/2008 [ 1 ]

Administration. ndrew Torda April /04/2008 [ 1 ] ndrew Torda April 2008 Administration 22/04/2008 [ 1 ] Sprache? zu verhandeln (Englisch, Hochdeutsch, Bayerisch) Selection of topics Proteins / DNA / RNA Two halves to course week 1-7 Prof Torda (larger

More information

IT og Sundhed 2010/11

IT og Sundhed 2010/11 IT og Sundhed 2010/11 Sequence based predictors. Secondary structure and surface accessibility Bent Petersen 13 January 2011 1 NetSurfP Real Value Solvent Accessibility predictions with amino acid associated

More information

SimShift: Identifying structural similarities from NMR chemical shifts

SimShift: Identifying structural similarities from NMR chemical shifts BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 4 2006, pages 460 465 doi:10.1093/bioinformatics/bti805 Structural bioinformatics SimShift: Identifying structural similarities from NMR chemical shifts Simon

More information

Ant Colony Approach to Predict Amino Acid Interaction Networks

Ant Colony Approach to Predict Amino Acid Interaction Networks Ant Colony Approach to Predict Amino Acid Interaction Networks Omar Gaci, Stefan Balev To cite this version: Omar Gaci, Stefan Balev. Ant Colony Approach to Predict Amino Acid Interaction Networks. IEEE

More information

Lecture 4: September 19

Lecture 4: September 19 CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Structure to Function. Molecular Bioinformatics, X3, 2006

Structure to Function. Molecular Bioinformatics, X3, 2006 Structure to Function Molecular Bioinformatics, X3, 2006 Structural GeNOMICS Structural Genomics project aims at determination of 3D structures of all proteins: - organize known proteins into families

More information

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013 EBI web resources II: Ensembl and InterPro Yanbin Yin Spring 2013 1 Outline Intro to genome annotation Protein family/domain databases InterPro, Pfam, Superfamily etc. Genome browser Ensembl Hands on Practice

More information

An Artificial Neural Network Classifier for the Prediction of Protein Structural Classes

An Artificial Neural Network Classifier for the Prediction of Protein Structural Classes International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2017 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article An Artificial

More information

A conserved P-loop anchor limits the structural dynamics that mediate. nucleotide dissociation in EF-Tu.

A conserved P-loop anchor limits the structural dynamics that mediate. nucleotide dissociation in EF-Tu. Supplemental Material for A conserved P-loop anchor limits the structural dynamics that mediate nucleotide dissociation in EF-Tu. Evan Mercier 1,2, Dylan Girodat 1, and Hans-Joachim Wieden 1 * 1 Alberta

More information

Temporal Multi-View Inconsistency Detection for Network Traffic Analysis

Temporal Multi-View Inconsistency Detection for Network Traffic Analysis WWW 15 Florence, Italy Temporal Multi-View Inconsistency Detection for Network Traffic Analysis Houping Xiao 1, Jing Gao 1, Deepak Turaga 2, Long Vu 2, and Alain Biem 2 1 Department of Computer Science

More information

Identification of correct regions in protein models using structural, alignment, and consensus information

Identification of correct regions in protein models using structural, alignment, and consensus information Identification of correct regions in protein models using structural, alignment, and consensus information BJO RN WALLNER AND ARNE ELOFSSON Stockholm Bioinformatics Center, Stockholm University, SE-106

More information

Homology and Information Gathering and Domain Annotation for Proteins

Homology and Information Gathering and Domain Annotation for Proteins Homology and Information Gathering and Domain Annotation for Proteins Outline Homology Information Gathering for Proteins Domain Annotation for Proteins Examples and exercises The concept of homology The

More information

An ant colony algorithm for multiple sequence alignment in bioinformatics

An ant colony algorithm for multiple sequence alignment in bioinformatics An ant colony algorithm for multiple sequence alignment in bioinformatics Jonathan Moss and Colin G. Johnson Computing Laboratory University of Kent at Canterbury Canterbury, Kent, CT2 7NF, England. C.G.Johnson@ukc.ac.uk

More information

Sequence-specific sequence comparison using pairwise statistical significance

Sequence-specific sequence comparison using pairwise statistical significance Graduate Theses and Dissertations Graduate College 2009 Sequence-specific sequence comparison using pairwise statistical significance Ankit Agrawal Iowa State University Follow this and additional works

More information

Protein Secondary Structure Prediction using Feed-Forward Neural Network

Protein Secondary Structure Prediction using Feed-Forward Neural Network COPYRIGHT 2010 JCIT, ISSN 2078-5828 (PRINT), ISSN 2218-5224 (ONLINE), VOLUME 01, ISSUE 01, MANUSCRIPT CODE: 100713 Protein Secondary Structure Prediction using Feed-Forward Neural Network M. A. Mottalib,

More information

Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases

Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases Rutgers Computer Science Technical Report RU-DCS-TR634 May 2008 Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases by Pai-Hsi Huang, Pavel Kuksa, Vladimir

More information

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature17991 Supplementary Discussion Structural comparison with E. coli EmrE The DMT superfamily includes a wide variety of transporters with 4-10 TM segments 1. Since the subfamilies of the

More information

HMMs and biological sequence analysis

HMMs and biological sequence analysis HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Template-Based 3D Structure Prediction

Template-Based 3D Structure Prediction Template-Based 3D Structure Prediction Sequence and Structure-based Template Detection and Alignment Issues The rate of new sequences is growing exponentially relative to the rate of protein structures

More information

G4120: Introduction to Computational Biology

G4120: Introduction to Computational Biology ICB Fall 2003 G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2003 Oliver Jovanovic, All Rights Reserved. Bioinformatics and

More information

Homology modeling of Ferredoxin-nitrite reductase from Arabidopsis thaliana

Homology modeling of Ferredoxin-nitrite reductase from Arabidopsis thaliana www.bioinformation.net Hypothesis Volume 6(3) Homology modeling of Ferredoxin-nitrite reductase from Arabidopsis thaliana Karim Kherraz*, Khaled Kherraz, Abdelkrim Kameli Biology department, Ecole Normale

More information

Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.

More information

Improving Protein 3D Structure Prediction Accuracy using Dense Regions Areas of Secondary Structures in the Contact Map

Improving Protein 3D Structure Prediction Accuracy using Dense Regions Areas of Secondary Structures in the Contact Map American Journal of Biochemistry and Biotechnology 4 (4): 375-384, 8 ISSN 553-3468 8 Science Publications Improving Protein 3D Structure Prediction Accuracy using Dense Regions Areas of Secondary Structures

More information

Reconstruction of Protein Backbone with the α-carbon Coordinates *

Reconstruction of Protein Backbone with the α-carbon Coordinates * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 26, 1107-1119 (2010) Reconstruction of Protein Backbone with the α-carbon Coordinates * JEN-HUI WANG, CHANG-BIAU YANG + AND CHIOU-TING TSENG Department of

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Multiple Sequence Alignment

Multiple Sequence Alignment Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.! What about more than two? And what for?! A faint similarity between two

More information

Pacific Symposium on Biocomputing 4: (1999)

Pacific Symposium on Biocomputing 4: (1999) OPTIMIZING SMITH-WATERMAN ALIGNMENTS ROLF OLSEN, TERENCE HWA Department of Physics, University of California at San Diego La Jolla, CA 92093-0319 email: rolf@cezanne.ucsd.edu, hwa@ucsd.edu MICHAEL L ASSIG

More information

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy Design of a Novel Globular Protein Fold with Atomic-Level Accuracy Brian Kuhlman, Gautam Dantas, Gregory C. Ireton, Gabriele Varani, Barry L. Stoddard, David Baker Presented by Kate Stafford 4 May 05 Protein

More information

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications, Cvicek et al. Supporting Text 1 Here we compare the GRoSS alignment

More information

Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks

Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks Derong Liu a,b,c,, Xiaoxu Xiong a, Zeng-Guang Hou b, Bhaskar DasGupta c a Department of

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

P rotein structure alignment is a fundamental problem in computational structure biology and has been

P rotein structure alignment is a fundamental problem in computational structure biology and has been SUBJECT AREAS: PROTEIN ANALYSIS SOFTWARE STRUCTURAL BIOLOGY BIOINFORMATICS Received 15 November 2012 Accepted 25 February 2013 Published 14 March 2013 Correspondence and requests for materials should be

More information

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

More information

ChemAlign: Biologically Relevant Multiple Sequence Alignment Using Physicochemical Properties

ChemAlign: Biologically Relevant Multiple Sequence Alignment Using Physicochemical Properties Brigham Young University BYU ScholarsArchive All Faculty Publications 2009-11-01 ChemAlign: Biologically Relevant Multiple Sequence Alignment Using Physicochemical Properties Hyrum Carroll hyrumcarroll@gmail.com

More information

Transductive learning with EM algorithm to classify proteins based on phylogenetic profiles

Transductive learning with EM algorithm to classify proteins based on phylogenetic profiles Int. J. Data Mining and Bioinformatics, Vol. 1, No. 4, 2007 337 Transductive learning with EM algorithm to classify proteins based on phylogenetic profiles Roger A. Craig and Li Liao* Department of Computer

More information

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster. NCBI BLAST Services DELTA-BLAST BLAST (http://blast.ncbi.nlm.nih.gov/), Basic Local Alignment Search tool, is a suite of programs for finding similarities between biological sequences. DELTA-BLAST is a

More information

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids Database searches 1 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids 2 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids (cntd) 3 DNA and protein databases SWISS-PROT

More information

Biological Systems: Open Access

Biological Systems: Open Access Biological Systems: Open Access Biological Systems: Open Access Liu and Zheng, 2016, 5:1 http://dx.doi.org/10.4172/2329-6577.1000153 ISSN: 2329-6577 Research Article ariant Maps to Identify Coding and

More information