A profile-based protein sequence alignment algorithm for a domain clustering database

Lin Xu, Fa Zhang and Zhiyong Liu
Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences; National Natural Science Foundation of China

Abstract - To address two main shortcomings of homology modeling, we have designed and built a domain clustering database. Searching this database is a fundamental operation. However, current alignment algorithms are based mainly on sequence and ignore the structural conservation within a domain. This paper proposes a profile-based alignment that incorporates structure information into the profile, exploiting the characteristics of our domain database. We evaluated the method in an experiment over the whole database. The results show that both the quality and the sensitivity of our scheme are better than those of the pure Smith-Waterman algorithm and of sequence-based profile algorithms. We believe that this work can help improve protein structure prediction.

I. INTRODUCTION

Sequence alignment is a fundamental tool in computational biology and bioinformatics. It reveals, for example, which genes share a function, which RNAs belong to the same class, and which proteins share a structural topology. In protein structure prediction, obtaining the alignment between a protein sequence of unknown structure (the query) and its homologs of known structure (the templates) is the most fundamental step of the modeling process, and the quality of this alignment strongly affects the prediction result. Broadly, there are three categories of methods for creating an alignment: single-sequence based, multiple sequence alignment, and profile based.
Single-sequence based methods use a standard dynamic programming algorithm to generate the alignment, for example the Needleman-Wunsch algorithm [1] and the Smith-Waterman algorithm [2]. Since these methods use only the sequence information, the quality of the alignment drops sharply when the sequence identity is below 30%.

Multiple sequence alignment methods align three or more sequences simultaneously. Since this is an NP-hard computational problem, most methods use a heuristic algorithm, such as ClustalW [3], DIALIGN [4] and T-COFFEE [5]. Alignment quality and computational cost are the two critical problems of this kind of method.

The development of profile-based methods accelerated greatly with the PSI-BLAST program of Altschul et al. [6]. These methods improve alignment quality by using a profile to describe the characteristics of a set of similar sequences, and by aligning a sequence or a profile with another profile. Because the profile records the most relevant information from the multiple sequence alignment, the quality of this kind of method is better than that of the others. Several groups have published profile-based alignment methods, such as PSI-BLAST [6] and HMMER [7]. Most profile-based methods use the standard Smith-Waterman local alignment method, but they differ significantly in a number of important respects, such as scoring functions, gap penalties, weighting schemes, and whether a secondary-structure substitution matrix is added.

Although these methods use different information and different methodologies, accurate alignment remains a major challenge, especially when the sequence similarity falls into the twilight zone (<=25% sequence identity). This is largely because it is often very difficult to obtain a scoring matrix that correctly accounts not only for mutations but, more importantly, for the insertions and deletions that occur during evolution. It is generally accepted that a structural alignment based only on three-dimensional coordinates accurately represents the corresponding residues as well as the boundaries and sites of any gaps.

As stated above, sequence alignment is a critical factor in protein structure prediction. Existing modeling techniques still use sequence alignment to select structural templates; however, these techniques suffer from two shortcomings that limit the practical applicability of comparative modeling:
1. Since the number of known structures is much smaller than the number of known sequences, the lack of templates is a serious problem.
2. The quality of the query-template alignment drops sharply when the sequence identity falls below 30%.

To address the first problem, we designed and constructed a domain-based template database in our previous study. For each template in this database, we superimposed the corresponding structures and provided a multiple structure alignment based only on the three-dimensional information of the structures involved. The details are described in [8]. To address the second problem, we present a method to extract a profile from the multiple structure alignment of each template in our database, and we develop a query-template alignment method that uses this profile. Preliminary experimental results show that our profile-based alignment method significantly improves the accuracy of structural template selection.

The paper is organized as follows. The next section briefly reviews the construction of the domain-based template database. The subsequent section describes the profile-based alignment using our domain-based templates. We then present and discuss experimental results on several datasets. Finally, we present the main conclusions of the paper and discuss future work.

II. CONSTRUCTION OF THE DOMAIN-BASED TEMPLATE DATABASE

Since proteins evolve with their structural and functional domains as independent units, proteins and their structures can largely be described as combinations of conserved protein domains. This motivates us to construct a domain-based template database in order to increase the likelihood of finding widely applicable structure templates.

We first searched for all InterPro [9] domains in the PDB [10] using the iprscan program [11], and mapped them to the corresponding protein structures in the PDB. All PDB protein sequences in this project were parsed directly from the structural records reorganized by the MSD database. From the protein family and superfamily databases, such as Pfam [12], SCOP [13], SMART [14] and TIGRFAMs [15], we then used the program HMMER [7] to obtain the consensus sequence for each InterPro domain. Next, we partitioned the structural correspondences from the PDB for each InterPro domain and constructed a primary domain cluster. For each domain cluster, we compared all the domain sequences involved with the relevant consensus sequence on the basis of sequence similarity, and then refined the domain cluster by removing every structure whose sequence identity or structure similarity is below a pre-defined threshold. Since all the domains in each cluster are conserved in both sequence and structure, we adopt the conserved structures as the template for the relevant domain cluster. The domain-based template database can be accessed through a website.

Fig. 1. Two typical structural ensembles (A, B) of conserved domain clusters.

Two typical structure ensembles are illustrated in Fig. 1. Ensemble A shows the structure ensemble of domain cluster IPR00008, which includes 8 individual structures. All the RMSD values of the structures involved are less than Å. Ensemble B shows the structure ensemble of the PDZ domain cluster (IPR00478), which includes 4 structures. All the RMSD values of the structures involved are less than 3 Å.

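
The refinement step described in this section (dropping cluster members whose identity to the consensus, or whose structural similarity, falls below a threshold) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the input layout, and the cutoffs of 30% identity and 3 Å RMSD are hypothetical, since the text says only that the thresholds are pre-defined.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def refine_cluster(consensus, members, min_identity=30.0, max_rmsd=3.0):
    """Keep members that pass both the sequence and the structure filter.

    members: list of (name, aligned_sequence, rmsd_to_reference) tuples.
    Both cutoffs are hypothetical placeholders, not values from the paper.
    """
    kept = []
    for name, aligned_seq, rmsd in members:
        similar_sequence = percent_identity(consensus, aligned_seq) >= min_identity
        similar_structure = rmsd <= max_rmsd
        if similar_sequence and similar_structure:
            kept.append(name)
    return kept
```

Gapped columns are skipped when computing identity so that a short insertion does not penalize an otherwise identical member; other conventions (for example, dividing by the full alignment length) are equally plausible.
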
To highlight the conserved structural regions in each domain cluster, we superimposed these conserved structures using the Dali [16] or CE [17] algorithm. Based on the conserved structural ensemble of each domain cluster, we also generated a multiple structural alignment for the cluster, purely from the backbone coordinates of the residues. Since this structural alignment is independent of sequence similarity, it provides more sensitive and position-specific signatures than a sequence alignment. The details of the database construction are described in [8].

III. PROFILE-BASED ALIGNMENT USING THE DOMAIN-BASED TEMPLATE DATABASE

An accurate and complete query-template alignment is critical in comparative modeling. Most current modeling techniques generate the query-template alignment from sequence information alone and ignore structure information. Pairwise alignment algorithms such as Smith-Waterman, FASTA [18] and BLAST [19] cannot capture the full joint information content of a group of sequences, even when the consensus sequence of the multiple alignment is used as the query. Since Gribskov first introduced the idea of using profiles to search a database [20], sequence alignment profiles have proved very powerful for creating accurate sequence alignments. In recent years, many strategies, including the use of structural information and surface accessibility, have been proposed for determining the profile or the position-specific gap penalties.

Since our template database provides a multiple structural alignment and a superimposed structure ensemble for each domain cluster (template), we believe that more reliable and sensitive profiles can be built by using both the multiple structure alignment and the relevant structure information. In order to build a profile for each domain, we analyzed the sequence and structure characteristics of the database statistically, on the premise that most domains in the database are conserved in both sequence and structure.
Then a profile for each domain cluster can be built from the multiple structure alignment and the relevant structure information. To validate our profile-based alignment, the query sequence is then searched against the database using the position-specific scoring matrix of each domain cluster.

A. Conservation Statistics from Sequence and Structure Information in Domain Database

In order to build the profile for each template from its sequence and structure information, we first analyzed statistically the relationship between residues and structural information in the template database. For each amino acid position in each template, we classified the coordinates and the residue type on the basis of the superimposed structure ensemble. Here, the spatial location of the alpha carbon atom is taken to represent each amino acid. We extracted the alpha carbon atoms from each structure, aligned these positions, and clustered them by calculating the distance between each pair of positions. We used a hierarchical clustering algorithm for this step and set Å as the distance cutoff.

Table I lists statistics for several domain clusters in the database. In the table, column 1 is the domain cluster ID, column 2 is the percentage of coordinate clusters that contain only one amino acid type, and columns 3 to 7 are the percentages of coordinate clusters that contain two to six residue types, respectively. The table thus gives the distribution of the number of amino acid types per coordinate cluster. It shows that, with a cutoff of Å for clustering the positions, the amino acids in one coordinate cluster are most likely to be of a single residue type.

TABLE I. SOME STATISTICS RESULTS FROM THE DATABASE

Fig. 2 shows the same analysis for the whole database: over all domain clusters, 95.66% of the coordinate clusters contain one residue type, 3.63% contain two residue types, 0.49% contain three residue types, and only 0.23% contain four or more residue types. Therefore, an overwhelming majority of the amino acids at one position belong to the same residue type. This conservation relationship between sequence and structure information helps us to build a more accurate profile.

Fig. 2. The distribution of different types in one coordinate cluster.

B. Building Profile from Sequence and Structure Information

Based on the sequence and structure information, we build a profile for each domain cluster. The profile is defined as a position-specific scoring matrix M(p,a) composed of 21 columns and m rows (m = length of the alignment). The first 20 columns of each row specify the scores of the 20 amino acid residues. The additional column contains a penalty for insertions or deletions at that position.

At position p of alignment A (N structures), AA(a) is defined as the class of amino acid type a, SS(i) is the class of alpha carbon coordinates in cluster i (as described in the previous section), and W(p,a) is the weight for the appearance of amino acid a at position p.

For the sequence information, the weight of each amino acid type is determined as follows. Suppose that there are n(a) items in residue class AA(a); then the average weight for class AA(a) is

$$W_1(p,a) = \frac{n(a)}{N}$$

For the structure information, the weight of each class is determined as follows. Suppose that there are n(s_i) items in class SS(i); then the weight is

$$W_2(p,s_i) = \frac{n(s_i)}{N}$$

W(p,a) is then calculated from W_1 and W_2:

$$W(p,a) = \Big[\, W_1(p,a) \cdot \sum_{\text{all } SS(i)} n(a,i) \cdot W_2(p,s_i) \,\Big] \cdot \sigma$$

Here, σ is a normalization factor which ensures that

$$\sum_{a \in \{\text{amino acid types}\}} W(p,a) = 1$$

and n(a,i) is the number of residues of type a that fall in class SS(i). The position-specific scoring matrix M(p,a) is then given by

$$M(p,a) = \sum_{b=1}^{20} W(p,b) \cdot Y(a,b)$$

where Y(a,b) is a scoring matrix, such as BLOSUM62.
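
As a rough illustration of the profile construction in Section III.B, the sketch below computes W(p,a) and one row of M(p,a) for a single alignment column. It is a minimal sketch under assumptions, not the authors' implementation: the function names are hypothetical, n(a,i) is read as the number of residues of type a inside coordinate cluster i, both weights are divided by the number of structures N, and `toy_subst` is a stand-in for Y(a,b) so that the example stays self-contained (a real profile would use BLOSUM62).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_weights(residues, clusters):
    """Combined weight W(p, a) for one alignment column.

    residues[k] is the amino acid of structure k at this position and
    clusters[k] is the label of the C-alpha coordinate cluster it fell into.
    """
    n_total = len(residues)
    n_a = Counter(residues)                  # n(a): residues of each type
    n_s = Counter(clusters)                  # n(s_i): members of each cluster
    n_ai = Counter(zip(residues, clusters))  # n(a, i): type a inside cluster i
    raw = {}
    for a in n_a:
        w1 = n_a[a] / n_total                                   # W_1(p, a)
        structural = sum(n_ai[(a, i)] * (n_s[i] / n_total)      # n(a,i) * W_2
                         for i in n_s)
        raw[a] = w1 * structural
    sigma = 1.0 / sum(raw.values())          # normalizes the weights to sum to 1
    return {a: w * sigma for a, w in raw.items()}


def pssm_row(weights, subst):
    """M(p, a) = sum over b of W(p, b) * Y(a, b), for every amino acid a."""
    return {a: sum(w * subst(a, b) for b, w in weights.items())
            for a in AMINO_ACIDS}


def toy_subst(a, b):
    """Hypothetical stand-in for Y(a, b): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1
```

For a column with residues A, A, A, G in which the three alanines share one coordinate cluster, the structural term amplifies the dominant type: W(A) comes out larger than the plain frequency 0.75, which is the intended effect of mixing the coordinate clustering into the weights.
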
The position-dependent penalties for insertions and deletions in the profile can be set to a high value to prevent insertions at positions where no gaps occur, and to a low value to allow insertions in regions where insertions are observed in the alignment. The penalty gap(L) applied for creating a gap while matching the profile to the query is given by

$$\mathrm{gap}(L) = \mathrm{gap} \times [\, \mathrm{gap\_open} + \mathrm{gap\_ext} \times L \,]$$

in which gap is the penalty given in the last column of the profile, L is the number of residue positions in the gap, and gap_open and gap_ext are the penalties for gap opening and gap extension, respectively.

C. Profile-Based Alignment

Since our profile records both sequence and structure properties, we can use the Smith-Waterman local alignment algorithm with the profile of each domain cluster to find the domain to which the query sequence most likely belongs. The major difference between our profile-based alignment and both the plain dynamic programming algorithm and other profile-based alignment algorithms lies in the scoring scheme. Our profile-based alignment uses not only the sequence information derived from the domain cluster but also the structure information extracted from the superimposed structure ensembles. In the plain dynamic programming algorithm, by contrast, the score is based on a comparison of the amino acids at corresponding positions of two sequences, and other profile-based alignment algorithms mostly use only the sequence information derived from the family sequences.

IV. RESULTS AND DISCUSSIONS

To evaluate the performance of the alignment scheme described in this paper, we tested it on the whole database. In each domain cluster, the reference sequence is the one whose structural distance to the other members is the smallest. We selected the sequence whose structure is the most remote from the reference as the benchmark. The benchmark sequence was then searched against the whole database with our profile-based alignment algorithm.

Using the statistics obtained from the database, we classified the domain clusters into four types: sequence and structure conserved; structure conserved; sequence conserved; and mixed. A position is considered conserved when the number of amino acid types, or of 3D coordinate clusters, at that position of the alignment is less than half of the total number. A sequence and structure conserved domain cluster is one whose amino acid types and 3D coordinates are both conserved; a structure conserved cluster is one in which only the 3D coordinates are conserved; a sequence conserved cluster is one in which only the amino acid types meet the conservation condition; and a mixed cluster consists of both parts with conserved amino acids and parts with conserved 3D coordinates. Table II lists the number of domain clusters of each type.

TABLE II. THE NUMBER OF DOMAIN CLUSTERS IN EACH TYPE
Class type                             Number of domain clusters
sequential and structural conserved    1051
structural conserved                   28
sequential conserved                   784
mixed type                             974

We picked some domain clusters of each type to evaluate the scoring scheme of our alignment algorithm. In each domain cluster, we selected a query and then aligned it against the whole database.

A. Sequence and Structure Conserved Domain

Domain clusters of this type are the most conserved, and our scoring scheme can reflect their conserved features. To evaluate the alignment significance, we compared our profile-based alignment with pure Smith-Waterman sequence alignment and with a sequence-based profile alignment algorithm. The score matrix used in the Smith-Waterman algorithm is BLOSUM62, and the gap open and gap extension penalties are 2 and 2, respectively. The sequence-based profile alignment is a conventional profile-based alignment: it builds the profile by counting the occurrences of each amino acid type and using the count as the weight of that type.

Using these 3 alignment methods, we compared the queries (1,051 datasets in total) with the entire database. Table III shows the number of hits and false results for each method. The table shows that the profile-based methods improve the alignment significance. With the consensus sequences aligned by the Smith-Waterman algorithm, we obtain only 788 (~75%) hits in this cluster type, which is a high false rate. The sequence-based profile brings the sequence information into the profile and the scoring scheme; it raises the hit rate to 88%, but 122 false hits remain. Our profile contains not only the sequence information but also the structure information, and raises the hit rate to 91%.

TABLE III. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits         false
Smith-Waterman                      788 (75%)    263 (25%)
Sequence-based profile alignment    929 (88%)    122 (12%)
Combined profile alignment          952 (91%)    99 (9%)

Fig 2. (A) A segment of the multiple structure alignment in cluster IPR. (B) The relevant structure superposition. (C) The alignment between the query and cluster IPR by the Smith-Waterman algorithm.

Fig 3. Distribution of alignment scores for comparing a query from IPR00029 with the whole database (axes: score, number of entries).
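
The profile-versus-query local alignment of Section III.C, with the position-specific gap penalty gap × [gap_open + gap_ext × L], might be sketched as below. This is a score-only illustration (no traceback) built on several assumptions: the names `profile_local_score`, `pssm` and `gap_col` are hypothetical, the default gap_open and gap_ext values are placeholders, and the multiplier of the current profile row is applied to every gap step touching that row, which is one possible reading of the formula rather than the authors' stated procedure.

```python
NEG_INF = float("-inf")


def profile_local_score(pssm, gap_col, query, gap_open=2.0, gap_ext=2.0):
    """Best Smith-Waterman local score of a query against a profile.

    pssm[i][aa] is the score M(p, a) of amino acid aa at profile row i.
    gap_col[i] is the position-specific gap multiplier (last profile column).
    Affine gaps: a gap of length L at a row with multiplier g costs
    g * (gap_open + gap_ext * L).
    """
    m, n = len(pssm), len(query)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]      # best score ending at (i, j)
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap consuming query residues
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap consuming profile rows
    best = 0.0
    for i in range(1, m + 1):
        g = gap_col[i - 1]
        open_cost = g * (gap_open + gap_ext)  # first gapped position
        ext_cost = g * gap_ext                # each further gapped position
        for j in range(1, n + 1):
            E[i][j] = max(H[i][j - 1] - open_cost, E[i][j - 1] - ext_cost)
            F[i][j] = max(H[i - 1][j] - open_cost, F[i - 1][j] - ext_cost)
            diag = H[i - 1][j - 1] + pssm[i - 1][query[j - 1]]
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best
```

With a toy three-row profile that scores +4 for A, C, D at rows 1 to 3 and -1 otherwise, the query "AD" scores 4.0 when every multiplier is 1.0 and 7.0 when the multiplier of row 2 is lowered to 0.25: a cheap gap at a position where deletions are tolerated lets the alignment bridge the missing residue, which is the behavior the position-specific penalty column is meant to provide.
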

Fig. 2 and Fig. 3 give an example showing that our method is more sensitive than the other two methods. Here, we chose a query, labeled 2dln_248_276, from the domain cluster IPR. Fig. 2A and 2B show a segment of the multiple structure alignment and the relevant structure superposition in the cluster. These domains are very similar at the structure level but differ somewhat at the sequence level. We also note that a domain in cluster IPR has a segment whose sequence is identical to the query, as shown in Fig. 2C. As a result, both Smith-Waterman and the sequence-based profile alignment identified the query as belonging to cluster IPR. Our profile-based method, however, can distinguish the query from other clusters. Fig. 3 shows the alignment scores for comparing the query with the whole database. In this figure, the highest score is 60, which is the alignment score between the query and the profile of cluster IPR00029, the right domain cluster. Using our profile therefore significantly improves the alignment sensitivity.

B. Structure Conserved Domain

Domains of this type are conserved only in structure. The structure topology within one domain cluster takes the same shape, but the amino acid type at some positions is variable. In biology, the amino acid type can mutate while structure and function stay the same. This phenomenon is difficult to handle with sequence alignment schemes, such as local, global or sequence-based profile alignment.

TABLE IV. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits        false
Smith-Waterman                      20 (71%)    8 (29%)
Sequence-based profile alignment    23 (82%)    5 (18%)
Combined profile alignment          27 (96%)    1 (4%)

For this cluster type, we tested the 3 alignment methods; the hits and false results are shown in Table IV. Although there are only 28 domain clusters of this type, the results still show that our profile-based method improves the hit rate. Because the profile reflects the characteristics of a family, the sequence-based profile method improves the hit rate a little. The combined profile method improves the results further by using the structure information.

We chose a query, labeled blba_75, from the domain cluster IPR. Fig. 4A shows the structure superposition in the domain cluster. Since the amino acid type at some positions is variable, both the Smith-Waterman and the sequence-based profile alignment methods gave the highest score to the consensus of cluster IPR0024. Although some segments of the alignment matched well, as shown in Fig. 4B, the result was wrong. Our profile-based alignment method, however, gives the highest score to the profile of the right domain cluster, as shown in Fig. 5. The highest score is 9, which is the alignment score between the query and the profile of cluster IPR.

Fig. 4. (A) The structure superposition in cluster IPR. (B) The alignment between the query and the consensus of cluster IPR0024.

Fig. 5. The distribution of alignment scores for comparing the query from cluster IPR00064 with the whole database (axes: score, number of entries).

C. Sequential Conserved Domain

TABLE V. THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits         false
Smith-Waterman                      665 (85%)    119 (15%)
Sequence-based profile alignment    735 (94%)    49 (6%)
Combined profile alignment          745 (95%)    39 (5%)

There are 784 domain clusters of this type in our database. Table V compares the results of the 3 alignment methods. Here we selected a query from domain cluster IPR. Fig. 6A shows that there are some variable regions in these domains, and Fig. 6B shows that the sequences are more conserved. Fig. 6C shows that the highest alignment score is 57, between the query and the profile from cluster IPR00356, whereas the other two methods implied that the query belongs to the cluster IPR. Our scheme therefore again improves the alignment results and sensitivity.

Fig. 6. (A) The structure superposition and (B) the multiple structure alignment in domain cluster IPR. (C) The distribution of alignment scores for comparing the query from cluster IPR00356 with the whole database (axes: score, number of entries).

D. Mixed Type

Table VI compares the results of the three methods on the mixed type datasets. Because domain clusters of this kind contain some regions that are variable at the sequence level, the Smith-Waterman algorithm behaves much worse than the others. Our profile-based method improves the accuracy by combining the sequence and structure information, although the hit rate improves by only one percent over the sequence-based profile alignment. We randomly selected a query from cluster IPR in this type. Fig. 7 shows the distribution of alignment scores for comparing the query with the whole database. The highest score implies that the query belongs to cluster IPR. This figure shows again that our method is more sensitive than the other 2 methods.

TABLE VI. THE NUMBER OF HITS AND FALSE FOR EACH METHOD (rows: Smith-Waterman, Sequence-based profile alignment, Combined profile alignment; columns: hits, false)

Fig. 7. The distribution of alignment scores for comparing the query from cluster IPR with the whole database (axes: score, number of entries).

V. CONCLUSION AND FUTURE WORKS

In this paper we proposed a profile-based alignment algorithm for use with our domain-based template database.
The statistical analysis shows that most of the domain clusters in our database are conserved at both the structural and the sequence level, so each element of our profile combines the structural clustering information with the sequence information. With this profile, we developed a profile-based query-template alignment method. To test whether our method is more accurate and sensitive than other query-template alignment methods, we divided our database into four types, based on sequence and structure conservation, and ran experiments within each type. The results for each type show that our profile accurately describes the features of the domain cluster and that our profile-based method aligns the query to the right template with a low error rate. They show that our method is more sensitive than the other query-template alignment methods.

As described above, our final goal is protein structure prediction. How to use our domain-based template database and our profile-based query-template alignment method to improve the prediction of protein structure will be investigated in our future work.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China project under and key project under

REFERENCES

[1] Needleman S, Wunsch C, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 1970, vol. 48.
[2] Smith T, Waterman M, Identification of common molecular subsequences, J Mol Biol, 1981, vol. 147.
[3] J. Thompson, D. Higgins, and T. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res, 1994, vol. 22.
[4] Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou and Burkhard Morgenstern, Fast and sensitive multiple alignment of large genomic sequences, BMC Bioinformatics, 2003, vol. 4.
[5] C. Notredame, D. Higgins, J. Heringa, T-Coffee: A novel method for multiple sequence alignments, J Mol Biol, 2000, vol. 302.
[6] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 1997, vol. 25.
[7] SR Eddy, Profile hidden Markov models, Bioinformatics, 1998, vol. 14.
[8] Fa Zhang, Jingchun Chen, Zhiyong Liu and Bo Yuan, The construction of Structural Templates for the Modeling of Conserved Protein Domains, International Conference on Bioinformatics and its Applications (ICBA 04), Fort Lauderdale, Florida, USA.
[9] Mulder, N.J., et al., InterPro, progress and status in Nucleic Acids Res, (Database Issue): p. D
[10] Berman, H.M., et al., The Protein Data Bank, Acta Crystallogr D Biol Crystallogr, (Pt 6 No ).
[11] Zdobnov, E.M. and R. Apweiler, InterProScan - an integration platform for the signature-recognition methods in InterPro, Bioinformatics, 2001, vol. 17(9).
[12] Bateman, A., et al., The Pfam protein families database, Nucleic Acids Res, (Database issue): p. D38-4.
[13] Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, (4).
[14] Letunic, I., et al., SMART 4.0: towards genomic data integration, Nucleic Acids Res, (Database issue): p. D
[15] Haft, D.H., J.D. Selengut, and O. White, The TIGRFAMs database of protein families, Nucleic Acids Res, 2003, vol. 31(1).
[16] Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices, J Mol Biol.
[17] Shindyalov IN, Bourne PE, Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Engineering, 1998, vol. 11(9).
[18] Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol, 1990, vol. 183.
[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Basic Local Alignment Search Tool, J. Mol. Biol.
[20] Gribskov, M., McLachlan, A.D., and Eisenberg, D., Profile analysis: Detection of distantly related proteins, Proc. Natl. Acad. Sci., 1987, vol. 84.


More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

MSAT a Multiple Sequence Alignment tool based on TOPS

MSAT a Multiple Sequence Alignment tool based on TOPS MSAT a Multiple Sequence Alignment tool based on TOPS Te Ren, Mallika Veeramalai, Aik Choon Tan and David Gilbert Bioinformatics Research Centre Department of Computer Science University of Glasgow Glasgow,

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Some Problems from Enzyme Families

Some Problems from Enzyme Families Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb gregb@cs.concordia.ca Abstract I will discuss some problems

More information

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon

2 Dean C. Adams and Gavin J. P. Naylor the best three-dimensional ordination of the structure space is found through an eigen-decomposition (correspon A Comparison of Methods for Assessing the Structural Similarity of Proteins Dean C. Adams and Gavin J. P. Naylor? Dept. Zoology and Genetics, Iowa State University, Ames, IA 50011, U.S.A. 1 Introduction

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity 1 frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity HUZEFA RANGWALA and GEORGE KARYPIS Department of Computer Science and Engineering

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Multiple sequence alignment

Multiple sequence alignment Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs

Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs PROTEINS: Structure, Function, and Genetics 52:446 453 (2003) Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs Bin Qian 1 and Richard A. Goldstein 1,2 * 1 Biophysics Research Division, University

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

SEQUENCE alignment is an underlying application in the

SEQUENCE alignment is an underlying application in the 194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

More information

Chapter 7: Rapid alignment methods: FASTA and BLAST

Chapter 7: Rapid alignment methods: FASTA and BLAST Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem Search strategies FASTA BLAST Introduction to bioinformatics, Autumn 2007 117 BLAST: Basic Local Alignment Search Tool BLAST (Altschul

More information

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics. Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu ----------------------------------------------------------------- Bond

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research

More information

Protein Structure Prediction

Protein Structure Prediction Page 1 Protein Structure Prediction Russ B. Altman BMI 214 CS 274 Protein Folding is different from structure prediction --Folding is concerned with the process of taking the 3D shape, usually based on

More information

Efficient Remote Homology Detection with Secondary Structure

Efficient Remote Homology Detection with Secondary Structure Efficient Remote Homology Detection with Secondary Structure 2 Yuna Hou 1, Wynne Hsu 1, Mong Li Lee 1, and Christopher Bystroff 2 1 School of Computing,National University of Singapore,Singapore 117543

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

Structural Alignment of Proteins

Structural Alignment of Proteins Goal Align protein structures Structural Alignment of Proteins 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Do Aligned Sequences Share the Same Fold?

Do Aligned Sequences Share the Same Fold? J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

More information

Similarity searching summary (2)

Similarity searching summary (2) Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

More information

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing Bioinformatics Proteins II. - Pattern, Profile, & Structure Database Searching Robert Latek, Ph.D. Bioinformatics, Biocomputing WIBR Bioinformatics Course, Whitehead Institute, 2002 1 Proteins I.-III.

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

K-means-based Feature Learning for Protein Sequence Classification

K-means-based Feature Learning for Protein Sequence Classification K-means-based Feature Learning for Protein Sequence Classification Paul Melman and Usman W. Roshan Department of Computer Science, NJIT Newark, NJ, 07102, USA pm462@njit.edu, usman.w.roshan@njit.edu Abstract

More information

Protein sequence alignment with family-specific amino acid similarity matrices

Protein sequence alignment with family-specific amino acid similarity matrices TECHNICAL NOTE Open Access Protein sequence alignment with family-specific amino acid similarity matrices Igor B Kuznetsov Abstract Background: Alignment of amino acid sequences by means of dynamic programming

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. The fully-formatted PDF version will become available shortly after the date of publication, from the

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

EBI web resources II: Ensembl and InterPro

EBI web resources II: Ensembl and InterPro EBI web resources II: Ensembl and InterPro Yanbin Yin http://www.ebi.ac.uk/training/online/course/ 1 Homework 3 Go to http://www.ebi.ac.uk/interpro/training.htmland finish the second online training course

More information

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

Combining pairwise sequence similarity and support vector machines for remote protein homology detection Combining pairwise sequence similarity and support vector machines for remote protein homology detection Li Liao Central Research & Development E. I. du Pont de Nemours Company li.liao@usa.dupont.com William

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

Motif Prediction in Amino Acid Interaction Networks

Motif Prediction in Amino Acid Interaction Networks Motif Prediction in Amino Acid Interaction Networks Omar GACI and Stefan BALEV Abstract In this paper we represent a protein as a graph where the vertices are amino acids and the edges are interactions

More information

Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations

Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations Sequence Analysis and Structure Prediction Service Centro Nacional de Biotecnología CSIC 8-10 May, 2013 Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations Course Notes Instructor:

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Segment-based scores for pairwise and multiple sequence alignments

Segment-based scores for pairwise and multiple sequence alignments From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Segment-based scores for pairwise and multiple sequence alignments Burkhard Morgenstern 1,*, William R. Atchley 2, Klaus

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Combining pairwise sequence similarity and support vector machines for remote protein homology detection

Combining pairwise sequence similarity and support vector machines for remote protein homology detection Combining pairwise sequence similarity and support vector machines for remote protein homology detection Li Liao Central Research & Development E. I. du Pont de Nemours Company li.liao@usa.dupont.com William

More information

PROTEIN CLUSTERING AND CLASSIFICATION

PROTEIN CLUSTERING AND CLASSIFICATION PROTEIN CLUSTERING AND CLASSIFICATION ori Sasson 1 and Michal Linial 2 1The School of Computer Science and Engeeniring and 2 The Life Science Institute, The Hebrew University of Jerusalem, Israel 1. Introduction

More information

Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

More information

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence

Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Number sequence representation of protein structures based on the second derivative of a folded tetrahedron sequence Naoto Morikawa (nmorika@genocript.com) October 7, 2006. Abstract A protein is a sequence

More information

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary Dept. of Electrical Engg. and Computer Science

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

The PRALINE online server: optimising progressive multiple alignment on the web

The PRALINE online server: optimising progressive multiple alignment on the web Computational Biology and Chemistry 27 (2003) 511 519 Software Note The PRALINE online server: optimising progressive multiple alignment on the web V.A. Simossis a,b, J. Heringa a, a Bioinformatics Unit,

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS Aslı Filiz 1, Eser Aygün 2, Özlem Keskin 3 and Zehra Cataltepe 2 1 Informatics Institute and 2 Computer Engineering Department,

More information

proteins Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * INTRODUCTION

proteins Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * INTRODUCTION proteins STRUCTURE O FUNCTION O BIOINFORMATICS Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * 1 Department of Biological Sciences, College

More information

Goals. Structural Analysis of the EGR Family of Transcription Factors: Templates for Predicting Protein DNA Interactions

Goals. Structural Analysis of the EGR Family of Transcription Factors: Templates for Predicting Protein DNA Interactions Structural Analysis of the EGR Family of Transcription Factors: Templates for Predicting Protein DNA Interactions Jamie Duke 1,2 and Carlos Camacho 3 1 Bioengineering and Bioinformatics Summer Institute,

More information

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm017

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm017 Vol. 23 no. 7 2007, pages 802 808 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm017 Sequence analysis PROMALS: towards accurate multiple sequence alignments of distantly related proteins

More information

Sequence Analysis and Databases 2: Sequences and Multiple Alignments

Sequence Analysis and Databases 2: Sequences and Multiple Alignments 1 Sequence Analysis and Databases 2: Sequences and Multiple Alignments Jose María González-Izarzugaza Martínez CNIO Spanish National Cancer Research Centre (jmgonzalez@cnio.es) 2 Sequence Comparisons:

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information