Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis


Kumud Joseph Kujur, Sumit Pal Singh, O.P. Vyas, Ruchir Bhatia, Varun Singh*
Indian Institute of Information Technology - Allahabad, India
* rickky27@rediffmail.com

ABSTRACT

Various techniques and algorithms exist for pairwise sequence alignment, multiple sequence alignment and entropy analysis. This paper reports the study and implementation of these techniques for DNA-to-protein translation, sequence alignment and comparison, multiple sequence alignment, and entropy analysis.

DNA-to-protein translation is performed by detecting the open reading frame (ORF) of an input DNA coding sequence (CDS). The sequence is converted into amino acids three nucleotides (one codon) at a time; each codon specifies one amino acid. Three frames are obtained from the sequence by shifting the starting position by zero, one and two nucleotides, and three more frames are obtained in the same way from the complementary sequence. The codons are checked for start and stop codons, which mark the possible proteins; the longest of these gives the final protein.

In pairwise sequence alignment and comparison, the input query sequence, a primary protein sequence, is compared with the subject sequences in the database. Local and global alignment are the two techniques used. The Smith-Waterman algorithm was implemented for local alignment and the Needleman-Wunsch algorithm for global alignment. Various scoring schemes can be used, including PAM and BLOSUM. Both techniques are discussed in detail later.

Pairwise comparison is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The process of multiple alignment can be regarded as an exercise in enhancing the signal-to-noise ratio within a set of sequences, which ultimately facilitates the elucidation of biologically significant motifs.

Entropy analysis is used to detect the coding and noncoding regions in a DNA sequence. The sliding window method and the recursive segmentation method are applied to calculate the Jensen-Shannon divergence. This helps in distinguishing the homogeneous segments in a heterogeneous DNA sequence.

Keywords: Pairwise alignment, Multiple alignment, Phylogeny, Entropy, DNA sequences.

PROBLEM DEFINITION AND SOLUTION

1. DNA TO PROTEIN TRANSLATION

The path from DNA to protein is described by the processes of replication, transcription and translation.

[Figure: Transcription and Translation]

After transcription, the unnecessary parts of the transcript, called introns, are removed so that the useful parts, the exons, are concatenated. The result is the coding sequence (CDS). Starting from the first nucleotide of the CDS, we form triplets (codons) until the end of the sequence. Given a CDS, and knowing the genetic code, it is possible to translate the DNA into protein by looking up successive codons in a genetic code table. However, this is only one case. The other cases arise when we shift the starting nucleotide by one and by two positions, and similarly for the complementary strand. There are thus six frames in all, which give six candidate translations. The protein we select is the longest one among these six frames that starts at a start codon and ends at a stop codon. [1]

Detecting open reading frames: An ORF is normally deemed to be the longest frame uninterrupted by a stop codon. Finding the end of an ORF is easier than finding its beginning. Usually the initial codon in the CDS is that for methionine (ATG), but methionine is also a common residue within the CDS, so its presence is not an absolute indicator of ORF initiation. Several features may be used as indicators of potential protein-coding regions in DNA. One of these is sufficient ORF length (based on the premise that long ORFs rarely occur by chance). Recognition of flanking Kozak sequences (CCGCCATGG) may also help in pinpointing the start of the CDS. [1]

Understanding the effect of introns and exons: Eukaryotic genes are made up of regions that contribute to the CDS, known as exons, and regions that do not, known as introns. One consequence of the presence of exons and introns in eukaryotic genes is that potential gene products can differ in length, because not all exons may be represented in the final transcribed mRNA.
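The six-frame translation and longest-ORF selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard genetic code, writes methionine as `M` and a stop codon as `*` (the usual one-letter convention), and the function names `translate`, `six_frames` and `longest_orf` are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then read the strand in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(strand, f) for strand in (dna, rc) for f in range(3)]


def longest_orf(dna):
    """Longest stretch that starts at M and is terminated by a stop codon."""
    best = ""
    for protein in six_frames(dna):
        # Every piece except the last one is followed by a stop codon.
        for piece in protein.split("*")[:-1]:
            start = piece.find("M")
            if start != -1 and len(piece) - start > len(best):
                best = piece[start:]
    return best


print(six_frames("ACATGAGTCGTACGTAGCTGACTGATCGT"))
print(longest_orf("ACATGAGTCGTACGTAGCTGACTGATCGT"))
```

The returned ORF excludes the terminating stop codon, so its length is one less than a count that includes the stop symbol.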
The whole process involved in DNA-to-protein translation can be illustrated as follows.

Query sequence (DNA): ACATGAGTCGTACGTAGCTGACTGATCGT

Six-frame amino-acid translation:

Forward 0: T#VVRS#LI
Forward 1: HESYVAD#S
Forward 2: *SRT#LTDR
Reverse 0: TISQLRTTH
Reverse 1: RSVSYVRL*
Reverse 2: DQSATYDSC

The start codon is represented by * and the stop codon by #. From these six frames we conclude that the protein is generated from the forward 2 translation, and the possible protein is *SRT# with length 5. After obtaining this candidate protein, alignment and comparison are applied to it against the other protein sequences present in the databases. If we get an identical match or high similarity, the protein is related to a known gene family. If not, the translated protein sequence may indicate the existence of a new gene family.

2. PAIRWISE SEQUENCE ALIGNMENT

Pairwise alignment is a fundamental process in sequence analysis. It is carried out to find the relationship, based on sequence properties, between any two sequences, whether protein, DNA or RNA. This section describes the comparison of two sequences, a query sequence (whose properties need to be determined) and a subject sequence (whose properties are already known), by searching for the series of individual characters or character patterns that occur in the same order in both sequences. This helps in identifying any similarity (in function) or evolutionary relationship (homology) between the query sequence and a family of known genes. A correct alignment is needed to find which segment of the gene has been altered (for example by point mutation, insertion, deletion or duplication).

There are two types of sequence alignment, local and global. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, the stretches of sequence with the highest density of matches are aligned, generating one or more islands of matches, or sub-alignments, in the aligned sequences. Local alignments are more suitable for sequences that are similar along some of their length but dissimilar elsewhere, sequences that differ in length, or sequences that share a conserved region or domain. Global alignment thus deals with similarity across the entire length of the sequences, and local alignment with regions of similarity in parts of the sequences (subsequences).
It is important to understand the difference between these two kinds of alignment, because sequences are not uniformly similar or identical. There is no point in performing a global alignment on sequences that have only local similarity. We now discuss each technique in detail.

SIMILARITY AND IDENTITY

Not only identity but also similarity is biologically significant. Many amino acids can be substituted by another with the same chemical properties, and the substituted amino acid remains compatible with protein structure and function. We can therefore use scoring matrices (for example PAM or BLOSUM), which assign a different score to each pairing depending on similarity or dissimilarity. Some scoring matrices are superior to others at finding related proteins based on either sequence or structure. BLOSUM matrices take into account the full range of amino acid substitutions in families of related proteins. PAM matrices are based on variation in closely related proteins, which is extrapolated to produce matrices for more distantly related proteins.

GLOBAL ALIGNMENT

As already discussed, this technique finds the similarity across the full length of the sequences. The algorithm used here is the Needleman-Wunsch algorithm (named after the scientists who proposed it), which is based on dynamic programming. The algorithm consists of three main parts, illustrated here for two example sequences.

1) We form a matrix representation of the query sequence and the subject sequence by placing them along the margins of the matrix. The matrix is a unitary matrix that weights identical elements with value 1 and all others with value 0. The cells can also be scored according to a scoring matrix.

[Table: Initial setup for Needleman-Wunsch]

2) The next step is to assign a score to all the pathways. Here we start from the bottom right and end at the upper left of the matrix; the matrix can also be traversed in the opposite direction. The matrix fill-up is done as

\[ M(i,j) = M(i,j) + \max\bigl[\, M(k,\, j+1),\; M(i+1,\, l) \,\bigr] \]

where k is any integer greater than i and l is any integer greater than j.

[Table: Halfway through the second step]

3) The final step is to trace back the whole path. The trace back starts from the highest value (in this case the top leftmost element). The alignment is traced proceeding left to right, top to bottom, choosing the largest numbers available.

[Table: Trace the alignment]

LOCAL ALIGNMENT

The Needleman-Wunsch algorithm works well for sequences that show similarity across their full length. However, sequences that are distantly related might show small regions of local similarity rather than similarity across the full length. The Smith-Waterman algorithm handles this problem efficiently. It follows the same initial matrix-based technique as the Needleman-Wunsch algorithm. The main difference is that, in Smith-Waterman, each element of the matrix defines the end point of a potential alignment (any element of the matrix can hold the highest value, not necessarily one at a terminal end). Only minimal changes to the Needleman-Wunsch algorithm are required:

1) A negative score must be given to mismatches, and wherever a negative cell score would result, zero is substituted. The score at any matrix point is given by

\[ S_{ij} = \max \Bigl\{\, S_{i-1,j-1} + s(a_i, b_j),\;\; \max_{x \ge 1} \bigl( S_{i-x,j} - W_x \bigr),\;\; \max_{y \ge 1} \bigl( S_{i,j-y} - W_y \bigr),\;\; 0 \,\Bigr\} \]

where S_{ij} is the score at position i in sequence A and position j in sequence B, s(a_i, b_j) is the score for aligning the characters at positions i and j, W_x is the penalty for a gap of length x in sequence A, and W_y is the penalty for a gap of length y in sequence B.

2) As explained above, the beginning and the end of an optimal path may be found anywhere in the matrix, not only at the endpoints.

In this example penalty for mismatch is 0.5 and gap penalty is 0.
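The difference between the two fills can be sketched in a few lines. The version below is an illustrative simplification, not the implementation described in this paper: it fills the matrix from the top left (the equivalent opposite direction), uses a single linear gap penalty per gap position instead of the general W_x, and scores with a simple match/mismatch rule rather than PAM or BLOSUM. The function name `align` and its parameters are hypothetical.

```python
def align(a, b, match=1, mismatch=-1, gap=1, local=False):
    """Return (score, aligned_a, aligned_b) for a global or local alignment."""
    n, m = len(a), len(b)
    # F[i][j] is the best score of an alignment ending at a[:i] and b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, n + 1):
            F[i][0] = -gap * i
        for j in range(1, m + 1):
            F[0][j] = -gap * j
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
            if local:
                # Smith-Waterman: negative scores are replaced by zero.
                F[i][j] = max(F[i][j], 0)
                if F[i][j] > best:
                    best, best_pos = F[i][j], (i, j)
    # Global trace back starts at the corner, local at the highest cell.
    i, j = best_pos if local else (n, m)
    score = F[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and F[i][j] == 0:
            break  # a local alignment ends where the score drops to zero
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score, "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("ACGT", "ACT"))                       # global alignment
print(align("TTTACGTTT", "GGACGGG", local=True))  # local alignment
```

With `local=True` the only changes are the zero floor, the unpenalised borders, and the trace back that starts at the highest-scoring cell, which mirrors the two modifications listed above.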

[Table: Smith-Waterman example]

GAP AND GAP PENALTY

The inclusion of gaps and gap penalties is necessary to obtain the optimal alignment between any two sequences. Gaps are the result of changes (mutations) that occurred in a sequence during evolution, so our task is to allow gaps at the right positions in the sequence to get a meaningful result. A gap penalty combines a gap opening penalty and a gap extension penalty. The total gap penalty for a sequence can be summarized as:

W = (number of gaps opened) * (gap opening penalty) + (total gap length) * (gap extension penalty)

The values of these penalties are chosen so that they do not disturb the overall balance. If the gap penalty is too high compared with the matrix scores, gaps will never appear in the alignment. If it is too low compared with the matrix scores, gaps will appear everywhere in the alignment in order to align as many identical characters as possible.

3. MULTIPLE SEQUENCE ALIGNMENT

Pairwise comparison is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The process of multiple alignment can be regarded as an exercise in enhancing the signal-to-noise ratio within a set of sequences, which ultimately facilitates the elucidation of biologically significant motifs. The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decisions on the relatedness of the sequences to a gene family. Sometimes multiple alignments may instead be used to express the dissimilarity between a set of sequences. Alignments should be regarded as models that can be used to test hypotheses. Just as there is nothing inherently correct or incorrect about any particular pairwise alignment, the same maxim holds for multiple alignments. [2]

Definition of multiple sequence alignment: Here, a small alignment of 5 short sequences (I-V) is presented. The sequences have been arranged so that the most similar residues are brought into vertical register, through the use of gaps, while the order of residues in each sequence is preserved.

MULTIPLE SEQUENCE ALIGNMENTS -- USES

Just as the alignment of a pair of nucleic acid or protein sequences can reveal whether or not there is an evolutionary relationship between them, the alignment of three or more sequences can reveal relationships among multiple sequences. A multiple sequence alignment can show which regions are most alike across the set. In proteins, such regions may represent conserved functional or structural domains. If the structure of one or more members of the alignment is known, it may be possible to predict which amino acids occupy the same spatial position in the other proteins in the alignment. In nucleic acids, such alignments also reveal structural and functional relationships. For example, aligned promoters of a set of similarly regulated genes may reveal consensus binding sites for regulatory proteins.

Consensus

Another use of the consensus information retrieved from a multiple sequence alignment is the prediction of specific probes for other members of the same group or family of similar sequences, in the same or other organisms. There are both computational and molecular biological applications. Once a consensus pattern has been found, database searching programs may be used to find other sequences with a similar pattern.

MULTIPLE SEQUENCE ALIGNMENT (MSA) TO PHYLOGENETIC ANALYSIS -- RELATIONSHIP

Once the MSA has been found, the number or types of changes in the aligned sequence residues may be used for a phylogenetic analysis. The alignment provides a prediction as to which sequence characters correspond. Each column in the alignment predicts the mutations that occurred at one site during the evolution of the sequence family, as illustrated in Figure 1. Within a column are original characters that were present early, as well as derived characters that appeared later in evolutionary time. In some cases the position is so important for function that mutational changes are not observed. It is these conserved positions that are useful for producing an alignment. In other cases the position is less important, and substitutions are observed. Deletions and insertions may also be present in some regions of the alignment. Thus, starting with the alignment, one can hope to dissect the order of appearance of the sequences during evolution.

Seq A   A  .  N  Q  P
Seq B   A  .  N  -- P
Seq C   A  R  Y  Q  P
Seq D   A  .  Y  Q  P

Figure 1: The close relationship between MSA and evolutionary tree construction. Shown is a short section of one MSA of four protein sequences, including conserved and substituted positions, an insertion (of R) and a deletion (of Q).

PROGRESSIVE GLOBAL ALIGNMENT

The pairwise alignment technique can be extended to align multiple sequences. However, the number of sequences that can be aligned this way is limited, because the number of computational steps and the amount of memory required grow exponentially with the number of sequences. Progressive alignment is the most commonly used method for aligning biological sequences. This heuristic approach is very rapid, requires little memory and performs well on relatively well-conserved, homologous sequences.

Description of progressive alignment methods: Progressive alignment builds a multiple alignment from pairwise alignments in three steps:

a) Compute the alignment scores (or distances) between all pairs of sequences.
b) Build a guide tree that reflects the similarities between sequences, using the pairwise alignment distances (as in Figure 1).
c) Align the sequences following the guide tree. At each node of the tree, the two sequences or alignments associated with its daughter nodes are aligned. The process starts at the tree leaves (the sequences) and ends at the tree root.

The problem with progressive alignment stems from the greedy nature of the algorithm: any mistake made during the early alignments cannot be corrected later as new sequence information is added.
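Steps (a) and (b) can be sketched as below. This is a rough illustration under stated assumptions rather than the method used in this paper: plain edit distance stands in for the pairwise alignment distance, the guide tree is built by average-linkage (UPGMA-style) clustering, and step (c), the actual alignment of sequences and profiles along the tree, is not implemented. The names `edit_distance` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in pairwise distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # gap in b
                           cur[j - 1] + 1,              # gap in a
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Map {name: sequence} to nested tuples that give the merge order."""
    # Step (a): distances between all pairs of input sequences.
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def linkage(c1, c2):
        # Average distance between the members of two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Step (b): repeatedly merge the two closest clusters.
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


print(guide_tree({"s1": "AAAA", "s2": "AAAT", "s3": "CCCC", "s4": "CCGG"}))
```

The nested tuples read from the innermost pair outwards give the order in which step (c) would align sequences and then intermediate alignments.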

4. SIGNIFICANCE OF SEQUENCE ALIGNMENT

Sequence alignment is useful for discovering functional, structural and evolutionary information in biological sequences. To discover this information we need the best possible (optimal) alignment. Sequences that are very similar probably have the same function, whether a regulatory role in the case of similar DNA molecules, or a similar biochemical function and three-dimensional structure in the case of proteins. In addition, if two sequences from different organisms are similar, they may share a common ancestor. In that case the sequences are said to be homologous. The alignment indicates the changes (mutations) that could have occurred between the two homologous sequences and a common ancestral sequence during evolution, and so helps to place a new sequence that has arisen through such mutational changes. In other cases, similar regions in sequences may not have a common ancestor but may have arisen independently through two evolutionary pathways converging on the same function, which is called convergent evolution.

5. ENTROPY ANALYSIS

The entropy is measured in linear time as the number of distinctive segments occurring in the regions. It helps in locating the borders between the coding and non-coding regions of a gene. The entropic segmentation process partitions a heterogeneous DNA sequence into homogeneous subsequences, which we term compositional domains. If we accept a domain picture of DNA sequences, it is natural to design computational approaches that segment a DNA sequence into homogeneous domains; computer algorithms that accomplish this are commonly called segmentation algorithms. Two well-known examples are the algorithm based on a hidden Markov model by Churchill and the walking Markov model algorithm by Fickett et al. (1992).

In the biology community, however, most people still use the older moving window approach. One advantage of the widely used sliding-window methods is that their implementation is straightforward: one calculates the density of a sequence feature of interest within a window, moves the window along the sequence, and recalculates the density. However, the choice of the window size and the moving distance is, in general, arbitrary. If the window size is too large, local fluctuations that contain significant biological information may be averaged out. If the moving distance is too long, one domain can be split between two windows and its distinctive feature may not be revealed. The moving window approach has other drawbacks as well. Another approach for detecting coding and non-coding borders is recursive segmentation. [5]

Detection of coding-noncoding borders

The coding potential measurement is obtained from within a coding or non-coding region (as opposed to from their borders). Such a measurement can either be learned from the data or be based on existing biological knowledge. However, current biological knowledge about coding potential is still mainly limited to the codon structure. The fact that coding regions, unlike noncoding regions, consist of three-base units, together with the fact that these units are not used with equal probability, provides a strong signal for coding potential. [4]
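The split score that entropic segmentation relies on, the Jensen-Shannon divergence over purine/pyrimidine composition, can be sketched as follows. This is a minimal, unoptimised illustration and not the authors' code: the function names `entropy`, `js_divergence`, `scan` and `segment` are hypothetical, the threshold is left to the caller, and each divergence is recomputed from scratch, so the recursion is slower than a production implementation would be.

```python
import math

PURINES = set("AG")


def entropy(seq):
    """Binary Shannon entropy (in bits) of the purine/pyrimidine composition."""
    if not seq:
        return 0.0
    p = sum(base in PURINES for base in seq) / len(seq)
    return -sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0.0)


def js_divergence(seq, cut):
    """D = H(W) - (nu/n) H(U) - (nv/n) H(V), with U = seq[:cut], V = seq[cut:]."""
    n = len(seq)
    left, right = seq[:cut], seq[cut:]
    return (entropy(seq)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def scan(seq, step):
    """Divergence at the cut points step, 2*step, ... (fixed-step scan)."""
    return [(cut, js_divergence(seq, cut)) for cut in range(step, len(seq), step)]


def segment(seq, threshold, offset=0):
    """Recursive segmentation: cut positions, relative to the full sequence."""
    if len(seq) < 2:
        return []
    # Find the cut that maximises the divergence within this subsequence.
    cut, best = max(((c, js_divergence(seq, c)) for c in range(1, len(seq))),
                    key=lambda item: item[1])
    if best < threshold:
        return []  # homogeneous enough: this branch becomes a leaf
    return (segment(seq[:cut], threshold, offset)
            + [offset + cut]
            + segment(seq[cut:], threshold, offset + cut))


print(scan("AAAGAGGACTCTCCTT", 4))
print(segment("A" * 50 + "C" * 50, 0.5))
```

A sequence made of a pure purine block followed by a pure pyrimidine block is split exactly at the block boundary, while each homogeneous block yields a divergence of zero and is not split further.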

The Jensen-Shannon divergence was implemented to carry out the entropy analysis.

Jensen-Shannon Divergence using the Sliding Window Method

For a given DNA sequence of length n, our code calculates the Jensen-Shannon divergence at multiples of a step size. First it determines the total number of purines (A and G) and pyrimidines (C and T) in the query DNA sequence, then calculates the entropy of the whole sequence, H(W):

Entropy (H) = -p*log2(p) - q*log2(q) [4]

where p = pur/size, q = pyr/size and log2(q) = log(q)/log(2).

Second, it computes the Jensen-Shannon divergence for the segments U = 1, ..., i*step and V = i*step+1, ..., n, where U is the left segment and V the right segment, moving from left to right in steps. For each step it determines the number of purines and pyrimidines in the left (U) and right (V) segments, calculates the entropy of each, H(U) and H(V), and then calculates the divergence for that step from

Divergence = H(W) - (nu/n)*H(U) - (nv/n)*H(V)

where nu is the length of the left segment, nv the length of the right segment, and n the length of the whole sequence. High divergence indicates that the left and right segments are more homogeneous with respect to themselves than with respect to the whole.

Example: Consider a step size of 20 for calculating the Jensen-Shannon divergence. The results for a query sequence are given below.

Query Sequence: TCCATTGAGCCTTATACCAGTAACATCTACACTCGAAGATCTTGTCAGGGGAATTTCAGATTGTGAATCCTCACTTACTGAAAGATCTTACTGAGCGGGG

FOR THE WHOLE SEQUENCE
Purines: 49
Pyrimidines: 51
Genome Length: 100
Entropy H(W):

AFTER STEP 1
Length of left segment (nu) = 20
Length of right segment (nv) = 80
Purines in U = 8
Pyrimidines in U = 12
Purines in V = 41
Pyrimidines in V = 39
DIVERGENCE =

AFTER STEP 2
Length of left segment (nu) = 40
Length of right segment (nv) = 60
Purines in U = 18
Pyrimidines in U = 22
Purines in V = 31
Pyrimidines in V = 29
DIVERGENCE =

AFTER STEP 3
Length of left segment (nu) = 60
Length of right segment (nv) = 40
Purines in U = 29
Pyrimidines in U = 31
Purines in V = 20
Pyrimidines in V = 20
DIVERGENCE =

AFTER STEP 4
Length of left segment (nu) = 80
Length of right segment (nv) = 20
Purines in U = 36
Pyrimidines in U = 44
Purines in V = 13
Pyrimidines in V = 7
DIVERGENCE =

Jensen-Shannon Divergence using Recursive Segmentation

For a sequence of length N, we calculate at each position i (0 < i < N) the entropy H(W) of the whole sequence, the entropy H(U) of the subsequence on the left side of the partition point, and the entropy H(V) of the subsequence on the right side of the partition point, and then calculate the Jensen-Shannon divergence. As a measure of the heterogeneity of the sequence we take the maximum Jensen-Shannon divergence over all positions, say maxdjs. If this is large enough, we say that the sequence is heterogeneous and should be segmented at that position. We then recursively apply the same procedure to both the left and the right subsequence. When maxdjs falls below the given threshold, the recursion along the current path stops. This recursive segmentation procedure is very similar to growing a binary tree: when the segmentation continues, two branches of the tree are generated; when it stops, that branch becomes a leaf. [5]

CONCLUSION AND DISCUSSION

This paper mainly concerns the implementation of basic existing algorithms and techniques used in sequence analysis: Needleman-Wunsch, Smith-Waterman, progressive methods for MSA, Jensen-Shannon divergence, entropy, and translation of DNA into protein. The Needleman-Wunsch and Smith-Waterman algorithms (examples of dynamic programming) are considered highly accurate in finding the optimal alignment, but they are relatively slow when large amounts of data are involved. Dynamic programming has time and space complexity of O(n^2) for pairwise alignment, where n is the length of the sequences.
If we generalize this algorithm to multiple sequence alignment, we add an extra dimension for each new sequence. The complexity thus becomes O(n^d), where d is the number of sequences being aligned. New methods that combine dynamic programming with a heuristic approach are coming into use; these are much faster than the basic algorithms.

Entropy analysis is an open area, and research is still under way to obtain accurate results for coding and non-coding regions. Recursive segmentation is an interesting alternative to the traditional moving window approach. Admittedly, the moving window approach is simple, fast (O(N) computational complexity versus O(N log N) for the recursion), and usually provides an answer to the questions of interest to investigators. Nevertheless, the recursive segmentation approach can be more accurate, and it avoids the common problem in the moving window approach of having to select a window size and a moving distance. We suggest using recursive segmentation as a refinement of the moving window approach, or as a second-stage analysis after a rough result has been obtained from the moving window approach.

BIBLIOGRAPHY

Books

1) T. K. Attwood, D. J. Parry-Smith. Introduction to Bioinformatics. Pearson Education Asia.
2) David W. Mount. Bioinformatics: Sequence and Genome Analysis.
3) Cynthia Gibas, Per Jambeck. Developing Bioinformatics Computer Skills. O'Reilly.

Papers

4) Pedro Bernaola-Galván, Ivo Grosse, Pedro Carpena, José L. Oliver, Ramón Roldán, and H. Eugene Stanley. Finding borders between coding and non-coding DNA regions by an entropic segmentation method. (Aug 1999)
5) Wentian Li, Pedro Bernaola-Galván, Fatameh Haghighi, Ivo Grosse. Application of recursive segmentation to the analysis of DNA sequences. (Nov 2001)
6) Aaron Davidson. A fast pruning algorithm for optimal sequence alignment.


More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

More information

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists A greedy, graph-based algorithm for the alignment of multiple homologous gene lists Jan Fostier, Sebastian Proost, Bart Dhoedt, Yvan Saeys, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele Bioinformatics

More information

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION Harmandeep Singh 1, Er. Rajbir Singh Associate Prof. 2, Navjot Kaur 3 1 Lala Lajpat Rai Institute

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1. Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Moreover, the circular logic

Moreover, the circular logic Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

More information

Multiple Choice Review- Eukaryotic Gene Expression

Multiple Choice Review- Eukaryotic Gene Expression Multiple Choice Review- Eukaryotic Gene Expression 1. Which of the following is the Central Dogma of cell biology? a. DNA Nucleic Acid Protein Amino Acid b. Prokaryote Bacteria - Eukaryote c. Atom Molecule

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics HMM Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

Computational Molecular Biology (

Computational Molecular Biology ( Computational Molecular Biology (http://cmgm cmgm.stanford.edu/biochem218/) Biochemistry 218/Medical Information Sciences 231 Douglas L. Brutlag, Lee Kozar Jimmy Huang, Josh Silverman Lecture Syllabus

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

1. In most cases, genes code for and it is that

1. In most cases, genes code for and it is that Name Chapter 10 Reading Guide From DNA to Protein: Gene Expression Concept 10.1 Genetics Shows That Genes Code for Proteins 1. In most cases, genes code for and it is that determine. 2. Describe what Garrod

More information

Lecture 5: September Time Complexity Analysis of Local Alignment

Lecture 5: September Time Complexity Analysis of Local Alignment CSCI1810: Computational Molecular Biology Fall 2017 Lecture 5: September 21 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

GCD3033:Cell Biology. Transcription

GCD3033:Cell Biology. Transcription Transcription Transcription: DNA to RNA A) production of complementary strand of DNA B) RNA types C) transcription start/stop signals D) Initiation of eukaryotic gene expression E) transcription factors

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Organic Chemistry Option II: Chemical Biology

Organic Chemistry Option II: Chemical Biology Organic Chemistry Option II: Chemical Biology Recommended books: Dr Stuart Conway Department of Chemistry, Chemistry Research Laboratory, University of Oxford email: stuart.conway@chem.ox.ac.uk Teaching

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi Bioinformatics Sequence Analysis An introduction Part 8 Mahdi Vasighi Sequence analysis Some of the earliest problems in genomics concerned how to measure similarity of DNA and protein sequences, either

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

More information

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir Sequence Bioinformatics Multiple Sequence Alignment Waqas Nasir 2010-11-12 Multiple Sequence Alignment One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out

More information

Pairwise Sequence Alignment

Pairwise Sequence Alignment Introduction to Bioinformatics Pairwise Sequence Alignment Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Outline Introduction to sequence alignment pair wise sequence alignment The Dot Matrix Scoring

More information

Chapter 17. From Gene to Protein. Biology Kevin Dees

Chapter 17. From Gene to Protein. Biology Kevin Dees Chapter 17 From Gene to Protein DNA The information molecule Sequences of bases is a code DNA organized in to chromosomes Chromosomes are organized into genes What do the genes actually say??? Reflecting

More information

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT Inferring phylogeny Constructing phylogenetic trees Tõnu Margus Contents What is phylogeny? How/why it is possible to infer it? Representing evolutionary relationships on trees What type questions questions

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information