Introduction to Bioinformatics

Introduction to Bioinformatics http://1.51.212.243/bioinfo.html Dr. rer. nat. Jing Gong Cancer Research Center School of Medicine, Shandong University 211.1.12

Chapter 3 Alignment

Similarity Searches on Sequence Databases In the game of Mahjong Titans, you look for the tile that carries the same symbol as a given tile. All you can do is compare that symbol with every other one, by eye.

Similarity Searches on Sequence databases For a protein or DNA sequence, it means finding a similar one from a collection of sequences. It is impossible to compare every pair in the biological databases with your eyes, because there are too many sequences. BLAST > 1,

The Importance of Similarity Similar sequences often derive from a common ancestral sequence, so they probably share a similar structure and biological function. What you know about one particular DNA or protein sequence can therefore be tentatively extended to all similar DNA or protein sequences. Similar sequences, similar structures, similar functions.

Identity and Similarity
Residue: a letter; an amino acid in a protein or a base in a DNA sequence.
Identity: if two sequences (protein or DNA) have the same length, the identity between them is the percentage of identical residues relative to their length.
Similarity: if two sequences (protein or DNA) have the same length, the similarity between them is the percentage of similar residues (identical residues included) relative to their length. Which residues are similar and which are not is defined by a substitution matrix, such as BLOSUM.

seq 1 : CLHK
seq 2 : CIHL
Identity = 2/4 = 50%
Similarity = 3/4 = 75%

What happens when two sequences have different lengths?
seq 1 : CLHKA
seq 2 : CIHL
Identity? Similarity?
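The equal-length definitions above can be sketched in a few lines of Python. The function name and the SIMILAR_PAIRS set are hypothetical: the set only lists the one pair needed for the slide's example (BLOSUM62 gives I/L a positive score), whereas a real tool would consult a full substitution matrix.

```python
# Illustrative subset only: pairs treated as "similar" for this example.
SIMILAR_PAIRS = {frozenset("IL")}  # BLOSUM62 scores I/L positively


def identity_and_similarity(seq1, seq2):
    """Percent identity and similarity for two sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("these definitions require sequences of equal length")
    pairs = list(zip(seq1, seq2))
    identical = sum(a == b for a, b in pairs)
    # Identical residues also count as similar.
    similar = sum(a == b or frozenset((a, b)) in SIMILAR_PAIRS for a, b in pairs)
    return 100 * identical / len(pairs), 100 * similar / len(pairs)


identity, similarity = identity_and_similarity("CLHK", "CIHL")
print(f"Identity = {identity:.0f}%, Similarity = {similarity:.0f}%")
```

Calling the function on CLHKA and CIHL raises an error, which is exactly the open question the slide ends with.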

Identity and Similarity Homologous: as a rule of thumb, two protein sequences with an identity of at least 25%, or two DNA sequences with an identity of at least 70%, can be regarded as homologous. However, nothing is certain about the meaning of an observed similarity. Some protein sequences are less than 15% identical but have the same 3D structure, while some are 25% identical but have different structures. Neither homology nor non-homology can be taken for granted; the 25% cutoff is mostly a common-sense indicator. In most cases, deciding whether two sequences are truly homologous requires considering many other things. Homology is a binary relationship: yes or no. Similarity is a quantifiable property: 0%-100%.

The Most Popular Search Tool: BLAST BLAST (Basic Local Alignment Search Tool): a sequence comparison algorithm, optimized for speed, that searches sequence databases for optimal local alignments to a query.
Different kinds of BLAST:
- BLASTn: search a nucleotide database using a nucleotide query.
- BLASTp: search a protein database using a protein query.
- BLASTx: search a protein database using a translated nucleotide query.
- tblastn: search a translated nucleotide database using a protein query.
- tblastx: search a translated nucleotide database using a translated nucleotide query.
Translated nucleotide: a nucleotide sequence translated into six protein sequences, one for each of the six reading frames (open reading frames, ORFs, in prokaryotes).

Nucleotide Databases: Reading into Genes and Genomes
Reading frame: a way of breaking a DNA sequence into three-letter codons that can be translated into amino acids. Each strand can be read in 3 frames, so the two strands give 3 + 3 = 6 reading frames.
ORF (Open Reading Frame): a DNA sequence that, in a given reading frame, contains a start codon and is not interrupted by a stop codon.
Start codon: ATG, Met (M). Stop codons: TAA, TAG, TGA.
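As a rough illustration of the six reading frames, the sketch below translates a DNA string in the three forward frames and the three frames of its reverse complement. It assumes the standard genetic code; the function names and the toy input are made up for this example, and stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the translations of frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement)
        for offset in range(3)
    ]


frames = six_frame_translation("ATGAAATAA")
for label, protein in zip(("+1", "+2", "+3", "-1", "-2", "-3"), frames):
    print(label, protein)
```

Only frame +1 of this toy input starts with ATG (M) and runs to a stop codon, so only that frame contains an ORF in the sense defined above.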

The Most Popular Search Tool: BLAST
The NCBI BLAST server: BLASTp
Start from the NCBI home page (http://www.ncbi.nlm.nih.gov/) or go directly to http://blast.ncbi.nlm.nih.gov. The example query file is blast.fasta (course page: http://1.51.212.243/bioinfo.html).

On the BLASTp query form you can:
- enter the query sequence (blast.fasta)
- query only a part of your sequence
- give a name to your job
- BLAST another sequence at the same time
- select in which database you want to search
- type which species you want to search, e.g. human
- select the algorithm

The result page has four parts:
- Part 1: a brief summary.
- Part 2: graphic summary. The figure illustrates the sequence length and classification of the input protein and gives an overview of similar sequences.
- Part 3: descriptions. From each hit you can go to the corresponding database entry, or go to the alignment between your query sequence and the matching sequence.
- Part 4: alignment.

Upgraded BLAST: PSI-BLAST Sometimes BLAST is not enough. For instance, you want to catch all the members of a very large protein family, starting with one sequence that you have. When running BLAST, you catch only the most closely related sequences. The other distant members would not be found. In other words, you find your direct friends, but the friends of your friends are missed. PSI (Position-Specific Iterated)-BLAST first looks for sequences that are closely related to yours; and then, gradually, it extends the circle of friends to include sequences that are distantly related. - How does PSI-BLAST extend the circle of friends? - A Position-Specific Weight Matrix and Iterations.

Position-Specific Weight Matrix
Seq1: A B C D
Seq2: B B C D
Seq3: A C C D
Seq4: A B D D

         1      2      3      4
  A     75%     0%     0%     0%
  B     25%    75%     0%     0%
  C      0%    25%    75%     0%
  D      0%     0%    25%   100%

A Position-Specific Weight Matrix describes the letter distribution at each position (column) for a family of sequences. The distributions can be presented as probabilities or other statistical values.
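A minimal sketch of how such a matrix can be computed from the four example sequences, using plain letter frequencies per column. The function name is hypothetical, and real tools typically store log-odds scores with pseudocounts rather than raw frequencies.

```python
from collections import Counter


def position_specific_matrix(sequences):
    """Return one {letter: frequency} dict per alignment column."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for column in zip(*sequences):
        counts = Counter(column)
        matrix.append({letter: n / len(sequences) for letter, n in counts.items()})
    return matrix


family = ["ABCD", "BBCD", "ACCD", "ABDD"]
matrix = position_specific_matrix(family)
for position, distribution in enumerate(matrix, start=1):
    print(position, distribution)
```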

Upgraded BLAST: PSI-BLAST The first round of search (first iteration) of PSI-BLAST is just like BLAST. For the query sequence ABCD, all closely related sequences that differ by one letter (BBCD, ACCD and ABDD) are found, but BCCD, which differs by two residues, is missed. Then a Position-Specific Weight Matrix is built from ABCD, BBCD, ACCD and ABDD, and this matrix is used in the second round of search (second iteration). Since BCCD matches the matrix, it is now found. A second matrix is then built from ABCD, BBCD, ACCD, ABDD and BCCD, and further new sequences will be found in later iterations. PSI-BLAST can detect distant evolutionary relationships. It is especially useful when the proteins returned by the first round of search are all hypothetical proteins, unknown proteins or predicted proteins.
[Figure: iterations spreading outward from the query ABCD through BBCD, ACCD and ABDD to BCCD and other, more distant sequences]
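The iteration described above can be imitated with a deliberately simplified toy. This is not how PSI-BLAST scores hits: here round 1 accepts sequences that differ from the query by exactly one letter, and later rounds accept any sequence whose letters have all been seen at their positions in the current family. All names and the tiny database are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_profile(sequences):
    """One Counter of letters per column."""
    return [Counter(column) for column in zip(*sequences)]


def matches_profile(seq, profile):
    """True if every letter of seq was observed at its position."""
    return all(profile[i][letter] > 0 for i, letter in enumerate(seq))


def toy_psi_blast(query, database, rounds=2):
    # Round 1: plain search, keep sequences one letter away from the query.
    family = [query] + [s for s in database if hamming(query, s) == 1]
    # Later rounds: search with the profile built from the current family.
    for _ in range(rounds - 1):
        profile = build_profile(family)
        new_hits = [
            s for s in database if s not in family and matches_profile(s, profile)
        ]
        if not new_hits:
            break
        family.extend(new_hits)
    return family


DATABASE = ["BBCD", "ACCD", "ABDD", "BCCD", "DDAA"]
print(toy_psi_blast("ABCD", DATABASE, rounds=1))
print(toy_psi_blast("ABCD", DATABASE, rounds=2))
```

BCCD appears only in the second call, mirroring the slide: it differs from the query at two positions, but each of its letters occurs in the family found in round 1.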

Upgraded BLAST: PSI-BLAST The NCBI BLAST server : PSI-BLAST http://blast.ncbi.nlm.nih.gov


Upgraded BLAST: PHI-BLAST PHI (Pattern-Hit Initiated)-BLAST: in every round of BLAST (iteration), you must supply a sequence pattern that filters the results. Only the BLAST hits that match the pattern are kept as results.
Sequence pattern: [LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)
Yes: VGEAAMPRI
No:  VGEAAYPRI
PHI-BLAST can find very exact friends.
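To see why the first sequence matches and the second does not, the pattern can be converted into a regular expression. The converter below is a hypothetical helper that handles only the elements used in this pattern (single letters, bracketed sets, x, and repeat counts in parentheses), not the full pattern syntax.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Convert a dash-separated sequence pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        core = "." if core == "x" else core  # x means any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)


regex = pattern_to_regex("[LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)")
print(regex)
for candidate in ("VGEAAMPRI", "VGEAAYPRI"):
    print(candidate, bool(re.search(regex, candidate)))
```

The second sequence fails at the sixth pattern position, where Y is not one of L, I, V or M.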

Upgraded BLAST: PHI-BLAST The NCBI BLAST server : PHI-BLAST http://blast.ncbi.nlm.nih.gov

The Most Popular Search Tool: BLAST [Figure: a query sequence and the three search tools BLAST, PSI-BLAST and PHI-BLAST]

Similarity Searches for Free over the Internet
BLAST Servers around the World

  Server   Location   URL
  NCBI     USA        http://www.ncbi.nlm.nih.gov/blast
  ExPASy   Europe     http://web.expasy.org/blast
  EBI      Europe     http://www.ebi.ac.uk/tools/sss
  DDBJ     Japan      http://blast.ddbj.nig.ac.jp

Other similarity search tools:
- WU-BLAST: WU stands for Washington University. More sensitive and more gifted at inserting gaps than NCBI-BLAST.
- Smith and Waterman (SSEARCH): it's slower, but more accurate than BLAST.
- FASTA: it's a bit slower than BLAST but more accurate when making DNA comparisons.
- BLAT: use this for locating cDNA rapidly in a genome or finding close (mammalian vs. mammalian) proteins in a genome.

Comparing Two Sequences Comparing two sequences can help you to:
- convince yourself that two sequences are in fact homologous;
- find out that your sequences share a domain;
- identify the exact location of common features, such as disulfide bridges or catalytic active sites.
Domain: a structural and functional unit in a protein. A protein can be a single-domain protein or a multiple-domain protein.

Comparing Two Sequences Methods: dot plot, global/local alignment. The dot plot is the simplest means of comparing two sequences. In fact, it is the only comparison you can do with pencil and paper, without a computer. Advantages: no biological hypothesis is required, and the results can be analyzed by eye.
Seq1: THEFASTCAT, length(seq1) = 10
Seq2: THEFATCAT, length(seq2) = 9
10 x 9 = 90 comparisons

      T H E F A S T C A T   (Seq 1)
  T   x . . . . . x . . x
  H   . x . . . . . . . .
  E   . . x . . . . . . .
  F   . . . x . . . . . .
  A   . . . . x . . . x .
  T   x . . . . . x . . x
  C   . . . . . . . x . .
  A   . . . . x . . . x .
  T   x . . . . . x . . x
  (Seq 2)

The diagonals indicate the segments of similarity between the two sequences:
1. THEFA
2. TCAT
3. AT
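The pencil-and-paper procedure amounts to comparing every residue of one sequence with every residue of the other. The short sketch below does exactly that; the function names are illustrative, and a `.` stands for an empty cell.

```python
def dot_plot(seq_cols, seq_rows):
    """Return the set of (row, column) cells where the two residues are equal."""
    return {
        (r, c)
        for r, a in enumerate(seq_rows)
        for c, b in enumerate(seq_cols)
        if a == b
    }


def render(seq_cols, seq_rows, dots):
    """Draw the plot as text, one row per residue of seq_rows."""
    lines = ["  " + " ".join(seq_cols)]
    for r, residue in enumerate(seq_rows):
        cells = ("x" if (r, c) in dots else "." for c in range(len(seq_cols)))
        lines.append(residue + " " + " ".join(cells))
    return "\n".join(lines)


seq1, seq2 = "THEFASTCAT", "THEFATCAT"
dots = dot_plot(seq1, seq2)
print(render(seq1, seq2, dots))
print(len(seq1) * len(seq2), "comparisons,", len(dots), "dots")
```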

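Comparing Two Sequences You can also make a dot plot of one sequence against itself to discover repeated subsequences hidden in it.
Seq1: THEFASTHE

      T H E F A S T H E
  T   x . . . . . x . .
  H   . x . . . . . x .
  E   . . x . . . . . x
  F   . . . x . . . . .
  A   . . . . x . . . .
  S   . . . . . x . . .
  T   x . . . . . x . .
  H   . x . . . . . x .
  E   . . x . . . . . x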

Comparing Two Sequences
Dot plot servers

  Name     URL
  Dotlet   http://myhits.isb-sib.ch/cgi-bin/dotlet
  Dnadot   http://arbl.cvmbs.colostate.edu/molkit/dnadot
  Dotter   http://sonnhammer.sbc.su.se/dotter.html
  Dottup   http://emboss.sourceforge.net

Comparing Two Sequences Dotlet server: http://myhits.isb-sib.ch/cgi-bin/dotlet
- The Sequence Input dialog (seq1; example file: dotlet.fasta).
- Controls: window size and zoom.
- The dots window displays the diagonal plot.
- The histogram window defines the grayscale.
- The alignment window.

Comparing Two Sequences Use Dot Plot to detect tandem repeats in a sequence. Tandem repeat: two or more repeated units directly adjacent to each other. Example: CCCABCABCABCDDD. Tandem repeats are often used by evolution to create new proteins or make them function more efficiently. A Short Tandem Repeat (STR) in DNA is a pattern that helps determine an individual's inherited traits. A short tandem repeat polymorphism (STRP) occurs when homologous STR loci differ in the number of repeats between individuals. By identifying repeats of a specific sequence at specific locations in the genome, it is possible to create a genetic profile of an individual. There are currently over 10,000 published STR sequences in the human genome. STR analysis has become the prevalent analysis method for determining genetic profiles in forensic cases.

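Comparing Two Sequences Use Dot Plot to detect tandem repeats in a sequence. Tandem repeats: two or more repeated units directly adjacent to each other. Example: CCCABCABCABCDDD (as on the slide, only the main diagonal and the repeat diagonals are drawn)

      C C C A B C A B C A B C D D D
  C   x . . . . . . . . . . . . . .
  C   . x . . . . . . . . . . . . .
  C   . . x . . . . . . . . . . . .
  A   . . . x . . x . . x . . . . .
  B   . . . . x . . x . . x . . . .
  C   . . . . . x . . x . . x . . .
  A   . . . x . . x . . x . . . . .
  B   . . . . x . . x . . x . . . .
  C   . . . . . x . . x . . x . . .
  A   . . . x . . x . . x . . . . .
  B   . . . . x . . x . . x . . . .
  C   . . . . . x . . x . . x . . .
  D   . . . . . . . . . . . . x . .
  D   . . . . . . . . . . . . . x .
  D   . . . . . . . . . . . . . . x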

Comparing Two Sequences Use Dot Plot to detect tandem repeats in a sequence (example file: tandem.fasta).
1. The number of repeats is equal to the number of diagonals including the main diagonal.
2. The distance between two adjacent diagonals represents the length of the repeat.
3. The shortest diagonal gives you a single repeat unit.
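The three rules above can be turned into a rough heuristic. In a self dot plot, the diagonal at offset k contains the matches between position i and position i + k, so the sketch below measures the longest unbroken run on each diagonal, takes the offset with the longest run as the repeat length, and derives the copy number from the run length. The function names are hypothetical, and the heuristic assumes one dominant, exact tandem repeat.

```python
def longest_diagonal_runs(seq):
    """For each offset k >= 1, the longest run of i with seq[i] == seq[i + k]."""
    runs = {}
    for k in range(1, len(seq)):
        best = current = 0
        for i in range(len(seq) - k):
            current = current + 1 if seq[i] == seq[i + k] else 0
            best = max(best, current)
        runs[k] = best
    return runs


def estimate_tandem_repeat(seq):
    """Return (repeat unit length, estimated number of adjacent copies)."""
    runs = longest_diagonal_runs(seq)
    # Longest run wins; on a tie, prefer the smaller offset.
    unit = max(runs, key=lambda k: (runs[k], -k))
    # n exact copies of a unit of length k give a run of at least k * (n - 1).
    copies = runs[unit] // unit + 1
    return unit, copies


runs = longest_diagonal_runs("CCCABCABCABCDDD")
unit_length, copies = estimate_tandem_repeat("CCCABCABCABCDDD")
print(f"repeat length {unit_length}, {copies} copies")
```

For the example sequence the longest off-diagonal run lies at offset 3, which matches the spacing between adjacent diagonals in the plot.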

Alignment An alignment is an arrangement of two protein or DNA sequences that identifies regions of similarity, which may be a consequence of functional, structural, or evolutionary relationships between the sequences. Gaps are inserted between the residues so that identical or similar characters are aligned in the same columns. Global alignment is most useful when the two sequences are similar and of roughly equal size. Local alignment is more useful for dissimilar sequences that are suspected to contain segments of similarity.

Alignment A substitution matrix such as BLOSUM62 gives a score for every pair of amino acids, defining what is similar and how similar.

Alignment
Usages of global alignment:
- Checking minor differences between two sequences. This may happen with data that you've manipulated and possibly altered. The global alignment is the best way to localize potential problems.
- Analyzing polymorphisms (for example, SNPs) between closely related sequences.
- Comparing two sequences that partly overlap. In that case, you want to make a global pairwise comparison that doesn't penalize misalignments at the extremities of the sequences.
Usages of local alignment:
- Comparing two distantly related sequences that share only a few noncontiguous domains.
- Analyzing repeated elements within a single sequence.

Global Alignment How is a global alignment generated?
Input:
  Seq1: PYMNVI
  Seq2: PYELF
  substitution matrix (BLOSUM62)
  gap penalty (-1 by default)
The score of an arbitrary residue vs. another arbitrary residue is given in the substitution matrix; the gap penalty gives the score of an arbitrary residue vs. a gap.
Output:
  PYMNVI        PYMNVI
  PY-ELF   or   PYE-LF   or ?
  **  :.        **  :.

Global Alignment Seq1: PYMNVI Seq2: PYELF

Step 1: draw a matrix with Seq1 along the top and Seq2 down the side, with one extra row and column in front.

Step 2: set S(0, 0) = 0 and fill the first row and the first column with multiples of the gap penalty.

           P   Y   M   N   V   I
       0  -1  -2  -3  -4  -5  -6
  P   -1
  Y   -2
  E   -3
  L   -4
  F   -5

Step 3: fill every other cell with

$$S(i, j) = \max \begin{cases} S(i-1, j-1) + m(s1_i, s2_j) \\ S(i, j-1) + \text{gap} \\ S(i-1, j) + \text{gap} \end{cases}$$

For the first cell, S(1, 1) is the maximum of:
  S(0, 0) + m(s1_1, s2_1) = 0 + 7 = 7
  S(1, 0) + gap = -1 + (-1) = -2
  S(0, 1) + gap = -1 + (-1) = -2
so S(1, 1) = 7.

Step 4: a border cell has only one possible predecessor. For example, S(0, 1) = S(0, 0) + gap = 0 + (-1) = -1; the other two terms do not exist (Ø).

Step 5: continue along the first row.

           P   Y   M   N   V   I
       0  -1  -2  -3  -4  -5  -6
  P   -1   7   6   5   4   3   2
  Y   -2
  E   -3
  L   -4
  F   -5

Global Alignment Seq1: PYMNVI Seq2: PYELF

Step 6: fill the rest of the matrix in the same way. For example, S(3, 3) is the maximum of:
  S(2, 2) + m(s1_3, s2_3) = 14 + (-2) = 12
  S(3, 2) + gap = 13 + (-1) = 12
  S(2, 3) + gap = 13 + (-1) = 12
so S(3, 3) = 12.

Step 7: the matrix is complete. The bottom-right cell, 14, is the score of the best global alignment.

           P   Y   M   N   V   I
       0  -1  -2  -3  -4  -5  -6
  P   -1   7   6   5   4   3   2
  Y   -2   6  14  13  12  11  10
  E   -3   5  13  12  13  12  11
  L   -4   4  12  15  14  14  14
  F   -5   3  11  14  13  13  14

Global Alignment Step 8: trace back from the bottom-right cell to the top-left cell, moving at each cell to the neighbour that produced its score. There is at least one such path.

           P   Y   M   N   V   I
       0  -1  -2  -3  -4  -5  -6
  P   -1   7   6   5   4   3   2
  Y   -2   6  14  13  12  11  10
  E   -3   5  13  12  13  12  11
  L   -4   4  12  15  14  14  14
  F   -5   3  11  14  13  13  14

Result:
  seq1 PYMNVI
  seq2 PY-ELF
       **  :.
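The eight steps can be reproduced with a short dynamic-programming sketch. It assumes a linear gap penalty of -1, as in the worked example, and hardcodes only the BLOSUM62 scores needed for these two sequences; the names below are illustrative. When several paths tie, this traceback prefers the diagonal move, then a gap in Seq2, so it returns just one of the possible alignments.

```python
# BLOSUM62 scores for the residues of Seq2 (keys) against P, Y, M, N, V, I.
SEQ1_LETTERS = "PYMNVI"
BLOSUM62_SUBSET = {
    "P": [7, -3, -2, -2, -2, -3],
    "Y": [-3, 7, -1, -2, -1, -1],
    "E": [-1, -2, -2, 0, -2, -3],
    "L": [-3, -1, 2, -3, 1, 2],
    "F": [-4, 3, 0, -3, -1, 0],
}


def score(a, b):
    """m(a, b) for a residue a of Seq1 and a residue b of Seq2."""
    return BLOSUM62_SUBSET[b][SEQ1_LETTERS.index(a)]


def global_align(seq1, seq2, gap=-1):
    rows, cols = len(seq2) + 1, len(seq1) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # Step 2: first row
        S[0][j] = j * gap
    for i in range(1, rows):          # Step 2: first column
        S[i][0] = i * gap
    for i in range(1, rows):          # Steps 3 to 7: fill the matrix
        for j in range(1, cols):
            S[i][j] = max(
                S[i - 1][j - 1] + score(seq1[j - 1], seq2[i - 1]),
                S[i][j - 1] + gap,
                S[i - 1][j] + gap,
            )
    # Step 8: trace back from the bottom-right to the top-left cell.
    i, j, top, bottom = rows - 1, cols - 1, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(seq1[j - 1], seq2[i - 1]):
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1]); i -= 1; j -= 1
        elif j > 0 and S[i][j] == S[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1]); i -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, aligned1, aligned2 = global_align("PYMNVI", "PYELF")
for row in S:
    print(row)
print(aligned1)
print(aligned2)
```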

Identity and Similarity Identity and similarity were defined above only for two sequences (protein or DNA) of the same length: the percentage of identical residues, or of similar residues as defined by a matrix such as BLOSUM, relative to that length. What happens when two sequences have different lengths?
seq 1 : CVHKA
seq 2 : CIHL
Identity? Similarity?
With the help of global alignment, we can now define them for sequences of different lengths.

Redefinition of Identity and Similarity
Identity: the identity between two sequences is defined as the percentage of identical residues in their global alignment.
Similarity: the similarity between two sequences is defined as the percentage of similar residues in their global alignment.

  PYMNVI
  PY-ELF
  **  :.

Identity = 2 / 6 = 33.3%
Similarity = 4 / 6 = 66.7%
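A small sketch of the redefinition, working on the two aligned strings. The SIMILAR_PAIRS set is an assumption taken from the marker line of this example (the columns marked `:` and `.`); which pairs count as similar depends on the matrix and threshold a given tool uses.

```python
# Pairs counted as similar in this example (columns marked ':' and '.').
SIMILAR_PAIRS = {frozenset("VL"), frozenset("IF")}


def alignment_identity_similarity(aligned1, aligned2):
    """Percent identity and similarity over all columns of an alignment."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned1, aligned2))
    identical = sum(a == b and a != "-" for a, b in columns)
    similar = sum(
        (a == b and a != "-") or frozenset((a, b)) in SIMILAR_PAIRS
        for a, b in columns
    )
    return (
        round(100 * identical / len(columns), 1),
        round(100 * similar / len(columns), 1),
    )


identity, similarity = alignment_identity_similarity("PYMNVI", "PY-ELF")
print(f"Identity = {identity}%, Similarity = {similarity}%")
```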

Local Alignment How is a local alignment generated?
Input:
  Seq1: PYMNVI
  Seq2: MN
  substitution matrix (BLOSUM62)
  gap penalty (-1 by default)
The score of an arbitrary residue vs. another arbitrary residue is given in the substitution matrix; the gap penalty gives the score of an arbitrary residue vs. a gap.
Output:
  PYMNVI        MN
  --MN--   or   MN
    **          **

Local Alignment Seq1: PYMNVI Seq2: MN

Step 1: draw a matrix with Seq1 along the top and Seq2 down the side, with one extra row and column in front.

Step 2: fill the first row and the first column with 0.

           P   Y   M   N   V   I
       0   0   0   0   0   0   0
  M    0
  N    0

Step 3: fill every other cell with the recurrence below; a cell never goes below 0.

$$S(i, j) = \max \begin{cases} S(i-1, j-1) + m(s1_i, s2_j) \\ S(i, j-1) + \text{gap} \\ S(i-1, j) + \text{gap} \\ 0 \end{cases}$$

For the first cell:
  S(0, 0) + m(s1_1, s2_1) = 0 + (-2) = -2
  S(1, 0) + gap = 0 + (-1) = -1
  S(0, 1) + gap = 0 + (-1) = -1
All three terms are negative, so S(1, 1) = 0.

Step 4: the first positive score appears where M meets M: 0 + 5 = 5.

Step 5: complete the matrix.

           P   Y   M   N   V   I
       0   0   0   0   0   0   0
  M    0   0   0   5   4   3   2
  N    0   0   0   4  11  10   9

Step 6: find the maximum of the two borders (here 11).

Step 7: trace back until reaching 0.

Step 8: result:
  MN
  MN
  **
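The local variant differs from the global one in three places: the borders start at 0, cells never drop below 0, and the traceback starts at the best cell and stops at 0. The sketch below follows the standard Smith-Waterman formulation, which takes the highest-scoring cell anywhere in the matrix; that choice, the function names and the hardcoded BLOSUM62 subset are assumptions of this example.

```python
# BLOSUM62 scores for the residues of Seq2 (keys) against P, Y, M, N, V, I.
SEQ1_LETTERS = "PYMNVI"
BLOSUM62_SUBSET = {
    "M": [-2, -1, 5, -2, 1, 1],
    "N": [-2, -2, -2, 6, -3, -3],
}


def score(a, b):
    """m(a, b) for a residue a of Seq1 and a residue b of Seq2."""
    return BLOSUM62_SUBSET[b][SEQ1_LETTERS.index(a)]


def local_align(seq1, seq2, gap=-1):
    rows, cols = len(seq2) + 1, len(seq1) + 1
    S = [[0] * cols for _ in range(rows)]      # borders stay at 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(
                0,                              # never go below 0
                S[i - 1][j - 1] + score(seq1[j - 1], seq2[i - 1]),
                S[i][j - 1] + gap,
                S[i - 1][j] + gap,
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    # Trace back from the best cell until a 0 is reached.
    i, j = best_cell
    top, bottom = [], []
    while S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + score(seq1[j - 1], seq2[i - 1]):
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1]); i -= 1
    return S, best, "".join(reversed(top)), "".join(reversed(bottom))


S, best, aligned1, aligned2 = local_align("PYMNVI", "MN")
for row in S:
    print(row)
print(best, aligned1, aligned2)
```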

Making Global Alignment Over the Internet BLAST is an abbreviation of Basic Local Alignment Search Tool. In a BLAST search, how is the most similar sequence found? Is the query sequence aligned to each sequence of the entire database? No. A BLAST search among 1, sequences needs less than 2 minutes, while calculation of 1, alignments needs more than 1, minutes. BLAST uses a heuristic algorithm. All you need to know is how to use BLAST online.

Making Global Alignment Over the Internet EMBL Alignment Tool: from http://www.ebi.ac.uk, open http://www.ebi.ac.uk/tools/psa/emboss_needle/ (example input file: global.fasta).

Making Global Alignment Over the Internet EMBL Alignment Tool: http://www.ebi.ac.uk/tools/psa/emboss_needle/
- small Gap Open + large Gap Extend = dispersive gaps in the alignment (many short gaps)
- large Gap Open + small Gap Extend = concentrative gaps in the alignment (few long gaps)
Adjust Gap Open and Gap Extend according to your expectation.
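The effect of the two parameters can be checked with simple arithmetic. The sketch assumes one common convention, in which a gap of length k costs gap_open + (k - 1) * gap_extend; the exact formula used by a given tool may differ, and the function name and parameter values are made up for illustration.

```python
def total_gap_cost(gap_lengths, gap_open, gap_extend):
    """Total penalty for a set of gaps under the assumed affine convention."""
    return sum(gap_open + (length - 1) * gap_extend for length in gap_lengths)


one_long_gap = [4]            # four gap positions in a single gap
four_short_gaps = [1, 1, 1, 1]  # four gap positions spread over four gaps

for gap_open, gap_extend in ((1, 10), (10, 1)):
    long_cost = total_gap_cost(one_long_gap, gap_open, gap_extend)
    short_cost = total_gap_cost(four_short_gaps, gap_open, gap_extend)
    cheaper = "dispersed" if short_cost < long_cost else "concentrated"
    print(f"open={gap_open}, extend={gap_extend}: "
          f"one long gap costs {long_cost}, four short gaps cost {short_cost} "
          f"-> {cheaper} gaps are cheaper")
```

With a small opening penalty and a large extension penalty, scattering the same four gap positions is cheaper; with the values swapped, a single long gap wins.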

Making Local Alignment Over the Internet EMBL Alignment Tool: http://www.ebi.ac.uk/tools/psa/emboss_water/
Example input (local.fasta):
>Seq1
SEQUENCEMHHHHHHSSGVDLGTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNEL
IPNLEKEDGSILICLYESYFDPGKSISENIVSFIEKSYKSIFVLSPNFVQNEWCHYEFYFAH
HNLFHENSDHIILILLEPIPFYCIPTRYHKLKALLEKKAYLEWPKDRRKCGLFWANLRAAIN
>Seq2
GTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNELIPNLEKEDGSILICLYESYFD
PGKEWCHYEFYFAHHNLFHENSDHIILILLEPIPFYCIPTRAAAAAAAAAAA

Difference between Global and Local Alignments
Global alignment: Length: 186, Identity: 103/186 (55.4%), Similarity: 103/186 (55.4%)
Local alignment: Length: 130, Identity: 103/130 (79.2%), Similarity: 103/130 (79.2%)

Free Pairwise Alignment over the Internet
Online Pairwise Alignment Programs

  Name      Alignment Type                          URL
  EMBL      Global/Local                            http://www.ebi.ac.uk/tools/psa
  PIR       Global                                  http://pir.georgetown.edu/pirwww/search/pairwise.shtml
  Lalign    Global/Local                            http://www.ch.embnet.org/software/lALIGN_form.html
  LAGAN     Global                                  http://lagan.stanford.edu/lagan_web/index.shtml
  AlignMe   Alignment of membrane proteins          http://www.bioinfo.mpg.de/alignme/alignme.html
  MCALIGN   Alignment of non-coding DNA sequences   http://homepages.ed.ac.uk/eang33/mcalign/mcinstructions.html

Multiple Sequence Alignment A multiple sequence alignment (MSA) is a global sequence alignment of three or more sequences.

Multiple Sequence Alignment 4 main criteria for building a multiple sequence alignment:
- Structural similarity: amino acids that play the same role in each structure are expected in the same column. This is very difficult; only structure-superposition programs can satisfy this criterion.
- Evolutionary similarity: amino acids that derive from the same residue in the common ancestor of all the sequences are put in the same column. No automatic program uses exactly this criterion, but they all try to respect it.
- Functional similarity: amino acids with the same function are in the same column. Again, no automatic program uses exactly this criterion, but if the information is available, you can edit your alignment manually.
- Sequence similarity: amino acids in the same column are those that yield an alignment with maximum similarity. Most programs use this criterion, because it is the easiest.

Multiple Sequence Alignment Main applications of MSA:
1. Extrapolation: whether an uncharacterized sequence is really a member of a protein family.
2. Phylogenetic analysis: the phylogenetic tree of the aligned sequences can be reconstructed.
3. Pattern identification: very conserved positions with a certain function can be used to generate a sequence pattern or sequence logo.
4. Domain identification: to turn an MSA into a profile (position-specific weight matrix) that describes a protein domain.
5. DNA regulatory elements: to turn a DNA MSA of a binding site into a profile and scan other DNA sequences for potential binding sites.
6. Structure prediction: to predict protein/RNA secondary structures by similarity.
7. nsSNP analysis: an MSA can help you predict whether a non-synonymous single-nucleotide polymorphism is likely to be harmful.
8. PCR analysis: a good multiple alignment can help you identify the less degenerate portions of a protein family, in order to fish out new members by PCR (polymerase chain reaction).

Choosing the Right Sequences MSA is not for an arbitrary group of sequences. Instead, the sequences should be members of the same protein family, and they all share a common ancestor.

Choosing the Right Sequences Naming sequences in the right way:
- Never use white spaces in your sequence names. Use the underscore (_) to replace spaces, e.g. My Seq 1 -> My_Seq_1.
- Do not use special symbols (such as Chinese characters, @, #, &, ^ etc.), e.g. not 我的序列壹 or Seq1@li.com.
- Never use names longer than 15 characters, e.g. not This_is_my_favorite_sequence_about_mouse.
- Never give the same name to two different sequences in your set.
If you don't obey these naming rules, some MSA programs may automatically change the names of your sequences, without the courtesy of telling you.
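The four naming rules can be applied automatically before submitting sequences. The helper below is a hypothetical sketch: it replaces white space with underscores, drops every character other than ASCII letters, digits and underscores, truncates to 15 characters, and appends a numeric suffix to keep names unique. The fallback name `seq` for names that end up empty is an arbitrary choice.

```python
import re


def sanitize_names(names, max_length=15):
    """Return cleaned, unique sequence names of at most max_length characters."""
    cleaned, seen = [], set()
    for name in names:
        base = re.sub(r"\s+", "_", name.strip())        # rule 1: no white space
        base = re.sub(r"[^A-Za-z0-9_]", "", base)       # rule 2: no special symbols
        base = base[:max_length] or "seq"               # rule 3: at most 15 characters
        candidate, n = base, 1
        while candidate in seen:                        # rule 4: no duplicate names
            n += 1
            suffix = f"_{n}"
            candidate = base[: max_length - len(suffix)] + suffix
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned


raw_names = [
    "My Seq 1",
    "我的序列壹",
    "Seq1@li.com",
    "This_is_my_favorite_sequence_about_mouse",
    "My Seq 1",
]
clean_names = sanitize_names(raw_names)
for raw, clean in zip(raw_names, clean_names):
    print(f"{raw!r} -> {clean!r}")
```

Keeping a record of the mapping from old to new names, as the loop above prints, avoids the surprise of silently renamed sequences.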

Choosing the Right Sequences
Choosing the right number of sequences: start with a relatively small number of sequences (10-15), and increase the size of the set only after you see something interesting happening with this small set. In any case, it's hard to see any reason for generating an MSA with more than 50 sequences.
If you start with hundreds of sequences, you immediately run into trouble:
Computing big alignments is difficult.
Building big alignments is difficult.
Displaying big alignments is difficult.
Using big alignments is difficult.
Making accurate big alignments is difficult.

The most commonly used MSA packages.
Before you start making multiple sequence alignments, you must know that none of the methods available today is perfect. They all use approximations, because an exact alignment needs a dynamic-programming table with one dimension per sequence.

Dynamic-programming table for two sequences (seq1 across the top, seq2 down the side):

          -    P    Y    M    N    V    I
     -    0   -1   -2   -3   -4   -5   -6
     P   -1    7    6    5    4    3    2
     Y   -2    6   14   13   12   11   10
     E   -3    5   13   12   13   12   11
     L   -4    4   12   15   14   14   14
     F   -5    3   11   14   13   13   14

2 sequences = 2D
3 sequences = 3D
n sequences = nD
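The two-sequence table can be filled with the standard global-alignment recurrence, in which each cell takes the best of a diagonal, an upward and a leftward move. The slide does not name its scoring scheme. The sketch below assumes BLOSUM62 substitution scores (only the residue pairs needed here are listed) and a linear gap penalty of -1, which together yield the values shown for PYMNVI against PYELF. The function names are hypothetical.

```python
# Substitution scores for the residue pairs used below.
# Assumption: values taken from BLOSUM62; the slide does not state its matrix.
SCORES = {
    "P": {"P": 7, "Y": -3, "M": -2, "N": -2, "V": -2, "I": -3},
    "Y": {"P": -3, "Y": 7, "M": -1, "N": -2, "V": -1, "I": -1},
    "E": {"P": -1, "Y": -2, "M": -2, "N": 0, "V": -2, "I": -3},
    "L": {"P": -3, "Y": -1, "M": 2, "N": -3, "V": 1, "I": 2},
    "F": {"P": -4, "Y": 3, "M": 0, "N": -3, "V": -1, "I": 0},
}
GAP = -1  # assumed linear gap penalty


def fill_table(row_seq, col_seq, scores=SCORES, gap=GAP):
    """Fill the global-alignment score table for two sequences."""
    rows, cols = len(row_seq) + 1, len(col_seq) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + scores[row_seq[i - 1]][col_seq[j - 1]]
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            table[i][j] = max(diagonal, up, left)
    return table


def table_cells(lengths):
    """Number of cells in the n-dimensional table for sequences of these lengths."""
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


for row in fill_table("PYELF", "PYMNVI"):
    print(" ".join(f"{value:4d}" for value in row))
print(table_cells([300, 300]), table_cells([300] * 5))
```

The last line illustrates why exact alignment does not scale: the number of cells grows as the product of the sequence lengths, so each additional sequence multiplies the work.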

The most commonly used MSA packages. ClustalW - the most commonly used MSA package. Tcoffee - one of the latest MSA packages that you can use. MUSCLE - one of the fastest alignment methods around.

The most commonly used MSA packages. ClustalW is the latest of the Clustal software series. Clustal was the first multiple sequence alignment program. These days, with more than 35,000 citations, the ClustalW paper is one of the most widely cited scientific publications in the history of biology. ClustalW uses a progressive algorithm. This means that it adds sequences one by one, instead of aligning all the sequences at the same time.

The most commonly used MSA packages.
A List of ClustalW Servers

Name        Location   URL
EBI         Europe     http://www.ebi.ac.uk/tools/msa/clustalw2
PIR         USA        http://pir.georgetown.edu/pirwww/search/multialn.shtml
EMBnet      Europe     http://www.ch.embnet.org/software/clustalw.html
BCM         USA        http://searchlauncher.bcm.tmc.edu/multialign/options/clustalw.html
GenomeNet   Japan      http://www.genome.jp/tools/clustalw
DDBJ        Japan      http://clustalw.ddbj.nig.ac.jp/top-j.html
Strasbourg  Europe     http://bips.u-strasbg.fr/fr/documentation/ClustalW

The most commonly used MSA packages. EMBL ClustalW http://www.ebi.ac.uk

The most commonly used MSA packages. EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2 Input file: msa.fasta (the TIR domains of human TLR1-10)

The most commonly used MSA packages. EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2 The sequences in the alignment are sorted by the pairwise identity.
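Pairwise identity can be computed directly from the rows of an alignment. The sketch below uses one of several possible definitions, identical positions divided by positions where both sequences have a residue, so the numbers may differ from what a given server reports. The function names and the example rows are assumptions.

```python
from itertools import combinations


def percent_identity(row_a, row_b):
    """Percent identity over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def rank_pairs(alignment):
    """Return (identity, name_a, name_b) tuples, most similar pair first.

    `alignment` maps sequence names to aligned rows.
    """
    pairs = [
        (percent_identity(alignment[a], alignment[b]), a, b)
        for a, b in combinations(alignment, 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical aligned rows.
rows = {"s1": "PYMNVI", "s2": "PYMNVL", "s3": "PYE-LF"}
for identity, a, b in rank_pairs(rows):
    print(f"{a} vs {b}: {identity:.1f}%")
```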

The most commonly used MSA packages. EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2
Residue colours:
Red: Hydrophobic
Blue: Acidic
Magenta: Basic
Green: Hydroxyl + Amine + Basic
Gray: Others

The most commonly used MSA packages. EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2
Conservation symbols:
(*) A star indicates an entirely conserved column.
(:) A colon indicates a column where all the residues have roughly the same size and the same hydropathy.
(.) A single dot indicates a column where the size or the hydropathy has been preserved in the course of evolution.
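The symbol line can be derived column by column: a star when every residue is identical, a colon when all residues fall into one "strongly similar" group, a dot when they fall into one "weakly similar" group. The sketch below assumes upper-case protein rows. The group lists are given for illustration and should be checked against the Clustal documentation before being relied on, and the function names are hypothetical.

```python
# Assumed residue groups (illustrative; verify against the Clustal documentation).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def consensus_line(alignment):
    """Build the symbol line printed under an alignment block."""
    return "".join(column_symbol(column) for column in zip(*alignment))


rows = ["PYMNV", "PYLSE", "PYIGK"]
for row in rows:
    print(row)
print(consensus_line(rows))
```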

The most commonly used MSA packages. Tcoffee is a recent method developed for conducting multiple sequence alignments. It uses a principle that's a bit similar to ClustalW, but it yields more accurate alignments at the cost of a slightly longer running time. Tcoffee builds a progressive alignment like ClustalW, but it compares segments across the entire sequence set. Home page: http://www.tcoffee.org or http://tcoffee.crg.cat

The most commonly used MSA packages.
T-Coffee Mirror sites

Name        URL
SIB         http://tcoffee.vital-it.ch
EBI         http://www.ebi.ac.uk/tools/msa/tcoffee
CNRS        http://www.igs.cnrs-mrs.fr/tcoffee/tcoffee_cgi/index.cgi
Max-Planck  http://toolkit.tuebingen.mpg.de/t_coffee
CBSU        http://cbsuapps.tc.cornell.edu/t_coffee.aspx
EMBnet      http://www.es.embnet.org/services/molbio/t-coffee

The most commonly used MSA packages.
Aside from its accuracy, Tcoffee is distinguished by its ability to align sequences and structures (EXPRESSO), to evaluate the accuracy of an alignment (CORE), and to combine many alternative multiple sequence alignments into one (Mcoffee).

Available Tools on www.tcoffee.org

Usage     Description
TCOFFEE   Produce a multiple sequence alignment with Tcoffee.
CORE      Evaluate the reliability of an existing multiple alignment.
MCOFFEE   Run any requested multiple sequence alignment package and combine all the output into one final alignment.
EXPRESSO  Incorporate all the available structural information in your alignment. Will produce the best sequence alignments if the structures are available.

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat Input file: msa.fasta (the TIR domains of human TLR1-10)

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat Output files: fasta_aln, score_html, phylip, clustalw_aln

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat
When you choose to store your data in a specific format, you must ask yourself four questions:
1. Do most programs support this format?
2. Will my collaborators be able to use it?
3. Can I store all the information I need with this format?
4. Is it easy to manipulate?
If the program you're using doesn't produce alignments in the format you need, you can use a third-party conversion tool to get to the format you want.
fmtseq: http://www.bioinformatics.org/jambw/1/2 or http://evol.mcmaster.ca/pise/5.a/fmtseq.html
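For simple cases, a format conversion can also be scripted. The sketch below converts an aligned FASTA file into sequential PHYLIP with names padded to 10 characters. It uses only the standard library, the function names are hypothetical, and because programs differ in which PHYLIP variant they accept, the output should be checked against the target program.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            # Keep only the first word of the header as the name.
            name = fields[0] if fields else f"seq{len(records) + 1}"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records, name_width=10):
    """Write aligned records as sequential PHYLIP text."""
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, sequence in records:
        # Names are truncated, then padded, to exactly name_width characters.
        lines.append(f"{name[:name_width]:<{name_width}}{sequence}")
    return "\n".join(lines) + "\n"


fasta_text = ">seq1\nPYMN\nVI\n>seq2\nPYE-\nLF\n"
print(to_phylip(parse_fasta(fasta_text)), end="")
```

Truncating names to 10 characters can create duplicates, which is one more reason to keep sequence names short and unique from the start.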

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat EXPRESSO is the latest development of Tcoffee, replacing what was known as 3D-Coffee. When you run Expresso, the program uses BLAST to search the PDB for structures whose sequences are similar to your sequences. It then uses these structures to guide the alignment. Alignments based on structures are expected to be much more accurate than simple sequence alignments.

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat EXPRESSO T-Coffee

The most commonly used MSA packages. T-Coffee http://tcoffee.crg.cat PDB ID

The most commonly used MSA packages. MUSCLE is a newcomer in the MSA area, but it is a remarkably efficient package for making fast, high-quality multiple sequence alignments. MUSCLE is ideal if you want to align several hundred sequences. Home page: http://www.drive5.com/muscle

The most commonly used MSA packages. MUSCLE http://www.ebi.ac.uk/tools/msa/muscle

Searching conserved patterns One sentence summarizes what you really want from your multiple alignment: You want to identify important positions!
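A simple first pass at finding important positions is to score each column by how dominant its most common residue is, and then report runs of highly conserved columns. The sketch below is one crude measure among many. The function names, the threshold and the minimum run length are assumptions.

```python
from collections import Counter


def conservation(alignment):
    """Fraction of rows carrying the most common residue, per column."""
    scores = []
    for column in zip(*alignment):
        residues = [symbol for symbol in column if symbol != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))  # gaps count against the column
    return scores


def conserved_runs(alignment, threshold=1.0, min_length=2):
    """Return half-open (start, end) column ranges, 0-based, that stay at or above the threshold."""
    runs = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last column.
    for index, score in enumerate(conservation(alignment) + [float("-inf")]):
        if score >= threshold and start is None:
            start = index
        elif score < threshold and start is not None:
            if index - start >= min_length:
                runs.append((start, index))
            start = None
    return runs


rows = ["PYMNVIGG", "PYELFIGG", "PYMLF-GG"]
print(conserved_runs(rows))
```

Conserved columns found this way are only candidates: whether a position is functionally important still has to be judged against what is known about the protein family.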

Searching conserved patterns
[Figure: aligned TIR domains of human TLRs, with the BB-loop marked.]
BB-loop: important for TIR domain dimerization and for interaction with downstream adaptors or inhibitors.

Getting Your Multiple Alignment in the Right Format Output files: fasta_aln, score_html, phylip, clustalw_aln

Editing and Publishing Alignments For editing and publishing a multiple sequence alignment, bioinformaticians have developed editors that are specific to multiple sequence alignments. They make it easy for you to see exactly what's going on. Most of these editors require that you install something on your computer. However, if you want to stick to your browser, you can use Jalview. Jalview is a Java applet that you need only load into your Web browser for instant action. Home page: http://www.jalview.org Do not load confidential sequences! The Web interface is NOT secure.

Editing and Publishing Alignments EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html run

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html Close ALL the windows that appear within the Jalview Window, as they only contain sample data.

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html results.clustalw

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html http://www.jalview.org/help.html

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html Colour -> Clustalx

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html Colour -> Clustalx http://www.jalview.org/help.html

Editing and Publishing Alignments When you edit an alignment, what you usually want to do is modify several sequences collectively. To do this, you need to define them as a group, as follows: Keep the Ctrl key pressed while you click the names of sequences 1, 2, 3 and 4 to select them.

Editing and Publishing Alignments
1. Keep the Ctrl key pressed.
2. Put your mouse pointer right where you want to insert or remove the gap.
3. Drag to the left or to the right to shift your sequences.
You can edit one sequence at a time by pressing the Shift key instead of Ctrl.

Editing and Publishing Alignments Perform a pairwise alignment for a pair of selected sequences.

Editing and Publishing Alignments calculate tree for all selected sequences

Editing and Publishing Alignments predict secondary structure for a selected sequence.

Editing and Publishing Alignments JNet Secondary Structure Prediction result

Editing and Publishing Alignments save your alignment as a text/picture

Editing and Publishing Alignments Showtime has finally come: You have the multiple alignment you want, and you're determined to show the world!

Editing and Publishing Alignments
Multiple Alignment Beautifying Tools

Name      URL                                               Description
JalView   http://www.jalview.org                            A multiple alignment editor written in Java
Boxshade  http://www.ch.embnet.org/software/BOX_form.html   Shading in black and white
ESPript   http://espript.ibcp.fr/espript/espript            A very powerful shading-and-coloring tool
MView     http://bio-mview.sourceforge.net                  Adding optional HTML markup to control coloring and web page layout

Exercise (exercise.fasta)
1. Can you make an MSA for these 5 protein sequences?
2. Which two sequences are the most similar? How similar are they (i.e. what is their sequence identity)?
3. What kind of proteins are they?

Notice: Next time (2011/1/19) we will move to the multimedia classroom (多媒体教室) on the 2nd floor, west side, of Building 8.