Introduction to Bioinformatics

Dr. rer. nat. Gong Jing, Cancer Research Center, Medicine School of Shandong University, 2012.11.07

Chapter 3 Alignment

Similarity Searches on Sequence Databases
In the game of Mahjong Titans, you want to find, in a collection of symbols, the one that is the same as a certain symbol. What you can do is compare that symbol with every other one, by eye.
For a protein or DNA sequence, a similarity search means finding, in a collection of sequences, the ones that are similar to a query sequence. For a collection of more than 100,000 sequences, this is done with BLAST.

The Importance of Similarity
Similar sequences often derive from a common ancestral sequence. They probably share a similar structure and biological function. What you know about a particular DNA or protein sequence can therefore be transferred to all similar DNA or protein sequences.
Similar sequences -> similar structures -> similar functions

Identity and Similarity
Residue: a letter; an amino acid in a protein; a base in a nucleotide sequence.
Identity: If two sequences (protein or DNA) have the same length, the identity between them is defined as the percentage of identical residues relative to their length.
Similarity: If two sequences (protein or DNA) have the same length, the similarity between them is defined as the percentage of identical and similar residues relative to their length.
Similar or not: defined by a matrix, such as BLOSUM.
Example:
seq 1 : C L H K
seq 2 : C I H L
C/C and H/H are identical; L/I is similar; K/L is neither.
Identity = 2/4 = 50%
Similarity = 3/4 = 75%
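The CLHK / CIHL example above can be checked with a few lines of Python. This is only a minimal sketch: the small score table holds the four BLOSUM62 entries needed here, and treating a non-identical pair as "similar" when its score is greater than 0 is an assumption, since the slide only says that a matrix decides.

```python
# Illustrative subset of BLOSUM62 (only the pairs needed for this example).
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("H", "H"): 8, ("I", "L"): 2, ("K", "L"): -2,
}

def score(a, b):
    """Look up a pair score regardless of the order of the two residues."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def identity_and_similarity(seq1, seq2):
    """Percent identity and similarity for two sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    identical = sum(a == b for a, b in zip(seq1, seq2))
    # Assumption: a non-identical pair counts as similar when its score is > 0.
    similar = sum(a != b and score(a, b) > 0 for a, b in zip(seq1, seq2))
    n = len(seq1)
    return 100.0 * identical / n, 100.0 * (identical + similar) / n

identity, similarity = identity_and_similarity("CLHK", "CIHL")
print(f"Identity = {identity:.0f}%, Similarity = {similarity:.0f}%")
```

The function refuses sequences of different lengths on purpose; that case needs an alignment first.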

Identity and Similarity
Which residues are similar and which are not? This is defined by a matrix, such as BLOSUM.
What happens when two sequences have different lengths?
seq 1 : CLHKA
seq 2 : CIHL
Identity? Similarity?

Identity and Similarity
Homologous: In general, if two protein sequences have an identity > 25%, or two DNA sequences have an identity > 70%, they can be regarded as homologous.
Nothing is certain about the meaning of an observed similarity. Some protein sequences are less than 15% identical but have the same 3D structure, while some are 25% identical but have different structures. Homology or non-homology can never be taken for granted. The 25% cutoff is mostly a rule of thumb. In most cases, to decide whether two sequences are truly homologous, you need to consider many other things.
Homology is a binary relationship: yes or no. Similarity or identity is a quantifiable property: 0%-100%.

The Most Popular Search Tool: BLAST
BLAST (Basic Local Alignment Search Tool): a sequence comparison algorithm optimized for speed, used to search sequence databases for optimal local alignments to a query.
Different kinds of BLAST (according to the query):
BLASTn: Search a nucleotide database using a nucleotide query.
BLASTp: Search a protein database using a protein query.
BLASTx: Search a protein database using a translated nucleotide query.
tblastn: Search a translated nucleotide database using a protein query.
tblastx: Search a translated nucleotide database using a translated nucleotide query.

The Most Popular Search Tool: BLAST
The NCBI BLAST server: http://www.ncbi.nlm.nih.gov/
BLASTp: http://blast.ncbi.nlm.nih.gov
Example file: blast.fasta
http://www.crc.sdu.edu.cn/bioinfo/2012
Options on the BLASTp input page:
- BLAST another sequence at the same time
- give a name to your job
- query only a part of your sequence
- select the database you want to search against
- limit the search range to a certain species, e.g. human
- select the algorithm
The BLASTp result page:
- Part 1: a brief summary
- Part 2: graphic summary (sequence length and classification of the input protein; an overview of similar sequences)
- Part 3: descriptions (go to the corresponding database entry; go to the alignment between your query sequence and the matching sequence)
- Part 4: alignment

Upgraded BLAST: PSI-BLAST
Sometimes the standard BLAST is not enough. For instance, you want to catch all the members of a very large protein family, starting with the one sequence that you have. When running BLAST, you catch only the most closely related sequences. The more distant members are not found. In other words, you find your direct friends, but the friends of your friends are missing.
PSI (Position-Specific Iterated)-BLAST first looks for sequences that are closely related to yours; then, gradually, it extends the circle of friends to include sequences that are distantly related.
- How does PSI-BLAST extend the circle of friends?
- With a Position-Specific Weight Matrix and iterations.

Position-Specific Weight Matrix
A Position-Specific Weight Matrix describes the letter distribution at each position (column) for a family of sequences. The distributions can be presented as probabilities or other statistical values.
Seq1: A B C D
Seq2: B B C D
Seq3: A C C D
Seq4: A B D D

       1      2      3      4
A     75%     0      0      0
B     25%    75%     0      0
C      0     25%    75%     0
D      0      0     25%   100%
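The matrix above can be rebuilt by counting letters column by column. The sketch below does that for the four toy sequences; the `matches` helper is a deliberately crude, hypothetical criterion (every letter must have a non-zero frequency at its position) and is not how PSI-BLAST actually scores a hit.

```python
from collections import Counter

def build_pswm(sequences):
    """Return one {letter: frequency} dict per column of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    pswm = []
    for column in zip(*sequences):
        counts = Counter(column)
        pswm.append({letter: n / len(sequences) for letter, n in counts.items()})
    return pswm

def matches(pswm, seq):
    """Toy criterion (assumption): every letter was seen at its position."""
    return len(seq) == len(pswm) and all(
        col.get(letter, 0) > 0 for col, letter in zip(pswm, seq)
    )

family = ["ABCD", "BBCD", "ACCD", "ABDD"]
pswm = build_pswm(family)
for position, column in enumerate(pswm, start=1):
    print(position, {k: f"{v:.0%}" for k, v in sorted(column.items())})
print("BCCD matches:", matches(pswm, "BCCD"))
print("CBDD matches:", matches(pswm, "CBDD"))
```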

Upgraded BLAST: PSI-BLAST
For the query sequence ABCD, the first round of search (first iteration) of PSI-BLAST is just like BLAST. All closely related sequences that differ by one letter (BBCD, ACCD and ABDD) are found, but BCCD, which differs by two residues, is missed.
Then a Position-Specific Weight Matrix is built from ABCD, BBCD, ACCD and ABDD. This matrix is used in the second round of search (second iteration). Since BCCD matches the matrix, it is now found.
Next, a second matrix is built from ABCD, BBCD, ACCD, ABDD and BCCD, and new sequences are found with it. These are the iterations.
PSI-BLAST can detect distant evolutionary relationships, which is especially useful when the proteins returned by the first round of search are all hypothetical, unknown or predicted proteins.
Sequences shown in the figure: BACD BBCD BBAD BBCA BCAD BCCD BCBD ABCD ACCD ACBD BCDD ACCB CBDD ABDD ACDD ABDC

Upgraded BLAST: PSI-BLAST
The NCBI BLAST server: PSI-BLAST http://blast.ncbi.nlm.nih.gov

Upgraded BLAST: PHI-BLAST
PHI (Pattern-Hit Initiated)-BLAST: in every round of BLAST (iteration), you are required to give a sequence pattern to filter the results. Only the BLAST hits that match the pattern are kept as results.
Sequence pattern: [LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)
Yes: VGEAAMPRI
No: VGEAAYPRI
PHI-BLAST can find very exact friends.
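To see why VGEAAMPRI passes the pattern and VGEAAYPRI does not, the pattern can be translated into a regular expression. The converter below is a small sketch that handles only the elements used on this slide plus `{...}` exclusions; `prosite_to_regex` is a hypothetical helper name, not part of any BLAST tool.

```python
import re

# One pattern element: x, a letter, [ABC] or {ABC}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a dash-separated pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {P} = any residue except P
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # (3,7) = repeat 3 to 7 times
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("[LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)")
print(regex)
for seq in ("VGEAAMPRI", "VGEAAYPRI"):
    print(seq, "Yes" if re.search(regex, seq) else "No")
```

The second sequence fails at the sixth position, where Y is not one of L, I, V or M.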

Upgraded BLAST: PHI-BLAST
The NCBI BLAST server: PHI-BLAST http://blast.ncbi.nlm.nih.gov
[Figure labels: Query, PHI-BLAST, BLAST, PSI-BLAST]

Similarity Searches Free over the Internet
BLAST Servers around the World
Location   Server   URL
USA        NCBI     http://www.ncbi.nlm.nih.gov/blast
Europe     ExPASy   http://web.expasy.org/blast
Europe     EBI      http://www.ebi.ac.uk/tools/sss
Japan      DDBJ     http://blast.ddbj.nig.ac.jp
Other similarity search programs:
WU-BLAST - WU stands for Washington University. More sensitive and more gifted at inserting gaps than NCBI-BLAST.
Smith and Waterman (SSEARCH): It's slower, but more accurate than BLAST.
FASTA: It's a bit slower than BLAST but more accurate when making DNA comparisons.
BLAT: Use this for locating cDNA rapidly in a genome or for finding close (mammalian vs. mammalian) proteins in a genome.

Comparing Two Sequences
Comparing two sequences can help you to:
- Convince yourself that two sequences are in fact homologous;
- Find out that your sequences share a domain;
- Identify the exact location of common features, such as disulfide bridges or catalytic active sites.
Domain: a structural and functional unit in a protein.
[Figure labels: single-domain protein, multiple-domain protein]

Comparing Two Sequences: Dot plot
Methods: dot plot, global/local alignment
A dot plot is the simplest means of comparing two sequences. In fact, a dot plot is the only comparison you can do with pencil and paper, without a computer.
Advantages: no biological hypothesis required; results can be analyzed by eye.
Seq1: THEFASTCAT  length(seq1) = 10
Seq2: THEFATCAT   length(seq2) = 9
10 x 9 = 90 comparisons

Seq 1 (columns) vs. Seq 2 (rows):
    T H E F A S T C A T
T   x . . . . . x . . x
H   . x . . . . . . . .
E   . . x . . . . . . .
F   . . . x . . . . . .
A   . . . . x . . . x .
T   x . . . . . x . . x
C   . . . . . . . x . .
A   . . . . x . . . x .
T   x . . . . . x . . x

The diagonals indicate the segments of similarity between the two sequences:
1. THEFA
2. TCAT
3. AT
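The dot plot and its diagonals can also be produced programmatically. The following sketch marks every identical letter pair and then collects diagonal runs of at least two dots; the function names and the `min_len` cutoff are choices made for this illustration only.

```python
def dot_matrix(seq1, seq2):
    """Rows follow seq2, columns follow seq1; True marks identical letters."""
    return [[a == b for a in seq1] for b in seq2]

def diagonal_runs(seq1, seq2, min_len=2):
    """Return (start in seq1, start in seq2, segment) for each diagonal run."""
    runs = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells that merely continue a run started further up-left.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            k = 0
            while i + k < len(seq1) and j + k < len(seq2) and seq1[i + k] == seq2[j + k]:
                k += 1
            if k >= min_len:
                runs.append((i, j, seq1[i:i + k]))
    return runs

seq1, seq2 = "THEFASTCAT", "THEFATCAT"
print("    " + " ".join(seq1))
for letter, row in zip(seq2, dot_matrix(seq1, seq2)):
    print(letter + "   " + " ".join("x" if hit else "." for hit in row))
for start1, start2, segment in diagonal_runs(seq1, seq2):
    print(f"seq1[{start1}] / seq2[{start2}]: {segment}")
```

Positions are 0-based. With `min_len=2` the isolated single dots are ignored and only the three segments listed on the slide remain.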

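Comparing Two Sequences: Dot plot
You can also make a dot plot of one sequence against itself to discover repeated subsequences hidden in it.
Seq1: THEFASTHE

Seq 1 (columns) vs. Seq 1 (rows):
    T H E F A S T H E
T   x . . . . . x . .
H   . x . . . . . x .
E   . . x . . . . . x
F   . . . x . . . . .
A   . . . . x . . . .
S   . . . . . x . . .
T   x . . . . . x . .
H   . x . . . . . x .
E   . . x . . . . . x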

Comparing Two Sequences: Dot plot
Dot plot servers
Name     URL
Dotlet   http://myhits.isb-sib.ch/cgi-bin/dotlet
Dnadot   http://arbl.cvmbs.colostate.edu/molkit/dnadot
Dotter   http://sonnhammer.sbc.su.se/dotter.html
Dottup   http://emboss.sourceforge.net

Comparing Two Sequences: Dot plot
Dot plot servers: Dotlet http://myhits.isb-sib.ch/cgi-bin/dotlet
The Sequence Input Dialog: dotlet.fasta
Controls: substitution matrix (e.g. Blosum62), window size, zoom.
The dots window displays the diagonal plot. The histogram window defines the grayscale. Alignment window.

Comparing Two Sequences: Dot plot
Dot plot servers: Dotlet http://myhits.isb-sib.ch/cgi-bin/dotlet
Use Dot Plot to detect tandem repeats in a sequence.
Tandem repeat: two or more repeated units directly adjacent to each other. Example: CCCABCABCABCDDD
Tandem repeats are often used by evolution to create new proteins or to make them function more efficiently.
A Short Tandem Repeat (STR) in DNA describes a pattern that helps determine an individual's inherited traits. A short tandem repeat polymorphism (STRP) occurs when homologous STR loci differ in the number of repeats between individuals. By identifying repeats of a specific sequence at specific locations in the genome, it is possible to create a genetic profile of an individual. There are currently over 10,000 published STR sequences in the human genome. STR analysis has become the prevalent method for determining genetic profiles in forensic cases.

Comparing Two Sequences: Dot plot
Dot plot servers: Dotlet http://myhits.isb-sib.ch/cgi-bin/dotlet
Use Dot Plot to detect tandem repeats in a sequence.
Example file: tandem.fasta
Tandem repeat: two or more repeated units directly adjacent to each other. Example: CCCABCABCABCDDD

    C C C A B C A B C A B C D D D
C   x . . . . . . . . . . . . . .
C   . x . . . . . . . . . . . . .
C   . . x . . . . . . . . . . . .
A   . . . x . . x . . x . . . . .
B   . . . . x . . x . . x . . . .
C   . . . . . x . . x . . x . . .
A   . . . x . . x . . x . . . . .
B   . . . . x . . x . . x . . . .
C   . . . . . x . . x . . x . . .
A   . . . x . . x . . x . . . . .
B   . . . . x . . x . . x . . . .
C   . . . . . x . . x . . x . . .
D   . . . . . . . . . . . . x . .
D   . . . . . . . . . . . . . x .
D   . . . . . . . . . . . . . . x

1. The number of repeats is equal to the number of diagonals, including the main diagonal.
2. The distance between two adjacent diagonals represents the length of the repeat.
3. The shortest diagonal gives you a single repeat unit.
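The three rules can be tried out on the example sequence with a self-comparison. The sketch below reports, for each diagonal offset above the main diagonal, the longest run of identical letters, and keeps only offsets whose run reaches a minimum length. The cutoff `min_len=3` is an assumption chosen so that the short runs caused by CCC and DDD are filtered out, much as a window size does in Dotlet.

```python
def repeat_diagonals(seq, min_len=3):
    """Map diagonal offset -> longest run of matches when seq is compared with itself."""
    n = len(seq)
    diagonals = {}
    for offset in range(n):
        best = run = 0
        for i in range(n - offset):
            run = run + 1 if seq[i] == seq[i + offset] else 0
            best = max(best, run)
        if best >= min_len:
            diagonals[offset] = best
    return diagonals

diagonals = repeat_diagonals("CCCABCABCABCDDD")
offsets = sorted(diagonals)
print("diagonals (offset: longest run):", diagonals)
print("number of repeats:", len(offsets))
print("repeat length:", offsets[1] - offsets[0])
```

Because letters are compared one by one, the diagonals come out slightly longer than the ABC units themselves (the C in front of the first ABC also matches), but their number and spacing still give three repeats of length three.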

Comparing Two Sequences: Alignment
An alignment is an arrangement of two protein or DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Global alignment is most useful when the two sequences are similar and of roughly equal size.
Local alignment is more useful for dissimilar sequences that are suspected to contain segments of similarity.

Comparing Two Sequences: Alignment
A substitution matrix such as BLOSUM62 gives a score for every pair of amino acids, defining what is similar and how similar.

Comparing Two Sequences: Alignment
Usages of global alignment:
- Checking minor differences between two sequences.
- Analyzing polymorphisms between closely related sequences.
- Comparing two sequences that partly overlap.
Usages of local alignment:
- Comparing two distantly related sequences that share only a few noncontiguous domains.
- Analyzing repeated elements within a single sequence.

Comparing Two Sequences: Global Alignment
How to generate a global alignment?
Input:
Seq1: PYMNVI
Seq2: PYELF
substitution matrix (BLOSUM62)
gap penalty (-1 by default)
The score of a residue vs. another residue is given by the substitution matrix; the gap penalty gives the score of a residue vs. a gap.
Output:
PYMNVI      PYMNVI      PYMNVI--
PY-ELF  or  PYE-LF  or  PY---ELF  or ?

Comparing Two Sequences: Global Alignment
Seq1: PYMNVI
Seq2: PYELF

Step 1: Write Seq1 along the top and Seq2 down the side of an empty matrix, each preceded by a gap (-).

Step 2: Fill the first row and the first column with multiples of the gap penalty.
        -    P    Y    M    N    V    I
  -     0   -1   -2   -3   -4   -5   -6
  P    -1
  Y    -2
  E    -3
  L    -4
  F    -5

Step 3: Fill the remaining cells, where i runs over Seq1 (columns) and j over Seq2 (rows):
S(i, j) = max of
    S(i-1, j-1) + m(s1_i, s2_j)
    S(i, j-1) + gap
    S(i-1, j) + gap
Example, for the cell M vs. E:
S(3, 3) = max of
    S(2, 2) + m(s1_3, s2_3) = 14 + (-2) = 12
    S(3, 2) + gap = 13 + (-1) = 12
    S(2, 3) + gap = 13 + (-1) = 12
        = 12
        -    P    Y    M    N    V    I
  -     0   -1   -2   -3   -4   -5   -6
  P    -1    7    6    5    4    3    2
  Y    -2    6   14   13   12   11   10
  E    -3    5   13   12   13   12   11
  L    -4    4   12   15   14   14   14
  F    -5    3   11   14   13   13   14

Step 4: Trace back. There is at least one path from the lower right corner to the top left corner!
Output:
seq1 PYMNVI
seq2 PY-ELF
     ** :.
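The four steps correspond to the Needleman-Wunsch procedure and can be sketched in Python as follows. The score table contains only the BLOSUM62 entries needed for PYMNVI vs. PYELF, and the tie-breaking order in the traceback (diagonal first, then a gap in Seq2, then a gap in Seq1) is an assumption; other orders may return a different alignment with the same score.

```python
# BLOSUM62 entries for the residue pairs that occur in this example only.
COLS = "PYMNVI"
ROWS = {
    "P": [7, -3, -2, -2, -2, -3],
    "Y": [-3, 7, -1, -2, -1, -1],
    "E": [-1, -2, -2, 0, -2, -3],
    "L": [-3, -1, 2, -3, 1, 2],
    "F": [-4, 3, 0, -3, -1, 0],
}
SCORES = {}
for r, values in ROWS.items():
    for c, v in zip(COLS, values):
        SCORES[(r, c)] = SCORES[(c, r)] = v

def global_alignment(s1, s2, gap=-1):
    """Return (score, aligned s1, aligned s2) for a global alignment."""
    n, m = len(s1), len(s2)
    # S[j][i]: best score for the first i letters of s1 and first j letters of s2.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        S[0][i] = i * gap
    for j in range(1, m + 1):
        S[j][0] = j * gap
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            S[j][i] = max(S[j - 1][i - 1] + SCORES[(s1[i - 1], s2[j - 1])],
                          S[j][i - 1] + gap,
                          S[j - 1][i] + gap)
    # Traceback from the lower right corner to the top left corner.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[j][i] == S[j - 1][i - 1] + SCORES[(s1[i - 1], s2[j - 1])]:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[j][i] == S[j][i - 1] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return S[m][n], "".join(reversed(a1)), "".join(reversed(a2))

score, aligned1, aligned2 = global_alignment("PYMNVI", "PYELF")
print(score)
print(aligned1)
print(aligned2)
```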

Identity and Similarity
Residue: a letter; an amino acid in a protein; a base in a DNA sequence.
For two sequences of the same length, identity is the percentage of identical residues relative to their length, and similarity is the percentage of similar residues (including identical residues) relative to their length. Which residues are similar is defined by a matrix, such as BLOSUM.
What happens when two sequences have different lengths?
seq 1 : CVHKA
seq 2 : CIHL
Identity? Similarity?
With the help of global alignment, we can now define them for sequences with different lengths.

Redefinition of Identity and Similarity
Identity: The identity between two sequences is defined as the percentage of identical residues in their global alignment.
Similarity: The similarity between two sequences is defined as the percentage of similar residues (including identical residues) in their global alignment.
Global alignment:
PYMNVI
PY-ELF
** :.
Identity = 2 / 6 = 33.3%
Similarity = 3 / 6 = 50.0%
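With the redefinition, the denominator becomes the length of the alignment, gap columns included. A short sketch of that calculation, assuming that a pair counts as similar when its BLOSUM62 score is positive (only the three mismatching pairs of this alignment are listed):

```python
# BLOSUM62 scores for the non-identical pairs in this alignment only.
PAIR_SCORES = {("E", "N"): 0, ("L", "V"): 1, ("F", "I"): 0}

def alignment_identity_similarity(aligned1, aligned2, pair_scores):
    """Return (identical, similar incl. identical, alignment length)."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned1, aligned2):
        if "-" in (a, b):
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        elif pair_scores.get(tuple(sorted((a, b))), 0) > 0:
            similar += 1  # assumption: positive score means similar
    return identical, identical + similar, len(aligned1)

ident, simil, length = alignment_identity_similarity("PYMNVI", "PY-ELF", PAIR_SCORES)
print(f"Identity = {ident} / {length} = {100 * ident / length:.1f}%")
print(f"Similarity = {simil} / {length} = {100 * simil / length:.1f}%")
```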

Comparing Two Sequences: Local Alignment
How to generate a local alignment?
Input:
Seq1: PYMNVI
Seq2: MN
substitution matrix (BLOSUM62)
gap penalty (-1 by default)
The score of an arbitrary residue vs. another arbitrary residue is given in the substitution matrix; the gap penalty gives the score of an arbitrary residue vs. a gap.
Output:
PYMNVI      MN
--MN--  or  MN  or ?
  **        **

Comparing Two Sequences: Local Alignment
Seq1: PYMNVI
Seq2: MN
S(i, j) = max of
    0
    S(i-1, j-1) + m(s1_i, s2_j)
    S(i, j-1) + gap
    S(i-1, j) + gap
        -    P    Y    M    N    V    I
  -     0    0    0    0    0    0    0
  M     0    0    0    5    4    3    2
  N     0    0    0    4   11   10    9
Output:
MN
MN
**
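The local variant (Smith-Waterman) differs from the global one in two places: scores are never allowed to drop below 0, and the traceback starts at the highest cell and stops at the first 0. The sketch below uses the BLOSUM62 entries needed for PYMNVI vs. MN only; the function name and the tie-breaking order are choices made for this illustration.

```python
# BLOSUM62 entries for M and N against the letters of PYMNVI only.
COLS = "PYMNVI"
ROWS = {
    "M": [-2, -1, 5, -2, 1, 1],
    "N": [-2, -2, -2, 6, -3, -3],
}
SCORES = {}
for r, values in ROWS.items():
    for c, v in zip(COLS, values):
        SCORES[(r, c)] = SCORES[(c, r)] = v

def local_alignment(s1, s2, gap=-1):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    n, m = len(s1), len(s2)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            S[j][i] = max(0,
                          S[j - 1][i - 1] + SCORES[(s1[i - 1], s2[j - 1])],
                          S[j][i - 1] + gap,
                          S[j - 1][i] + gap)
            if S[j][i] > best:
                best, best_i, best_j = S[j][i], i, j
    # Traceback from the highest cell until a 0 is reached.
    a1, a2, i, j = [], [], best_i, best_j
    while i > 0 and j > 0 and S[j][i] > 0:
        if S[j][i] == S[j - 1][i - 1] + SCORES[(s1[i - 1], s2[j - 1])]:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif S[j][i] == S[j][i - 1] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))

best, part1, part2 = local_alignment("PYMNVI", "MN")
print(best)
print(part1)
print(part2)
```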

Making Global Alignment Over the Internet
BLAST is an abbreviation of Basic Local Alignment Search Tool. In a BLAST search, how is the most similar sequence found? Is the query sequence aligned to each sequence of the entire database?
No. A BLAST search among 100,000 sequences needs 2 minutes, while the calculation of 100,000 alignments needs > 10,000 minutes. BLAST uses a heuristic algorithm.

Making Global Alignment Over the Internet
EMBL Alignment Tool: http://www.ebi.ac.uk
Example file: global.fasta
small Gap Open + large Gap Extend = many short, scattered gaps in the alignment
large Gap Open + small Gap Extend = few long, concentrated gaps in the alignment
Adjust Gap Open and Gap Extend according to your expectation.
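Why the two penalties push gaps in opposite directions can be seen from a little arithmetic. The sketch below assumes the common affine convention in which a gap of length k costs Gap Open + (k - 1) x Gap Extend; the numeric settings are made-up values for illustration, not defaults of the EMBL tool.

```python
def total_gap_penalty(gap_lengths, gap_open, gap_extend):
    """Total penalty for a set of gaps under an affine gap model (assumed convention)."""
    return sum(gap_open + (k - 1) * gap_extend for k in gap_lengths)

scattered = [1, 1, 1, 1]   # four separate gaps of one residue each
concentrated = [4]         # one gap of four residues

for name, gap_open, gap_extend in [("small open, large extend", 1, 5),
                                   ("large open, small extend", 10, 0.5)]:
    print(name)
    print("  scattered:   ", total_gap_penalty(scattered, gap_open, gap_extend))
    print("  concentrated:", total_gap_penalty(concentrated, gap_open, gap_extend))
```

In the first setting the four scattered gaps are cheaper (4 vs. 16), in the second the single long gap is cheaper (11.5 vs. 40), so the aligner prefers whichever layout costs less.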

Making Local Alignment Over the Internet
EMBL Alignment Tool: http://www.ebi.ac.uk
Example file: local.fasta
>Seq1
SEQUENCEMHHHHHHSSGVDLGTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNEL
IPNLEKEDGSILICLYESYFDPGKSISENIVSFIEKSYKSIFVLSPNFVQNEWCHYEFYFAH
HNLFHENSDHIILILLEPIPFYCIPTRYHKLKALLEKKAYLEWPKDRRKCGLFWANLRAAIN
>Seq2
GTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNELIPNLEKEDGSILICLYESYFD
PGKEWCHYEFYFAHHNLFHENSDHIILILLEPIPFYCIPTRAAAAAAAAAAA

Difference between Global and Local Alignments
                   Length   Identity          Similarity
Global alignment   186      103/186 (55.4%)   103/186 (55.4%)
Local alignment    130      103/130 (79.2%)   103/130 (79.2%)

Free Pairwise Alignment over the Internet
Online Pairwise Alignment Programs
Name      Alignment Type                            URL
EMBL      Global/Local                              http://www.ebi.ac.uk/tools/psa
PIR       Global                                    http://pir.georgetown.edu/pirwww/search/pairwise.shtml
Lalign    Global/Local                              http://www.ch.embnet.org/software/lALIGN_form.html
LAGAN     Global                                    http://lagan.stanford.edu/lagan_web/index.shtml
AlignMe   Alignment of membrane proteins            http://www.bioinfo.mpg.de/alignme/alignme.html
MCALIGN   Alignment of non-coding DNA sequences     http://homepages.ed.ac.uk/eang33/mcalign/mcinstructions.html

Multiple Sequence Alignment
A multiple sequence alignment (MSA) is a global sequence alignment of three or more sequences.

Multiple Sequence Alignment
4 main criteria for building a multiple sequence alignment:
Structural similarity - Amino acids that play the same role in each structure are expected in the same column. This is very difficult; only structure-superimposition programs can satisfy this criterion.
Evolutionary similarity - Amino acids that derive from the same residue in the common ancestor of all the sequences are put in the same column. No automatic program uses exactly this criterion, but they all try to respect it.
Functional similarity - Amino acids with the same function are in the same column. Again, no automatic program uses exactly this criterion, but if the information is available, you can edit your alignment manually.
Sequence similarity - Amino acids in the same column are those that yield an alignment with maximum similarity. Most programs use this criterion, because it is the easiest.

Multiple Sequence Alignment
Main applications of MSA:
1. Extrapolation: deciding whether an uncharacterized sequence is really a member of a protein family.
2. Phylogenetic analysis: the phylogenetic tree of the aligned sequences can be reconstructed.
3. Pattern identification: highly conserved positions with a certain function can be used to generate a sequence pattern or sequence logo.
4. Domain identification: turning an MSA into a profile (position-specific weight matrix) that describes a protein domain.
5. DNA regulatory elements: turning a DNA MSA of a binding site into a profile and scanning other DNA sequences for potential binding sites.
6. Structure prediction: predicting protein/RNA secondary structures by similarity.
7. nsSNP analysis: an MSA can help you predict whether a non-synonymous single-nucleotide polymorphism is likely to be harmful.
8. PCR analysis: a good multiple alignment can help you identify the less degenerate portions of a protein family, in order to fish out new members by PCR (polymerase chain reaction).

Choosing the Right Sequences
An MSA is not built from an arbitrary group of sequences. Instead, the sequences should be members of the same protein family and share a common ancestor.

Choosing the Right Sequences
Naming sequences in the right way:
- Never use white space in your sequence names. Use the underscore (_) to replace spaces. e.g. My Seq 1 -> My_Seq_1
- Do not use special symbols (such as Chinese symbols, @, #, &, ^ etc.). e.g. 我的序列壹, Seq1@li.com
- Never use names longer than 15 characters. e.g. This_is_my_favorite_sequence_about_mouse
- Never give the same name to two different sequences in your set.
If you don't obey these naming rules, some MSA programs may automatically change the names of your sequences, without telling you.
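The naming rules can be checked automatically before a file is submitted. The sketch below is one possible check; restricting names to ASCII letters, digits and the underscore is an assumption that is stricter than the slide, which only lists examples of forbidden symbols.

```python
import re

MAX_LENGTH = 15
ALLOWED = re.compile(r"[A-Za-z0-9_]+")  # assumption: letters, digits, underscore

def check_names(names):
    """Return {name: [problems]} for every name that breaks a naming rule."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if re.search(r"\s", name):
            issues.append("contains white space")
        # Test for special symbols on the name with white space removed,
        # so that each rule is reported separately.
        if not ALLOWED.fullmatch(re.sub(r"\s", "", name)):
            issues.append("contains special symbols")
        if len(name) > MAX_LENGTH:
            issues.append("longer than 15 characters")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems

names = ["My_Seq_1", "My Seq 1", "我的序列壹", "Seq1@li.com",
         "This_is_my_favorite_sequence_about_mouse", "My_Seq_1"]
for name, issues in check_names(names).items():
    print(f"{name}: {', '.join(issues)}")
```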

Choosing the Right Sequences
Choosing the right number of sequences:
- Start with a relatively small number of sequences (10-15).
- Increase the size after something interesting shows up in this small set.
In any case, it's hard to see any reason for generating an MSA with > 50 sequences. If you start with hundreds of sequences, you immediately run into trouble:
- Computing big alignments is difficult.
- Building big alignments is difficult.
- Displaying big alignments is difficult.
- Using big alignments is difficult.
- Making accurate big alignments is difficult.

The most commonly used MSA packages.
Before you start making multiple sequence alignments, you must know that none of the methods available today is perfect. They all use approximations.
2 sequences = 2D matrix
3 sequences = 3D matrix
n sequences = nD matrix
[Figure: the 2D scoring matrix of seq1 (PYMNVI) vs. seq2 (PYELF), extended with a third axis for seq3.]

The most commonly used MSA packages.
ClustalW - the most commonly used MSA package.
Tcoffee - one of the latest MSA packages.
MUSCLE - one of the fastest alignment methods.

The most commonly used MSA packages.
ClustalW is the latest of the Clustal software series. Clustal was the first multiple sequence alignment program. These days, with more than 35,000 citations, ClustalW is one of the most widely cited scientific publications in the history of biology.
ClustalW uses a progressive algorithm. This means that it adds sequences one by one, instead of aligning all the sequences at the same time.

The most commonly used MSA packages.
A List of ClustalW Servers
Name         Location   URL
EBI          Europe     http://www.ebi.ac.uk/tools/msa/clustalw2
PIR          USA        http://pir.georgetown.edu/pirwww/search/multialn.shtml
EMBnet       Europe     http://www.ch.embnet.org/software/clustalw.html
BCM          USA        http://searchlauncher.bcm.tmc.edu/multialign/options/clustalw.html
GenomeNet    Japan      http://www.genome.jp/tools/clustalw
DDBJ         Japan      http://clustalw.ddbj.nig.ac.jp/top-j.html
Strasbourg   Europe     http://bips.u-strasbg.fr/fr/documentation/ClustalW

The most commonly used MSA packages.
EMBL ClustalW: http://www.ebi.ac.uk
Example file: msa.fasta (Human TLR1-10's TIR domains)
The sequences in the alignment are sorted by pairwise identity.
Residue colours:
Red: hydrophobic
Blue: Acidic
Magenta: Basic
Green: Hydroxyl + Amine + Basic
Gray: Others
Conservation symbols:
* Asterisk - an entirely conserved column.
: Double-dot - columns where all the residues have roughly the same size and the same hydropathy.
. Single-dot - columns where the size or the hydropathy has been preserved in the course of evolution.
The guide tree is NOT a phylogenetic tree!

The most commonly used MSA packages.
Tcoffee is a recent method developed for conducting multiple sequence alignments. It uses a principle that's a bit similar to ClustalW, but it yields more accurate alignments at the cost of a slightly longer running time.
Tcoffee builds a progressive alignment like ClustalW, but it compares segments across the entire sequence set.
Home page: http://www.tcoffee.org
http://tcoffee.crg.cat

The most commonly used MSA packages.
T-Coffee Mirror sites
Name         URL
SIB          http://tcoffee.vital-it.ch
EBI          http://www.ebi.ac.uk/tools/msa/tcoffee
CNRS         http://www.igs.cnrs-mrs.fr/tcoffee/tcoffee_cgi/index.cgi
Max-Planck   http://toolkit.tuebingen.mpg.de/t_coffee
CBSU         http://cbsuapps.tc.cornell.edu/t_coffee.aspx
EMBnet       http://www.es.embnet.org/services/molbio/t-coffee

The most commonly used MSA packages.
Aside from its accuracy, the main distinguishing features of Tcoffee are its ability to align sequences and structures (EXPRESSO), the possibility of evaluating the accuracy of an alignment (CORE) and the possibility of combining many alternative multiple sequence alignments into one (Mcoffee).
Available Tools on www.tcoffee.org
Usage      Description
TCOFFEE    Produce a multiple sequence alignment with Tcoffee.
CORE       Evaluate the reliability of an existing multiple alignment.
MCOFFEE    Run any requested multiple sequence alignment package and combine all the output into one final alignment.
EXPRESSO   Incorporate all the available structural information in your alignment. Will produce the best sequence alignments if the structures are available.

The most commonly used MSA packages.
T-Coffee: http://tcoffee.crg.cat
Example file: msa.fasta (Human TLR1-10's TIR domains)
Output files: fasta_aln file, score_html file, phylip file, clustalw_aln file

The most commonly used MSA packages.
T-Coffee: http://tcoffee.crg.cat
When you choose to store your data in a specific format, you must ask yourself four questions:
- Do most programs support this format?
- Will my collaborators be able to use it?
- Can I store all the information I need with this format?
- Is it easy to manipulate?
If the program you're using doesn't produce alignments in the format you need, it is possible to use a third-party conversion tool to get the format you want.
fmtseq: http://www.bioinformatics.org/jambw/1/2 or http://evol.mcmaster.ca/pise/5.a/fmtseq.html

The most commonly used MSA packages.
T-Coffee: http://tcoffee.crg.cat
EXPRESSO is the latest development of Tcoffee, replacing what was known as 3D-Coffee. When you run Expresso, the program uses BLAST to search the PDB for structures whose sequences are similar to your sequences. It then uses these structures to guide the alignment. Alignments based on structures are expected to be much more accurate than simple sequence alignments.
[Figure labels: EXPRESSO, T-Coffee, PDB ID]

The most commonly used MSA packages.
MUSCLE is a newcomer in the MSA area, but it is a remarkably efficient package for making fast, high-quality multiple sequence alignments. MUSCLE is ideal if you want to align several hundred sequences.
Home page: http://www.drive5.com/muscle
MUSCLE at EBI: http://www.ebi.ac.uk/tools/msa/muscle

Searching conserved patterns
One sentence summarizes what you really want from your multiple alignment: You want to identify important positions!

Searching conserved patterns
Sequence Logos: WebLogo
Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment, developed by Tom Schneider and Mike Stephens. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site.
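The two heights described above can be computed directly from an alignment. In the sketch below, the stack height of a column is its information content in bits (log2 of the alphabet size minus the Shannon entropy), and each letter's height is its frequency times the stack height. The small-sample correction that logo programs usually apply is left out, and the four aligned DNA sequences are made-up data.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content of one alignment column in bits (no small-sample correction)."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy

def logo_heights(sequences, alphabet_size=4):
    """Per column: {letter: height in bits}; the heights sum to the stack height."""
    heights = []
    for column in zip(*sequences):
        info = column_information(column, alphabet_size)
        n = len(column)
        heights.append({letter: info * c / n for letter, c in Counter(column).items()})
    return heights

# Hypothetical aligned DNA sequences, for illustration only.
alignment = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
for position, stack in enumerate(logo_heights(alignment), start=1):
    total = sum(stack.values())
    letters = ", ".join(f"{k}={v:.2f}" for k, v in sorted(stack.items()))
    print(f"position {position}: stack height {total:.2f} bits ({letters})")
```

For protein alignments, `alphabet_size=20` would be the corresponding setting.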

Searching conserved patterns Sequence Logos: WebLogo http://weblogo.berkeley.edu WebLogo is a web-based application designed to make the generation of sequence logos easy and painless. WebLogo has featured in over 150 scientific publications.

Searching conserved patterns Sequence Logos: WebLogo http://weblogo.berkeley.edu (screenshots: example input file promoter.seqs; position labels 20 and 30)

Searching conserved patterns Sequence Logos: WebLogo http://weblogo.berkeley.edu In the promoter region of genes, we often find a special fragment called the TATA box (also called the Goldberg-Hogness box). The TATA box has the core DNA sequence 5'-TATAAA-3' or a variant. It is recognized by the TATA-binding protein, which helps recruit RNA polymerase II to the promoter. http://correlogo.abcc.ncifcrf.gov

Searching conserved patterns Sequence Motifs: MEME http://meme.sdsc.edu/meme/intro.html Sequence motif: a nucleotide or amino-acid sequence pattern that is widespread and has a biological significance. Example: the N-glycosylation site motif is Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro. This pattern can be written as N{P}[ST]{P} (a regular-expression-like notation), where N = Asn, P = Pro, S = Ser, T = Thr; {X} means any amino acid except X; and [XY] means either X or Y. The notation [XY] does not give any indication of the probability of X or Y occurring in the pattern. Observed probabilities can be graphically represented using sequence logos.
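
As a rough sketch, the pattern N{P}[ST]{P} can be translated into Python regular-expression syntax, where {P} becomes [^P]. The function name and the demo protein string below are made up for illustration; the lookahead is one possible way to report sites that overlap.

```python
import re

# N{P}[ST]{P} in Python regex syntax. Wrapping the pattern in a lookahead
# lets overlapping occurrences be found as well.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(protein):
    """Return (1-based position, matched 4-mer) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(protein)]

# Hypothetical sequence, not a real protein.
demo = "MKNASANPTLNGTVNNSSK"
sites = find_n_glycosylation_sites(demo)
print(sites)  # [(3, 'NASA'), (11, 'NGTV'), (15, 'NNSS'), (16, 'NSSK')]
```

Note that NPTL at position 7 is correctly rejected because Pro follows Asn, and that the sites at positions 15 and 16 overlap.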

Searching conserved patterns Sequence Motifs: MEME http://meme.sdsc.edu/meme/intro.html The MEME Suite: motif-based sequence analysis tools. The MEME Suite allows you to discover motifs in groups of related DNA or protein sequences, search sequence databases using motifs, and compare a motif to all motifs in a database of motifs. Home page: http://meme.sdsc.edu/meme/intro.html

Searching conserved patterns Sequence Motifs: MEME http://meme.sdsc.edu/meme/intro.html (screenshots of the MEME web server; example input file meme.seqs)

Searching conserved patterns Sequence Motifs: MEME http://meme.sdsc.edu/meme/intro.html One sentence summarizes what you really want from your multiple alignment: You want to identify important positions!

Searching conserved patterns Human TLR1-TIR, Human TLR2-TIR, Human TLR10-TIR: BB-loop. The BB-loop is important for TIR domain dimerization and interaction with downstream adaptors or inhibitors.

Editing and Publishing Alignments Alignment files: fasta_aln file, score_html file, phylip file, clustalw_aln file

Editing and Publishing Alignments For editing and publishing a multiple sequence alignment, bioinformaticians have developed text editors that are specific to multiple sequence alignments. They make it easy for you to see exactly what's going on. Most of these editors require the installation of something on your computer. However, if you want to stick to your browser, you can use Jalview. Jalview is a Java applet that you need only load into your Web browser for instant action. Home page: http://www.jalview.org Do not load confidential sequences! The Web interface is NOT secure.

Editing and Publishing Alignments EMBL ClustalW http://www.ebi.ac.uk/tools/msa/clustalw2

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html (screenshots: downloading and running Jalview)

Close ALL the windows that appear within the Jalview window, as they only contain sample data.

Editing and Publishing Alignments Jalview http://www.jalview.org/download.html http://www.jalview.org/help.html (screenshots: the file results.clustalw in Jalview; menu Colour -> Clustalx)

Editing and Publishing Alignments When you edit an alignment, what you usually want to do is modify several sequences collectively. To do this, you need to define them as a group, as follows: keep the Ctrl key pressed while you click the names of sequences 1, 2, 3 and 4 to select them.

Editing and Publishing Alignments 1. Keep the Ctrl key pressed. 2. Put your mouse pointer right where you want to insert or remove the gap. 3. Drag to the left or to the right to shift your sequences. You can edit one sequence at a time by pressing the Shift key instead of Ctrl.

Editing and Publishing Alignments Perform a pairwise alignment for a pair of selected sequences.

Editing and Publishing Alignments Calculate a tree for all selected sequences.

Editing and Publishing Alignments Predict secondary structure for a selected sequence.

Editing and Publishing Alignments JNet secondary structure prediction result.

Editing and Publishing Alignments Save your alignment as text or as a picture.

Editing and Publishing Alignments Showtime has finally come: You have the multiple alignment you want, and you're determined to show the world!

Editing and Publishing Alignments Multiple Alignment Beautifying Tools
Name | URL | Description
JalView | http://www.jalview.org | A multiple alignment editor written in Java
Boxshade | http://www.ch.embnet.org/software/BOX_form.html | Shading in black and white
ESPript | http://espript.ibcp.fr/espript/espript | A very powerful shading and coloring tool
MView | http://bio-mview.sourceforge.net | Adding optional HTML markup to control coloring and web page layout

exercise.fasta Can you make an MSA for these 5 protein sequences? Which two sequences are the most similar? How similar are they (i.e., what is their sequence identity)? What kind of proteins are they?
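
Once the sequences have been aligned and saved in FASTA format, the identity question can be checked with a short script along the following lines. This is a minimal sketch, not a reference solution: the toy alignment below is hypothetical (it is not the content of exercise.fasta), and identity is computed here over columns where both sequences have a residue, which is only one of several conventions for handling gaps.

```python
from itertools import combinations

def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def percent_identity(a, b, gap="-"):
    """Identical residues divided by columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def most_similar_pair(aligned):
    """Return (identity, name1, name2) for the pair with the highest identity."""
    return max(
        (percent_identity(aligned[p], aligned[q]), p, q)
        for p, q in combinations(aligned, 2)
    )

# Hypothetical aligned sequences, for illustration only.
toy = """>seq1
CLHK-A
>seq2
CIHLWA
>seq3
CLHKWA
"""
aligned = parse_fasta(toy)
best = most_similar_pair(aligned)
print(best)  # (100.0, 'seq1', 'seq3')
```

For the ungapped pair CLHK and CIHL, `percent_identity` gives 50%, matching the definition of identity used earlier in this chapter.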