Introduction to Bioinformatics


1 Dr. rer. nat. Gong Jing Cancer Research Center Medicine School of Shandong University

2 Chapter 3: Alignment

3 Similarity Searches on Sequence Databases In the game of Mahjong Titans, you search a collection of symbols for one that is the same as a given symbol. What you can do is compare that symbol with every other one, by eye.

4 Similarity Searches on Sequence Databases For a protein or DNA sequence, a similarity search means searching a collection of sequences for ones that are similar to a query sequence. With more than 100,000 sequences in the collection, this calls for a tool such as BLAST.

5 The Importance of Similarity Similar sequences often derive from a common ancestral sequence. They probably share a similar structure and biological function, so what you know about a particular DNA or protein sequence can often be extended to all similar DNA or protein sequences. Similar sequences, similar structures, similar functions.

6 The Importance of Similarity Similar structure? Similar function? Brothers?

7 Identity and Similarity
Residue: a letter; an amino acid in a protein or a base in a nucleotide sequence.
Identity: if two sequences (protein or DNA) have the same length, the identity between them is the percentage of identical residues relative to their length.
Similarity: if two sequences (protein or DNA) have the same length, the similarity between them is the percentage of identical and similar residues relative to their length.
Which residues count as similar is defined by a substitution matrix, such as BLOSUM.

8 Identity and Similarity Example:
seq 1: CLHK
seq 2: CIHL
Identity = 2/4 = 50% (C and H are identical)
Similarity = 3/4 = 75% (C and H are identical; L and I are similar)

9 Identity and Similarity What happens when two sequences have different lengths?
seq 1: CLHKA
seq 2: CIHL
Identity? Similarity?

10 Identity and Similarity Homologous: in general, if two protein sequences have an identity > 25%, or two DNA sequences have an identity > 70%, they can be regarded as homologous. However, nothing is certain about the meaning of an observed similarity. Some protein sequences are less than 15% identical but have the same 3D structure, while others are 25% identical but have different structures. Homology or non-homology is never guaranteed; the 25% cutoff is mostly a rule of thumb. In most cases, to be sure that two sequences are truly homologous, you need to consider other evidence as well. Homology is a binary relationship (yes or no), whereas similarity and identity are quantifiable properties (0%-100%).

11 The Most Popular Search Tool: BLAST BLAST (Basic Local Alignment Search Tool) is a sequence comparison algorithm, optimized for speed, that is used to search sequence databases for optimal local alignments to a query.
Different kinds of BLAST (according to the query):
BLASTn: search a nucleotide database using a nucleotide query.
BLASTp: search a protein database using a protein query.
BLASTx: search a protein database using a translated nucleotide query.
tBLASTn: search a translated nucleotide database using a protein query.
tBLASTx: search a translated nucleotide database using a translated nucleotide query.

12 The Most Popular Search Tool: BLAST Different kinds of BLAST (according to the algorithm): standard BLAST, PSI-BLAST, PHI-BLAST.

13 The Most Popular Search Tool: BLAST The NCBI BLAST server

14 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp

15 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Example input: blast.fasta

17 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Input options: BLAST another sequence at the same time; give a name to your job; query only a part of your sequence.

18 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Select the database you want to search against.

19 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Limit the search range to a certain species, e.g. human; select the algorithm.

20 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Results, part 1: a brief summary.

21 The Most Popular Search Tool: BLAST Results, part 2: graphic summary, showing the sequence length and classification of the input protein and an overview of similar sequences.

22 The Most Popular Search Tool: BLAST The NCBI BLAST server: BLASTp Results, part 3: descriptions, with links to the corresponding database entry and to the alignment between your query sequence and the matching sequence.

23 The Most Popular Search Tool: BLAST Results, part 4: alignment.

24 English Courses Upgraded BLAST: PSI-BLAST Sometimes standard BLAST is not enough. For instance, you may want to catch all the members of a very large protein family, starting from the one sequence that you have. When running BLAST, you catch only the most closely related sequences; the more distant members are not found. In other words, you find your direct friends, but the friends of your friends are missing. PSI (Position-Specific Iterated)-BLAST first looks for sequences that are closely related to yours and then gradually extends the circle of friends to include sequences that are distantly related. How does PSI-BLAST extend the circle of friends? With a Position-Specific Weight Matrix and iterations.

25 Position-Specific Weight Matrix A Position-Specific Weight Matrix describes the letter distribution at each position (column) of a family of sequences. The distributions can be presented as probabilities or other statistical values.
Seq1: A B C D
Seq2: B B C D
Seq3: A C C D
Seq4: A B D D

       Pos 1   Pos 2   Pos 3   Pos 4
A      75%     0       0       0
B      25%     75%     0       0
C      0       25%     75%     0
D      0       0       25%     100%
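The table on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not how PSI-BLAST itself builds its matrix: it uses plain column frequencies, and the helper names `position_frequencies` and `match_probability` are made up for this illustration.

```python
from collections import Counter

def position_frequencies(sequences):
    """Return one {letter: frequency} dict per column of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    return [
        {letter: count / n for letter, count in Counter(column).items()}
        for column in zip(*sequences)
    ]

def match_probability(sequence, matrix):
    """Multiply the column frequencies of each letter (0.0 if a letter was never seen)."""
    probability = 1.0
    for letter, column in zip(sequence, matrix):
        probability *= column.get(letter, 0.0)
    return probability

family = ["ABCD", "BBCD", "ACCD", "ABDD"]
pwm = position_frequencies(family)
print(pwm[0])                          # {'A': 0.75, 'B': 0.25}
print(match_probability("BCCD", pwm))  # non-zero: every letter was seen in its column
print(match_probability("DDDD", pwm))  # 0.0: D never occurs at position 1
```

A sequence that gets a non-zero value is compatible with the family at every position, even if it differs from each individual member.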

26 Upgraded BLAST: PSI-BLAST For the query sequence ABCD, the first round of search (first iteration) of PSI-BLAST is just like BLAST. All closely related sequences that differ by one letter (BBCD, ACCD and ABDD) are found, but BCCD, which differs by two residues, is missed. Then a Position-Specific Weight Matrix is made from ABCD, BBCD, ACCD and ABDD. This matrix is used in the second round of search (second iteration). Since BCCD matches the matrix, it is now found. A second matrix is then made from ABCD, BBCD, ACCD, ABDD and BCCD, and with it new sequences will be found, and so on through further iterations. PSI-BLAST can detect distant evolutionary relationships. This is especially useful when the proteins returned by the first round of search are all hypothetical, unknown or predicted proteins.
Sequences shown in the figure: BACD BBCD BBAD BBCA BCAD BCCD BCBD ABCD ACCD ACBD BCDD ACCB CBDD ABDD ACDD ABDC

27 Upgraded BLAST: PSI-BLAST The NCBI BLAST server: PSI-BLAST

30 Upgraded BLAST: PHI-BLAST PHI (Pattern-Hit Initiated)-BLAST: in every round of BLAST (iteration), you are required to give a sequence pattern to filter the results. Only the BLAST hits that match the pattern are kept as results.
Sequence pattern: [LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)
Yes: VGEAAMPRI
No: VGEAAYPRI
PHI-BLAST can find very exact friends.
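To see why the first example sequence matches the pattern and the second does not, the pattern can be translated into an ordinary regular expression. The sketch below is only an illustration of the pattern syntax used on the slide; `prosite_to_regex` and `contains_pattern` are hypothetical helper names, and the translation covers just the elements that appear in this chapter (`x`, `[..]`, `{..}` and repeat counts).

```python
import re

def prosite_to_regex(pattern):
    """Translate a dash-separated pattern such as [LIVMF]-G-E-x(3,7) into a regex."""
    parts = []
    for element in pattern.split("-"):
        # Split an element into its core and an optional repeat count "(n)" or "(n,m)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # x means any residue
            core = "."
        elif core.startswith("{"):           # {P} means anything but P
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

def contains_pattern(sequence, pattern):
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(prosite_to_regex(pattern), sequence) is not None

PATTERN = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(3,7)"
print(prosite_to_regex(PATTERN))               # [LIVMF]GE.[GAS][LIVM].{3,7}
print(contains_pattern("VGEAAMPRI", PATTERN))  # True
print(contains_pattern("VGEAAYPRI", PATTERN))  # False: Y is not in [LIVM]
```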

31 Upgraded BLAST: PHI-BLAST The NCBI BLAST server: PHI-BLAST

32 Upgraded BLAST: PHI-BLAST (Figure: the result sets of PHI-BLAST, BLAST and PSI-BLAST around the query.)

33 Free Similarity Searches over the Internet
BLAST servers around the world:
Location   Server
USA        NCBI
Europe     ExPASy
Europe     EBI
Japan      DDBJ
Other search tools:
WU-BLAST: WU stands for Washington University. More sensitive and more gifted at inserting gaps than NCBI-BLAST.
Smith and Waterman (SSEARCH): slower, but more accurate than BLAST.
FASTA: a bit slower than BLAST, but more accurate when making DNA comparisons.
BLAT: use this for locating cDNA rapidly in a genome or for finding close (mammalian vs. mammalian) proteins in a genome.

34 Comparing two sequences can help you to: convince yourself that two sequences are in fact homologous; find out that your sequences share a domain; identify the exact location of common features, such as disulfide bridges or catalytic active sites. Domain: a structural and functional unit in a protein. (Figure: a single-domain protein and a multiple-domain protein.)

35 Comparing Two Sequences: Dot plot Methods: dot plot, global/local alignment. The dot plot is the simplest means of comparing two sequences. In fact, it is the only type of comparison you can do with pencil and paper, without a computer. Advantages: no biological hypothesis is required, and the results can be analyzed by eye.
Seq1: THEFASTCAT (length 10, columns)
Seq2: THEFATCAT (length 9, rows)
10 x 9 = 90 comparisons; x marks identical letters.

      T H E F A S T C A T
    T x . . . . . x . . x
    H . x . . . . . . . .
    E . . x . . . . . . .
    F . . . x . . . . . .
    A . . . . x . . . x .
    T x . . . . . x . . x
    C . . . . . . . x . .
    A . . . . x . . . x .
    T x . . . . . x . . x

36 Comparing Two Sequences: Dot plot The diagonals indicate the segments of similarity between the two sequences: 1. THEFA 2. TCAT 3. AT
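The pencil-and-paper procedure translates directly into code. The following sketch builds the dot matrix for the two example sequences and reads off the diagonal segments; the function names and the `min_length` cutoff are choices made for this illustration, not part of any dot plot server.

```python
def dot_matrix(seq1, seq2):
    """Rows follow seq2, columns follow seq1; True marks identical letters."""
    return [[a == b for a in seq1] for b in seq2]

def diagonal_runs(seq1, seq2, min_length=2):
    """Return the substrings that form unbroken diagonals of at least min_length."""
    matrix = dot_matrix(seq1, seq2)
    runs = []
    for row in range(len(seq2)):
        for col in range(len(seq1)):
            if not matrix[row][col]:
                continue
            # Skip cells that merely continue a diagonal started further up-left.
            if row > 0 and col > 0 and matrix[row - 1][col - 1]:
                continue
            length = 0
            while (row + length < len(seq2) and col + length < len(seq1)
                   and matrix[row + length][col + length]):
                length += 1
            if length >= min_length:
                runs.append(seq1[col:col + length])
    return runs

seq1, seq2 = "THEFASTCAT", "THEFATCAT"
print("  " + " ".join(seq1))
for letter, row in zip(seq2, dot_matrix(seq1, seq2)):
    print(letter + " " + " ".join("x" if cell else "." for cell in row))
print(sorted(diagonal_runs(seq1, seq2), key=len, reverse=True))  # ['THEFA', 'TCAT', 'AT']
```

Single isolated dots are ignored here; raising `min_length` is a crude way to suppress short chance matches in longer sequences.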

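37 Comparing Two Sequences: Dot plot You can also make a dot plot of one sequence against itself to discover repeated subsequences hidden in it.
Seq1: THEFASTHE

      T H E F A S T H E
    T x . . . . . x . .
    H . x . . . . . x .
    E . . x . . . . . x
    F . . . x . . . . .
    A . . . . x . . . .
    S . . . . . x . . .
    T x . . . . . x . .
    H . x . . . . . x .
    E . . x . . . . . x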

38 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet, Dnadot, Dotter, Dottup

39 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet

40 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet The sequence input dialog. Example input: dotlet.fasta

41 Comparing Two Sequences: Dot plot Dotlet controls: substitution matrix (e.g. Blosum62); window size; zoom. The dots window displays the diagonal plot. The histogram window defines the grayscale. Alignment window.

42 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet Use a dot plot to detect tandem repeats in a sequence. Tandem repeat: two or more repeated units directly adjacent to each other. Example: CCCABCABCABCDDD. Tandem repeats are often used by evolution to create new proteins or to make them function more efficiently. A Short Tandem Repeat (STR) in DNA describes a pattern that helps determine an individual's inherited traits. A short tandem repeat polymorphism (STRP) occurs when homologous STR loci differ in the number of repeats between individuals. By identifying repeats of a specific sequence at specific locations in the genome, it is possible to create a genetic profile of an individual. There are currently over 10,000 published STR sequences in the human genome. STR analysis has become the prevalent analysis method for determining genetic profiles in forensic cases.

43 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet Self dot plot of CCCABCABCABCDDD (as drawn on the slide, the flanking C and D runs show only the main diagonal):

      C C C A B C A B C A B C D D D
    C x . . . . . . . . . . . . . .
    C . x . . . . . . . . . . . . .
    C . . x . . . . . . . . . . . .
    A . . . x . . x . . x . . . . .
    B . . . . x . . x . . x . . . .
    C . . . . . x . . x . . x . . .
    A . . . x . . x . . x . . . . .
    B . . . . x . . x . . x . . . .
    C . . . . . x . . x . . x . . .
    A . . . x . . x . . x . . . . .
    B . . . . x . . x . . x . . . .
    C . . . . . x . . x . . x . . .
    D . . . . . . . . . . . . x . .
    D . . . . . . . . . . . . . x .
    D . . . . . . . . . . . . . . x

44 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet Use a dot plot to detect tandem repeats in a sequence. Example input: tandem.fasta

46 Comparing Two Sequences: Dot plot Dot plot servers: Dotlet Reading tandem repeats from a dot plot:
1. The number of repeats is equal to the number of diagonals, including the main diagonal.
2. The distance between two adjacent diagonals represents the length of the repeat.
3. The shortest diagonal gives you a single repeat unit.

48 Comparing Two Sequences: Alignment An alignment is an arrangement of two protein or DNA sequences that identifies regions of similarity, which may be a consequence of functional, structural, or evolutionary relationships between the sequences. Global alignment is most useful when the two sequences are similar and of roughly equal size. Local alignment is more useful for dissimilar sequences that are suspected to contain segments of similarity.

49 Comparing Two Sequences: Alignment A substitution matrix such as BLOSUM62 gives a score for every pair of amino acids, defining what is similar and how similar.

50 Comparing Two Sequences: Alignment
Uses of global alignment: checking minor differences between two sequences; analyzing polymorphisms between closely related sequences; comparing two sequences that partly overlap.
Uses of local alignment: comparing two distantly related sequences that share only a few noncontiguous domains; analyzing repeated elements within a single sequence.

51 Comparing Two Sequences: Global Alignment How to generate a global alignment?
Input: Seq1: PYMNVI, Seq2: PYELF, a substitution matrix (BLOSUM62) and a gap penalty (-1 by default). The score of a residue vs. another residue is given by the substitution matrix; the gap penalty gives the score of a residue vs. a gap.
Output: which of the possible alignments?

    PYMNVI      PYMNVI      PYMNVI--
    PY-ELF  or  PYE-LF  or  PY---ELF  or ...?

52 Comparing Two Sequences: Global Alignment Seq1: PYMNVI Seq2: PYELF
Step 1: set up an empty matrix with Seq1 along the top and Seq2 down the side, each preceded by a gap (-).

        -   P   Y   M   N   V   I
    -
    P
    Y
    E
    L
    F

53 Comparing Two Sequences: Global Alignment Seq1: PYMNVI Seq2: PYELF
Step 2: fill the first row and the first column with accumulated gap penalties.

        -   P   Y   M   N   V   I
    -   0  -1  -2  -3  -4  -5  -6
    P  -1
    Y  -2
    E  -3
    L  -4
    F  -5

54 Comparing Two Sequences: Global Alignment Seq1: PYMNVI Seq2: PYELF
Step 3: fill every remaining cell with

    S(i, j) = max { S(i-1, j-1) + m(s1_i, s2_j),
                    S(i, j-1) + gap,
                    S(i-1, j) + gap }

Example for the cell of M (third residue of Seq1) and E (third residue of Seq2):

    S(3, 3) = max { S(2, 2) + m(s1_3, s2_3) = 14 + (-2) = 12,
                    S(3, 2) + gap = 13 + (-1) = 12,
                    S(2, 3) + gap = 13 + (-1) = 12 } = 12

55 Comparing Two Sequences: Global Alignment Seq1: PYMNVI Seq2: PYELF
Step 4: trace back from the lower right corner to the top left corner. There is at least one such path!
Output:

    seq1  PYMNVI
    seq2  PY-ELF
          **  :.
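Steps 1 to 4 can be sketched in Python as follows. This is an illustrative implementation of the recurrence shown on the slides, with a linear gap penalty of -1. The score table is a small subset of BLOSUM62 covering only the residue pairs of this example; the values were transcribed by hand and should be checked against an authoritative copy of the matrix before being reused. All names (`SUBSET`, `score`, `global_align`) are made up for this sketch.

```python
# BLOSUM62 scores for Seq1 residues (keys) against the Seq2 residues P, Y, E, L, F.
# Hand-transcribed subset for this example only (assumption: verify before reuse).
COLUMNS = "PYELF"
SUBSET = {
    "P": (7, -3, -1, -3, -4),
    "Y": (-3, 7, -2, -1, 3),
    "M": (-2, -1, -2, 2, 0),
    "N": (-2, -2, 0, -3, -3),
    "V": (-2, -1, -2, 1, -1),
    "I": (-3, -1, -3, 2, 0),
}

def score(a, b):
    """m(a, b): substitution score of Seq1 residue a against Seq2 residue b."""
    return SUBSET[a][COLUMNS.index(b)]

def global_align(seq1, seq2, score, gap=-1):
    """Global alignment with a linear gap penalty; returns (aligned1, aligned2, score)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # step 2: first column
        S[i][0] = i * gap
    for j in range(1, cols):          # step 2: first row
        S[0][j] = j * gap
    for i in range(1, rows):          # step 3: fill the matrix
        for j in range(1, cols):
            S[i][j] = max(S[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap)
    aligned1, aligned2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:             # step 4: trace back to the top left corner
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]):
            aligned1.append(seq1[i - 1]); aligned2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            aligned1.append(seq1[i - 1]); aligned2.append("-"); i -= 1
        else:
            aligned1.append("-"); aligned2.append(seq2[j - 1]); j -= 1
    return "".join(reversed(aligned1)), "".join(reversed(aligned2)), S[-1][-1]

top, bottom, total = global_align("PYMNVI", "PYELF", score)
print(top)     # PYMNVI
print(bottom)  # PY-ELF
print(total)   # 14
```

When several paths have the same score, this traceback simply prefers the diagonal move; other implementations may report a different but equally good alignment.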

56 Identity and Similarity Residue: a letter; an amino acid in a protein or a base in a DNA sequence. Identity: if two sequences (protein or DNA) have the same length, the identity between them is the percentage of identical residues relative to their length. Similarity: if two sequences (protein or DNA) have the same length, the similarity between them is the percentage of similar residues (including identical residues) relative to their length. Which residues count as similar is defined by a substitution matrix, such as BLOSUM. What happens when two sequences have different lengths?
seq 1: CVHKA
seq 2: CIHL
Identity? Similarity?
With the help of global alignment, we can now define identity and similarity for sequences of different lengths.

57 Redefinition of Identity and Similarity Identity: the identity between two sequences is the percentage of identical residues in their global alignment. Similarity: the similarity between two sequences is the percentage of similar residues (including identical residues) in their global alignment.
Global alignment:

    PYMNVI
    PY-ELF
    **  :.

Identity = 2 / 6 = 33.3%
Similarity = 3 / 6 = 50.0%
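The redefinition amounts to counting columns of the alignment. Here is a small sketch that reproduces the two percentages; the set of "similar" pairs is a hand-picked assumption standing in for a real substitution matrix, and `identity_and_similarity` is a name invented for this example.

```python
def identity_and_similarity(aligned1, aligned2, similar_pairs):
    """Return (identity %, similarity %) relative to the alignment length."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned1, aligned2):
        if "-" in (a, b):          # a residue facing a gap counts for neither
            continue
        if a == b:
            identical += 1
        elif frozenset((a, b)) in similar_pairs:
            similar += 1
    length = len(aligned1)
    return 100 * identical / length, 100 * (identical + similar) / length

# Assumption: only these pairs are treated as similar in this toy example.
SIMILAR = {frozenset("LI"), frozenset("LV")}

identity, similarity = identity_and_similarity("PYMNVI", "PY-ELF", SIMILAR)
print(f"{identity:.1f}% {similarity:.1f}%")        # 33.3% 50.0%
print(identity_and_similarity("CLHK", "CIHL", SIMILAR))  # (50.0, 75.0)
```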

58 Comparing Two Sequences: Local Alignment How to generate a local alignment?
Input: Seq1: PYMNVI, Seq2: MN, a substitution matrix (BLOSUM62) and a gap penalty (-1 by default). The score of an arbitrary residue vs. another arbitrary residue is given by the substitution matrix; the gap penalty gives the score of an arbitrary residue vs. a gap.
Output:

    PYMNVI      MN
    --MN--  or  MN  or ...?
      **        **

59 Comparing Two Sequences: Local Alignment Seq1: PYMNVI Seq2: MN
Each cell is filled with

    S(i, j) = max { 0,
                    S(i-1, j-1) + m(s1_i, s2_j),
                    S(i, j-1) + gap,
                    S(i-1, j) + gap }

and the first row and first column are all 0.
Output:

    MN
    MN
    **
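The local recurrence differs from the global one only by the extra 0 and by where the traceback starts and stops. The sketch below illustrates this with a deliberately simple scoring scheme (match +2, mismatch -1, gap -1) instead of BLOSUM62; these values and the function name `local_align` are assumptions made to keep the example short.

```python
def local_align(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (aligned1, aligned2, score) of the best-scoring segment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            m = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + m,
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    aligned1, aligned2 = [], []
    i, j = best_pos
    # Trace back from the highest cell and stop at the first 0.
    while i > 0 and j > 0 and S[i][j] > 0:
        m = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + m:
            aligned1.append(seq1[i - 1]); aligned2.append(seq2[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            aligned1.append(seq1[i - 1]); aligned2.append("-"); i -= 1
        else:
            aligned1.append("-"); aligned2.append(seq2[j - 1]); j -= 1
    return "".join(reversed(aligned1)), "".join(reversed(aligned2)), best

print(local_align("PYMNVI", "MN"))  # ('MN', 'MN', 4)
```

Because no cell can drop below 0, poorly matching flanks such as PY and VI never drag the score down, which is why only the MN segment is reported.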

60 Making Global Alignment Over the Internet BLAST is an abbreviation of Basic Local Alignment Search Tool. In a BLAST search, how is the most similar sequence found? Is the query sequence aligned to each sequence in the entire database? No. A BLAST search among 100,000 sequences takes 2 minutes, while computing 100,000 alignments would take more than 10,000 minutes. BLAST uses a heuristic algorithm.

61 Making Global Alignment Over the Internet EMBL Alignment Tool

63 Making Global Alignment Over the Internet Example input: global.fasta

68 Making Global Alignment Over the Internet EMBL Alignment Tool: a small Gap Open penalty with a large Gap Extend penalty gives many short, scattered gaps in the alignment.

69 Making Global Alignment Over the Internet A large Gap Open penalty with a small Gap Extend penalty gives a few long, concentrated gaps in the alignment.
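The effect of the two parameters can be checked with simple arithmetic. The sketch below assumes one common convention, in which a gap of n positions costs `gap_open + (n - 1) * gap_extend`; the exact convention of a given server may differ, and the numbers are arbitrary illustration values.

```python
def gap_cost(length, gap_open, gap_extend):
    """Penalty for one gap of `length` positions: open once, then extend."""
    if length < 1:
        raise ValueError("a gap has at least one position")
    return gap_open + (length - 1) * gap_extend

def total_gap_cost(gap_lengths, gap_open, gap_extend):
    """Total penalty for an alignment containing gaps of the given lengths."""
    return sum(gap_cost(n, gap_open, gap_extend) for n in gap_lengths)

one_long_gap = [4]               # four gap positions in a single block
four_short_gaps = [1, 1, 1, 1]   # the same four positions, scattered

# Small open, large extend: scattered gaps are cheaper.
print(total_gap_cost(one_long_gap, 1, 5), total_gap_cost(four_short_gaps, 1, 5))        # 16 4
# Large open, small extend: one concentrated gap is cheaper.
print(total_gap_cost(one_long_gap, 10, 0.5), total_gap_cost(four_short_gaps, 10, 0.5))  # 11.5 40.0
```

Since the aligner minimises the total penalty, it favours whichever gap layout is cheaper under the chosen settings.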

70 Making Global Alignment Over the Internet EMBL Alignment Tool: adjust Gap Open and Gap Extend according to the kind of gaps you expect.

71 Making Local Alignment Over the Internet EMBL Alignment Tool

72 Making Local Alignment Over the Internet EMBL Alignment Tool Example input: local.fasta

73 Making Local Alignment Over the Internet EMBL Alignment Tool
>Seq1
SEQUENCEMHHHHHHSSGVDLGTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNEL
IPNLEKEDGSILICLYESYFDPGKSISENIVSFIEKSYKSIFVLSPNFVQNEWCHYEFYFAH
HNLFHENSDHIILILLEPIPFYCIPTRYHKLKALLEKKAYLEWPKDRRKCGLFWANLRAAIN
>Seq2
GTENLYFQSMKTTQEQLKRNVRFHAFISYSEHDSLWVKNELIPNLEKEDGSILICLYESYFD
PGKEWCHYEFYFAHHNLFHENSDHIILILLEPIPFYCIPTRAAAAAAAAAAA

74 Difference between Global and Local Alignments

                    Global alignment     Local alignment
    Length          186                  130
    Identity        103/186 (55.4%)      103/130 (79.2%)
    Similarity      103/186 (55.4%)      103/130 (79.2%)

75 Free Pairwise Alignment over the Internet
Online pairwise alignment programs:

    Name       Alignment type
    EMBL       Global/Local
    PIR        Global
    Lalign     Global/Local
    LAGAN      Global
    AlignMe    Alignment of membrane proteins
    MCALIGN    Alignment of non-coding DNA sequences

URL: rch/pairwise.shtml ALIGN_m.html ndex.shtml ignme.html calign/mcinstructions.html

76 Multiple Sequence Alignment A multiple sequence alignment (MSA) is a global sequence alignment of three or more sequences.

77 Multiple Sequence Alignment Four main criteria for building a multiple sequence alignment:
Structural similarity: amino acids that play the same role in each structure are expected in the same column. This is very difficult; only structure-superimposition programs can satisfy this criterion.
Evolutionary similarity: amino acids that derive from the same amino acid in the common ancestor of all the sequences are put in the same column. No automatic program uses exactly this criterion, but they all try to respect it.
Functional similarity: amino acids with the same function are in the same column. Again, no automatic program uses exactly this criterion, but if the information is available, you can edit your alignment manually.
Sequence similarity: amino acids in the same column are those that yield an alignment with maximum similarity. Most programs use this criterion, because it is the easiest.

78 Multiple Sequence Alignment Main applications of MSA:
1. Extrapolation: deciding whether an uncharacterized sequence is really a member of a protein family.
2. Phylogenetic analysis: the phylogenetic tree of the aligned sequences can be reconstructed.
3. Pattern identification: highly conserved positions with a certain function can be used to generate a sequence pattern or a sequence logo.
4. Domain identification: turning an MSA into a profile (position-specific weight matrix) that describes a protein domain.

79 Multiple Sequence Alignment Main applications of MSA:
5. DNA regulatory elements: turning a DNA MSA of a binding site into a profile and scanning other DNA sequences for potential binding sites.
6. Structure prediction: predicting protein/RNA secondary structures by similarity.
7. nsSNP analysis: an MSA can help you predict whether a non-synonymous single-nucleotide polymorphism is likely to be harmful.
8. PCR analysis: a good multiple alignment can help you identify the less degenerate portions of a protein family, in order to fish out new members by PCR (polymerase chain reaction).

80 Choosing the Right Sequences An MSA should not be built from an arbitrary group of sequences. Instead, the sequences should be members of the same protein family that all share a common ancestor.

81 Choosing the Right Sequences Naming sequences in the right way:
Never use white space in your sequence names. Use the underscore (_) to replace spaces, e.g. "My Seq 1" becomes "My_Seq_1".
Do not use special symbols (such as Chinese characters, #, &, ^, etc.), e.g. "我的序列壹" or "Seq1@li.com".
Never use names longer than 15 characters, e.g. "This_is_my_favorite_sequence_about_mouse".
Never give the same name to two different sequences in your set.
If you don't obey these naming rules, some MSA programs may automatically change the names of your sequences without telling you.
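Before submitting a set of sequences, the naming rules above can be checked automatically. The following is a small sketch; `check_names` is a hypothetical helper, and "special symbol" is interpreted here as anything other than ASCII letters, digits and the underscore.

```python
import re

def check_names(names, max_length=15):
    """Return {name: [problems]} for every name that breaks one of the rules."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if re.search(r"\s", name):
            issues.append("contains white space")
        if re.search(r"[^A-Za-z0-9_\s]", name):
            issues.append("contains special symbols")
        if len(name) > max_length:
            issues.append(f"longer than {max_length} characters")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems

names = ["My_Seq_1", "My Seq 1", "Seq1@li.com",
         "This_is_my_favorite_sequence_about_mouse", "My_Seq_1"]
for name, issues in check_names(names).items():
    print(name, "->", ", ".join(issues))
```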

82 Choosing the Right Sequences Choosing the right number of sequences: start with a relatively small number of sequences (10-15), and increase the size of the set once something interesting shows up in this small set. In any case, it's hard to see any reason for generating an MSA with > 50 sequences. If you start with hundreds of sequences, you immediately run into trouble:
Computing big alignments is difficult.
Building big alignments is difficult.
Displaying big alignments is difficult.
Using big alignments is difficult.
Making accurate big alignments is difficult.

83 The most commonly used MSA packages. Before you start making multiple sequence alignments, you must know that none of the methods available today is perfect. They all use approximations. Aligning 2 sequences fills a 2D matrix, 3 sequences a 3D matrix, and n sequences an n-dimensional matrix.

84 The most commonly used MSA packages.
ClustalW: the most commonly used MSA package.
Tcoffee: one of the latest MSA packages.
MUSCLE: one of the fastest alignment methods.

85 The most commonly used MSA packages. ClustalW is the latest of the Clustal software series. Clustal was the first multiple sequence alignment program. These days, with more than 35,000 citations, ClustalW is one of the most widely cited scientific publications in the history of biology. ClustalW uses a progressive algorithm. This means that it adds sequences one by one, instead of aligning all the sequences at the same time.

86 The most commonly used MSA packages. A list of ClustalW servers:

    Name         Location
    EBI          Europe
    PIR          USA
    EMBnet       Europe
    BCM          USA
    GenomeNet    Japan
    DDBJ         Japan
    Strasbourg   Europe

URL: multialn.shtml alw.html /ClustalW

87 The most commonly used MSA packages. EMBL ClustalW

88 The most commonly used MSA packages. EMBL ClustalW Example input: msa.fasta (TIR domains of human TLR1-10)

91 The most commonly used MSA packages. EMBL ClustalW The sequences in the alignment are sorted by pairwise identity.

92 The most commonly used MSA packages. EMBL ClustalW Residue colours: Red: hydrophobic. Blue: acidic. Magenta: basic. Green: hydroxyl + amine + basic. Gray: others.

93 The most commonly used MSA packages. EMBL ClustalW Conservation symbols:
* (asterisk): an entirely conserved column.
: (double dot): a column where all the residues have roughly the same size and the same hydropathy.
. (single dot): a column where the size or the hydropathy has been preserved in the course of evolution.

95 The most commonly used MSA packages. EMBL ClustalW The guide tree is NOT a phylogenetic tree!

96 The most commonly used MSA packages. Tcoffee is a recent method developed for conducting multiple sequence alignments. It uses a principle that's a bit similar to ClustalW, but it yields more accurate alignments at the cost of a slightly longer running time. Tcoffee builds a progressive alignment like ClustalW, but it compares segments across the entire sequence set.

97 The most commonly used MSA packages. T-Coffee mirror sites: SIB, EBI, CNRS, Max-Planck, CBSU, EMBnet

98 The most commonly used MSA packages. Aside from its accuracy, the main distinguishing features of Tcoffee are its ability to align sequences and structures (EXPRESSO), the possibility of evaluating the accuracy of an alignment (CORE) and the possibility of combining many alternative multiple sequence alignments into one (Mcoffee).
Available tools:
TCOFFEE: produce a multiple sequence alignment with Tcoffee.
CORE: evaluate the reliability of an existing multiple alignment.
MCOFFEE: run any requested multiple sequence alignment package and combine all the output into one final alignment.
EXPRESSO: incorporate all the available structural information in your alignment. Will produce the best sequence alignments if the structures are available.

99 The most commonly used MSA packages. T-Coffee

100 The most commonly used MSA packages. Example input: msa.fasta (TIR domains of human TLR1-10)

103 The most commonly used MSA packages. T-Coffee Output files: fasta_aln file, score_html file, phylip file, clustalw_aln file

104 The most commonly used MSA packages. T-Coffee When you choose to store your data in a specific format, you must ask yourself four questions: Do most programs support this format? Will my collaborators be able to use it? Can I store all the information I need with this format? Is it easy to manipulate? If the program you're using doesn't produce alignments in the format you need, it is possible to use a third-party conversion tool, such as fmtseq, to get the format you want.

106 The most commonly used MSA packages. T-Coffee EXPRESSO is the latest development of Tcoffee, replacing what was known as 3D-Coffee. When you run Expresso, the program uses BLAST to search the PDB for structures whose sequences are similar to your sequences. It then uses these structures to guide the alignment. Alignments based on structures are expected to be much more accurate than simple sequence alignments.

108 The most commonly used MSA packages. T-Coffee Comparison of the EXPRESSO and T-Coffee results.

109 The most commonly used MSA packages. T-Coffee PDB ID

110 The most commonly used MSA packages. MUSCLE is a newcomer in the MSA area, but it is a remarkably efficient package for making fast, high-quality multiple sequence alignments. MUSCLE is ideal if you want to align several hundred sequences. Home page : m/muscle

111 The most commonly used MSA packages. MUSCLE

112 Searching conserved patterns One sentence summarizes what you really want from your multiple alignment: You want to identify important positions!

113 Searching conserved patterns Sequence Logos: WebLogo Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment, developed by Tom Schneider and Mike Stephens. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of the symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site.
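The two heights described above can be computed from an alignment column by column. The sketch below uses the usual information-content measure, log2(alphabet size) minus the Shannon entropy of the column, and ignores the small-sample correction that real logo tools apply; the function names are illustrative only.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Stack height in bits: log2(alphabet size) minus the column's Shannon entropy."""
    total = len(column)
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in Counter(column).values())
    return math.log2(alphabet_size) - entropy

def logo_heights(sequences, alphabet_size=4):
    """For each column, the height of every symbol: frequency times stack height."""
    heights = []
    for column in zip(*sequences):
        info = column_information(column, alphabet_size)
        total = len(column)
        heights.append({letter: info * count / total
                        for letter, count in Counter(column).items()})
    return heights

print(column_information("AAAA"))  # fully conserved DNA column: 2 bits
print(column_information("ACGT"))  # all four bases equally frequent: 0 bits
print(logo_heights(["TA", "TT"]))  # first column conserved, second split between A and T
```

For protein alignments the same functions can be used with `alphabet_size=20`.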

114 Searching conserved patterns Sequence Logos: WebLogo WebLogo is a web-based application designed to make the generation of sequence logos easy and painless. WebLogo has featured in over 150 scientific publications.

115 Searching conserved patterns Sequence Logos: WebLogo Example input: promoter.seqs

118 Searching conserved patterns Sequence Logos: WebLogo In the promoter region of genes, we usually find a special fragment called the TATA box (also called the Goldberg-Hogness box). The TATA box has the core DNA sequence 5'-TATAAA-3' or a variant. It is usually found as the binding site of RNA polymerase.

119 Searching conserved patterns Sequence Motifs: MEME Sequence motif: a nucleotide or amino-acid sequence pattern that is widespread and has a biological significance. Example: the N-glycosylation site motif is Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro. This pattern can be written as the regular expression N{P}[ST]{P}, where N=Asn, P=Pro, S=Ser, T=Thr; {X} means any amino acid except X; and [XY] means either X or Y. The notation [XY] does not give any indication of the probability of X or Y occurring in the pattern. Observed probabilities can be graphically represented using sequence logos.
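In the regular-expression syntax of most programming languages, {P} is written [^P]. The following sketch scans a protein sequence for the N-glycosylation motif; the test sequence is invented for the example, and a lookahead is used so that overlapping sites are also reported.

```python
import re

# N, then anything but P, then S or T, then anything but P.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sites(sequence):
    """Return (1-based position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]

# Made-up sequence with one true site (NGSA) and two near misses (NPTA, NKTP).
print(find_sites("AANGSAANPTAANKTP"))  # [(3, 'NGSA')]
```

NPTA is rejected because Pro follows the Asn, and NKTP because Pro occupies the last position.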

120 Searching conserved patterns Sequence Motifs: MEME The MEME Suite: motif-based sequence analysis tools. The MEME Suite allows you to: discover motifs in groups of related DNA or protein sequences, search sequence databases using motifs, and compare a motif to all motifs in a database of motifs. Home page :

121 Searching conserved patterns Sequence Motifs: MEME

122 Searching conserved patterns meme.seqs

123 Searching conserved patterns Sequence Motifs: MEME

124 Searching conserved patterns Sequence Motifs: MEME

125 Searching conserved patterns Sequence Motifs: MEME

126 Searching conserved patterns Sequence Motifs: MEME

127 Searching conserved patterns Sequence Motifs: MEME One sentence summarizes what you really want from your multiple alignment: You want to identify important positions!

128 Searching conserved patterns Human TLR 1-TIR Human TLR 2-TIR Human TLR 10-TIR BB-Loop The BB-loop is important for TIR domain dimerization and for interaction with downstream adaptors or inhibitors.

129 Editing and Publishing Alignments fasta_aln file score_html file phylip file clustalw_aln file

130 Editing and Publishing Alignments For editing and publishing a multiple sequence alignment, bioinformaticians have developed text editors that are specific to multiple sequence alignments. They make it easy for you to see exactly what's going on. Most of these editors require the installation of something on your computer. However, if you want to stick to your browser, you can use Jalview. Jalview is a Java applet that you need only load into your Web browser for instant action. Home page : Do not load confidential sequences! The Web interface is NOT secure.

131 Editing and Publishing Alignments EMBL ClustalW

132 Editing and Publishing Alignments Jalview

133 Editing and Publishing Alignments Jalview

134 Editing and Publishing Alignments Jalview run

135 Close ALL the windows that appear within the Jalview Window, as they only contain sample data

136 Editing and Publishing Alignments Jalview results.clustalw

137 Editing and Publishing Alignments

138 Editing and Publishing Alignments Jalview

139 Editing and Publishing Alignments Jalview

140 Editing and Publishing Alignments Jalview Colour -> Clustalx

141 Editing and Publishing Alignments Jalview Colour -> Clustalx

142 Editing and Publishing Alignments When you edit an alignment, what you usually want to do is modify several sequences collectively. To do this, you need to define them as a group, as follows: Keep the Ctrl key pressed while you click the names of sequences 1, 2, 3 and 4 to select them.

143 Editing and Publishing Alignments 1. Keep the Ctrl key pressed. 2. Put your mouse pointer right where you want to insert or remove the gap. 3. Drag to the left or to the right to shift your sequences. You can edit one sequence at a time by pressing the Shift key instead of Ctrl.

144 Editing and Publishing Alignments perform Pairwise Alignment for a pair of selected sequences

145 Editing and Publishing Alignments calculate tree for all selected sequences

146 Editing and Publishing Alignments predict secondary structure for a selected sequence

147 Editing and Publishing Alignments JNet Secondary Structure Prediction result

148 Editing and Publishing Alignments save your alignment as a text/picture

149 Editing and Publishing Alignments Showtime has finally come: You have the multiple alignment you want, and you're determined to show the world!

150 Editing and Publishing Alignments Multiple Alignment Beautifying Tools
JalView: a multiple alignment editor written in Java
Boxshade: shading in black and white
ESPript: a very powerful shading-and-coloring tool
MView: adding optional HTML markup to control coloring and web page layout

151 exercise.fasta Can you make an MSA for these 5 protein sequences? Which two sequences are the most similar? How similar are they (i.e. what is their sequence identity)? What kind of proteins are they?
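Once the five sequences are aligned, the "most similar pair" question can be answered by computing the identity of every pair of aligned rows. The sketch below is one possible way to do that, under two assumptions that are not part of the exercise: identity is computed only over columns in which neither sequence has a gap, and the toy alignment merely stands in for the real contents of exercise.fasta. The function names are hypothetical.

```python
from itertools import combinations

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither
    sequence has a gap (one of several possible conventions)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0

def most_similar_pair(alignment):
    """Return the pair of sequence names with the highest percent identity."""
    return max(
        combinations(alignment, 2),
        key=lambda pair: percent_identity(alignment[pair[0]], alignment[pair[1]]),
    )

# Toy alignment, invented for illustration only (not exercise.fasta).
toy_alignment = {"seq1": "CLHK", "seq2": "CIHL", "seq3": "C-HK"}

for name_a, name_b in combinations(toy_alignment, 2):
    identity = percent_identity(toy_alignment[name_a], toy_alignment[name_b])
    print(f"{name_a} vs {name_b}: {identity:.1f}% identity")
print("most similar:", most_similar_pair(toy_alignment))
```

For seq1 and seq2 this reproduces the 2/4 = 50% identity of the CLHK/CIHL example from the Identity and Similarity slides. Other conventions, such as dividing by the full alignment length, give different percentages, so the convention used should always be stated.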


Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

Emily Blanton Phylogeny Lab Report May 2009

Emily Blanton Phylogeny Lab Report May 2009 Introduction It is suggested through scientific research that all living organisms are connected- that we all share a common ancestor and that, through time, we have all evolved from the same starting

More information

Moreover, the circular logic

Moreover, the circular logic Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi) Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction Lesser Tenrec (Echinops telfairi) Goals: 1. Use phylogenetic experimental design theory to select optimal taxa to

More information

Introduction to protein alignments

Introduction to protein alignments Introduction to protein alignments Comparative Analysis of Proteins Experimental evidence from one or more proteins can be used to infer function of related protein(s). Gene A Gene X Protein A compare

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES Molecular Biology-2018 1 Definitions: RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES Heterologues: Genes or proteins that possess different sequences and activities. Homologues: Genes or proteins that

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information