Bioinformatics Workshop - NM-AIST

1 Bioinformatics Workshop - NM-AIST Day 1 Sequence Alignments and Searching Thomas Girke July 23, 2012 Day 1, Sequence Alignments and Searching Slide 1/80

2 Outline
- Introduction into Bioinformatics and Genome Biology
- DNA Sequencing Basics
  - Traditional DNA Sequencing
  - Next Generation Sequencing
- Pairwise Sequence Alignments
  - Global Alignment
  - Local Alignment
- Multiple Sequence Alignments
- Sequence Similarity Searching
- Online Exercise
- References and Books

4 What is Bioinformatics?
Genomics: The study of entire genomes with modern large-scale analytical techniques. The term is often extended to other genome-wide omics areas such as transcriptomics, proteomics, metabolomics and structural genomics.
Bioinformatics: The analysis of biological information using computational and statistical techniques. This includes data sets from genomics and all other genome-wide study areas.
Focus of this Workshop: A broad overview of important bioinformatics approaches, including genome sequencing, database techniques, structural, comparative and evolutionary genomics, microarray analysis, next generation sequencing and small molecule analysis.

5 Bioinformatics = Inferring Knowledge from Data Day 1, Sequence Alignments and Searching Introduction into Bioinformatics and Genome Biology Slide 5/80

6 Organization and Structure of Genomes
Genomes -> Chromosomes -> Genes -> mRNA -> Proteins

7 From Genomes to Genes Day 1, Sequence Alignments and Searching Introduction into Bioinformatics and Genome Biology Slide 7/80

8 Genome Sizes
Organism                              Genome Size (bp)   Genes
Phage Phi-X 174 (virus)               ...                ...
Escherichia coli (bacterium)          ...                ...,000
Saccharomyces cerevisiae (yeast)      ...                ...,000
Arabidopsis thaliana (plant)          ...                ...,000
Fritillaria assyriaca (plant)         ...                NA
Caenorhabditis elegans (nematode)     ...                ...,500
Homo sapiens                          ...                ...,000
Only approximate numbers are given in this table. The numbers usually vary with the quality of the genome annotations. The most accurate numbers can be found in genome databases (e.g. NCBI).

9 Chromatin Chromatin: complex of DNA and protein that makes up chromosomes. Histones are the major protein component. Functions: DNA packaging in mitosis and meiosis, and to serve as a mechanism to control expression and DNA replication. Day 1, Sequence Alignments and Searching Introduction into Bioinformatics and Genome Biology Slide 9/80

10 Nucleosome Structure
Nucleosome crystal structure [Luger et al 1997]
Nucleosome core particle: 147 base pairs of DNA wrapped around a histone octamer consisting of two copies each of the histones H2A, H2B, H3 and H4.
Histone H1 links different nucleosomes together to compact the chromatin structure.

11 From Genes to mRNAs to Proteins to Compounds

12 Genetic Code and Amino Acids
Name            Abbrev.  Codons                    Properties
Alanine         Ala A    GCT GCC GCA GCG           nonpolar
Arginine        Arg R    CGT CGC CGA CGG AGA AGG   pos charged
Asparagine      Asn N    AAT AAC                   polar -CONH2
Aspartic acid   Asp D    GAT GAC                   neg charged
Cysteine        Cys C    TGT TGC                   polar -SH
Glutamine       Gln Q    CAA CAG                   polar -CONH2
Glutamic acid   Glu E    GAA GAG                   neg charged
Glycine         Gly G    GGT GGC GGA GGG           nonpolar
Histidine       His H    CAT CAC                   pos charged
Isoleucine      Ile I    ATT ATC ATA               nonpolar
Leucine         Leu L    TTA TTG CTT CTC CTA CTG   nonpolar
Lysine          Lys K    AAA AAG                   pos charged
Methionine      Met M    ATG (START codon)         nonpolar
Phenylalanine   Phe F    TTT TTC                   nonpolar
Proline         Pro P    CCT CCC CCA CCG           nonpolar
Serine          Ser S    TCT TCC TCA TCG AGT AGC   polar -OH
Threonine       Thr T    ACT ACC ACA ACG           polar -OH
Tryptophan      Trp W    TGG                       nonpolar
Tyrosine        Tyr Y    TAT TAC                   polar -OH
Valine          Val V    GTT GTC GTA GTG           nonpolar
STOP                     TAA TAG TGA
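The codon table above can be turned into a small translation routine. The following is a minimal sketch in Python; the function name `translate`, the `to_stop` parameter and the compact table encoding (bases enumerated in the order T, C, A, G) are illustrative choices and not part of the workshop material. It assumes an unambiguous DNA sequence read in frame 1.

```python
from itertools import product

# Standard genetic code with codons enumerated in the order T, C, A, G
# at each position; '*' marks the three STOP codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=True):
    """Translate a DNA coding sequence (frame 1) into one-letter amino acids."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError for non-ACGT letters
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTTGGTAA"))  # prints MAW
```

With `to_stop=False` the STOP codon is reported as `*` instead of ending the translation.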

13 Terminology of Gene Elements and Important Processes
- Genome: hereditary information of an organism encoded in DNA (or RNA); contains genes and intergenic regions
- Gene: transcribed region; coding and non-coding genes (e.g. miRNA)
- mRNA: messenger RNA
- TU: transcriptional unit
- UTR: untranslated region
- ORF: open reading frame reaching from START to STOP codon
- CDS: coding sequence
- Promoter: regulatory region controlling gene expression
- Transcription: formation of mRNA
- Splicing: removal of introns, which converts pre-mRNA to mRNA
- Translation: conversion of mRNA to protein

14 Important Regulatory Processes
- Transcriptional control: promoter elements (DNA), regulatory RNAs, transcription factors (proteins)
- Post-transcriptional control: mRNA turnover and availability
- Translational control: translational control factors (proteins and RNAs)
- Post-translational control: modulation of protein activity by modifications (e.g. phosphorylation by kinases)
- Control by small RNAs: miRNAs and siRNAs controlling gene expression on transcriptional and translational levels

15 What is Measured by Omics Technologies? Day 1, Sequence Alignments and Searching Introduction into Bioinformatics and Genome Biology Slide 15/80

17 What are we sequencing? DNA sequencing is more efficient, because: DNA cloning and amplification is easy. Availability of efficient enzymatic sequencing reactions. Protein sequencing is much harder, because: No cloning or amplification techniques are available. Limited availability of enzymatic sequencing techniques. Chemical nature of proteins makes sequencing difficult. Day 1, Sequence Alignments and Searching DNA Sequencing Basics Slide 17/80

18 DNA Libraries
A DNA library consists of cloned DNA fragments that can represent the entire genome of an organism (genomic DNA library) or only its mRNA sequences (cDNA library).
Genomic library
- Often contains the entire DNA content of an organism.
- Suitable for determining genomic DNA sequence.
- Requires chromosomal DNA isolation.
cDNA library
- Contains the mRNAs that are expressed in a tissue sample.
- mRNA is used as starting material.
- mRNA needs to be reverse transcribed into cDNA.
- Requires mRNA isolation.
Challenges: cDNA libraries tend to be incomplete with regard to:
- 5' sequences
- Representation of all genes in the genome

20 Chemical Sequencing by Maxam & Gilbert
1 Uses radioactively labeled DNA fragments of 500 bp.
2 Four separate chemical treatments generate DNA breaks at the positions: G, A+G, C, C+T.
3 The fragments are size-separated by gel electrophoresis in four separate lanes.
4 The fragments are visualized by autoradiography on an X-ray film.
[Figure panels: Chemical DNA Degradation; Gel Electrophoresis]

21 Sanger Dideoxy or Chain-Termination Sequencing Is based on an enzymatic DNA elongation reaction that is randomly terminated with dideoxynucleotides. Reaction mix: Single-stranded DNA template, DNA primer, DNA polymerase, radioactively or fluorescently labeled nucleotides Dideoxynucleotides to terminate the DNA strand elongation Day 1, Sequence Alignments and Searching DNA Sequencing Basics Traditional DNA Sequencing Slide 21/80

22 Required Sequencing Reactions for Human Genome
- 3,000,000,000 / 800 = 3,750,000 dye-terminator sequencing reactions with a read length of 800 bp are necessary to obtain a sequence set with the same number of bases as the human genome (3,000,000,000 bp).
- The 3,750,000 sequences of 800 bp provide less than one-fold coverage of the genome.
- At least 10-fold coverage is necessary to complete a genome sequence, which totals 37,500,000 sequences of 800 bp.
- With a 384-capillary sequencer one can obtain the required sequences with 37,500,000 / 384 ≈ 97,656 runs.
- Assuming one sequencing run per day, a single capillary sequencer needs 97,656 / 365 ≈ 268 years to finish the human genome, or 268 sequencers can finish it in one year.
- The cost of a capillary sequencer is around $500,000. Including reagent and labor costs, the human genome sequence was a $1.5-3 billion investment.
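The arithmetic on this slide can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope estimate; the constant names are illustrative, and the values are the ones stated on the slide.

```python
GENOME_SIZE = 3_000_000_000  # bp, approximate size of the human genome
READ_LENGTH = 800            # bp per dye-terminator read
COVERAGE = 10                # required fold coverage
CAPILLARIES = 384            # reads produced per sequencer run
RUNS_PER_YEAR = 365          # assumes one run per day

reads_1x = GENOME_SIZE / READ_LENGTH   # reads matching the genome size once
reads_total = reads_1x * COVERAGE      # reads needed for 10-fold coverage
runs = reads_total / CAPILLARIES       # sequencer runs required
years = runs / RUNS_PER_YEAR           # years on a single sequencer

print(f"reads for 1x coverage:  {reads_1x:,.0f}")
print(f"reads for 10x coverage: {reads_total:,.0f}")
print(f"sequencer runs:         {runs:,.0f}")
print(f"years on one sequencer: {years:.0f}")
```

The exact quotients are 97,656.25 runs and about 267.5 years, which the slide rounds to 268.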

24 Example: Illumina/Solexa Technology Illumina Sequencer Flow Cell Day 1, Sequence Alignments and Searching DNA Sequencing Basics Next Generation Sequencing Slide 24/80

25 Basic Steps of Illumina/Solexa Sequencing Technology
Compare with the illustrations on the following slides.
Flow Cell Loading
1 Generate DNA library (genomic- or cDNA-based) with insert length of 200 bp.
2 Load library onto flow cell (nano device for liquid handling).
3 PCR-based bridge amplification of loaded fragments to obtain DNA clusters (serves as signal amplification).
Sequencing Cycles
4 Start reversible dye-terminator reaction containing primer and labeled dNTPs among other components.
5 Image scan to detect the identity of the first base of each cluster via the characteristic fluorescence signal of each labeled nucleotide.
6 De-protection step removes the blocking group and the fluorescence group of the incorporated nucleotide.
7 Repeat steps 4-6 once for each further base to be read.

26 Loading of Flow Cell Day 1, Sequence Alignments and Searching DNA Sequencing Basics Next Generation Sequencing Slide 26/80

27 Sequencing Cycles Day 1, Sequence Alignments and Searching DNA Sequencing Basics Next Generation Sequencing Slide 27/80

28 Processing of Sequencing Raw Data
- Assign quality scores to image intensity values or peak areas.
- The frequently used Phred scores are log10-transformed error probabilities, Q = -10 log10(P):
  score = 20 corresponds to a 1% error rate
  score = 30 corresponds to a 0.1% error rate
  score = 40 corresponds to a 0.01% error rate
- The base calling (A, T, G or C) is performed based on Phred scores.
- Ambiguous positions with Phred scores ≤ 20 are often assigned N.
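The relation between Phred scores and error rates can be checked with a short sketch. The function names and the masking cutoff parameter are hypothetical helpers for illustration; real base callers apply more elaborate rules.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mask_low_quality(bases, quals, cutoff=20):
    """Replace every base whose score is at or below the cutoff with N."""
    return "".join("N" if q <= cutoff else b for b, q in zip(bases, quals))


for q in (20, 30, 40):
    print(q, f"{phred_to_error(q):.2%}")

print(mask_low_quality("ACGT", [40, 20, 35, 10]))  # prints ANGN
```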

29 Sequence Assembly
[Figure: assembly workflow. Genomic Library -> Sequencing -> Shotgun Sequences -> Assembly by Alignment -> Contigs -> Scaffolds (contig sequences joined by NNN gaps) -> Additional Data and Finishing Steps -> Chromosome]

31 Illustration of Sequence Alignment Process
Goal: maximize the number of identical and similar residues in the columns of the alignment.
Unaligned Sequences
P10632    LKNLNTTAVFMPFSAGKRICAGEGLARMELFGGLFLTTILQNFNLKSVDD
P08686    LAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPSGD
Identify alignment start, insert gaps if necessary
Aligned Sequences
P10632    LKNLNTTAVFMPFSAGKRICAGEGLARMELFGGLFLTTILQNFNLKSVDD
P08686    ..........LAFGCGARVCLGEPLARLELF..VVLTRLLQAFTLLPSGD
consensus ............f..g.r.c.ge.lar.elf....lt..lq.f.l....d
[Figure: sequence logo of the alignment]

32 Why Compare Sequences?
Background
- The evolution of biological sequences is mainly driven by gene duplications, point mutations, insertions and deletions.
- Alignment algorithms are the central tool to detect these events and to perform sequence similarity analyses in general.
Utilities
- Functional analyses: Conserved sequence regions are functionally important.
- Evolutionary analyses: Sequence divergence patterns can be used to reconstruct phylogenetic relationships.
- Mutation and SNP analyses
- Comparative genomics
- Sequence similarity searching is based on alignment methods.
- Many other utilities

33 Why Gapped Alignments? Sequences evolve by complex mutation processes Gene duplications* Gene deletions* Point mutations Substitutions Insertions* Deletions* *require gaps Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Slide 33/80

34 Sequence Alignment Concepts
String Matching (lacks gaps of alignment approach)
HEAGAWGHEE
    AWGHE
Global Pairwise Alignment
HEAGAWGHE-E
--P-AW-HEAE
Local Pairwise Alignment
AWGHE
AW-HE
Multiple Alignment
HEAGAWGHE-E
HDACAWGHE-E
HDAC-WGHE-E
HD-CSTGHE-E
--P-AW-HEAE

35 Important Steps in Pairwise Alignment Process 1 Type of alignment 2 Scoring system 3 Algorithm to find best scoring alignment 4 Statistics to evaluate significance of alignment score Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Slide 35/80

36 Best Sequence Type for Alignments Divergent Sequences If available, use protein sequences, because of higher information content, better scoring system, reliability of alignment, functional constraints, etc. Similar Sequences Protein or DNA sequence depending on analysis needs. Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Slide 36/80

37 Scoring Parameters for Alignments
Substitution matrix: Empirically determined rates at which one residue in a sequence changes to another residue over time. These substitution rates are typically expressed as log-odds scores:
$$ s_{i,j} = \log \frac{p_i M_{i,j}}{p_i p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}} $$
$M_{i,j}$ = probability of amino acid i transforming into amino acid j; $p_i$ = frequency of amino acid i.
Gap Opening Penalty: Penalty score for gap insertion. Often a severe value to minimize the number of gaps.
Gap Extension Penalty: Penalty score for gap extension. Often a severe value to minimize the length of gaps.

38 Scoring or Substitution Matrices
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix): Based on a functional model for analyzing divergent protein sequences [Henikoff & Henikoff 1992]. The log-odds scores were obtained from the substitution probabilities in conserved and gap-less regions of protein families in the BLOCKS database, one for each of the possible substitutions of the 20 standard amino acids. Matrices with low numbers (e.g. BLOSUM50) are for divergent sequences, and matrices with high numbers (e.g. BLOSUM80) are for more closely related sequences.
PAM (Point Accepted Mutation Matrix): Based on an evolutionary model for analyzing protein sequences [Dayhoff et al 1978]. The mutations are considered throughout the global alignment, in conserved and unconserved regions of many well studied protein families. PAM matrices with higher numbers are for studying evolutionarily distant sequences, while PAMs with smaller numbers are for more closely related sequences (the opposite of the BLOSUM numbering).
Matrices for DNA and RNA alignments: Often simple scoring matrices are used where matches have a positive match score, mismatches a negative mismatch score, and gaps a negative gap penalty.

39 Substitution Matrix BLOSUM50 (NCBI Matrix Download)
[Table: BLOSUM50 scores with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X]
B (Asx): aspartic acid, asparagine; Z (Glx): glutamic acid, glutamine; X (Xaa): other amino acid

40 Gap Penalties
To build high-quality alignments, it is important to control the number and the length of the gaps that are introduced during the alignment process. This is achieved by selecting gap penalties for the alignment scoring, which are usually negative values.
Constant gap penalty: Every gap receives the same penalty independent of its size.
Linear gap penalty: Has only one parameter (d); the penalty grows linearly with the length of the gap. Disadvantage: the overall penalty for one large gap is the same as for many small gaps that add up to the same length.
Affine gap open and extension penalties [most commonly used!]: Attempt to overcome the problem of the linear gap penalty by using a gap opening penalty (o) and a gap extension penalty (e). Their values are often set so that gap insertions are discouraged and longer gaps are favored over many short gaps.
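The three penalty schemes can be compared numerically. The sketch below assumes one common convention for the affine penalty, -(o + (g - 1) e) for a gap of length g; the function names and the parameter values (8, 12 and 2) are illustrative only.

```python
def constant_gap(length, c=8):
    """Same penalty for every gap, independent of its length."""
    return -c if length > 0 else 0


def linear_gap(length, d=8):
    """Penalty proportional to the gap length."""
    return -d * length


def affine_gap(length, o=12, e=2):
    """Opening penalty o for the first position, extension penalty e after that."""
    return -(o + (length - 1) * e) if length > 0 else 0


one_long_gap = [4]             # a single gap of length 4
many_short_gaps = [1, 1, 1, 1] # four gaps of length 1

for name, penalty in [("linear", linear_gap), ("affine", affine_gap)]:
    long_total = sum(penalty(g) for g in one_long_gap)
    short_total = sum(penalty(g) for g in many_short_gaps)
    print(f"{name}: one long gap = {long_total}, four short gaps = {short_total}")
```

With these values the linear scheme scores both cases identically (-32), while the affine scheme penalizes the single long gap (-18) much less than four separate gaps (-48).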

42 Global Alignment: Needleman-Wunsch Algorithm Initial algorithm [Needleman and Wunsch 1970] Improved version [Gotoh 1982] Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Global Alignment Slide 42/80

43 Dynamic Programming Alignment Algorithm
It is impossible to calculate all possible alignments. The number of possible alignments is
$$ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} $$
where n = length of both sequences.
Solution: dynamic programming algorithm. It finds an optimal alignment between two sequences with an additive scoring system.
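To get a feeling for how quickly this number grows, the exact count and its approximation can be evaluated for a few sequence lengths. This is a small sketch; the function names are illustrative.

```python
import math


def exact_alignment_count(n):
    """Exact value of (2n)! / (n!)^2 for two sequences of length n."""
    return math.comb(2 * n, n)


def approx_alignment_count(n):
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (10, 50, 100):
    exact = exact_alignment_count(n)
    approx = approx_alignment_count(n)
    print(f"n={n:3d}  exact={exact:.3e}  approx={approx:.3e}  ratio={approx / exact:.4f}")
```

For n = 10 the exact count is already 184,756, and the approximation is within about 1.3% of it; the relative error shrinks as n grows.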

44 Main Steps in Dynamic Programming Alignment Algorithms Recurrence rules: dynamic programming matrix Boundary conditions: gaps, termination and extensions Traceback step: optimal alignment Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Global Alignment Slide 44/80

45 Global Sequence Alignment Algorithm
The dynamic programming approach builds an optimal alignment stepwise from the solutions of optimal sub-alignments. This is achieved with the following steps:
1.1 Construct the dynamic programming matrix F(i, j), where the rows i and columns j represent the residues of the two sequences x and y.
2.1 Fill the matrix from the top left to the bottom right with the largest score of the three possible solutions. The first cell is initialized with F(0, 0) = 0.
$$ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases} $$
F(i, j) = additive score of the best sub-solution; s(x_i, y_j) = substitution score; d = gap score (gap penalty)
2.2 Boundary rows and columns are filled with F(i, 0) = -id and F(0, j) = -jd.
2.3 Apply the operation repeatedly to the bottom right corner of each square of four cells (see next slide).
2.4 In each step store a pointer for each cell back to the cell from which F(i, j) was derived.
2.5 The value in the final bottom right cell is the final score of the alignment.
3.1 Generate the alignment by the traceback method: starting from the final cell, move back along the stored pointers and align residues according to the movement directions.
- Diagonal movement: align corresponding residues
- Up or left movement: insert gaps accordingly
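The steps above can be sketched in Python as follows. This is a minimal illustration with a linear gap penalty; the function name, the default toy scoring (+1 match, -1 mismatch) and d = 1 are assumptions made to keep the example self-contained. Reproducing the BLOSUM50 example of the slides would require passing a `score` function backed by that matrix.

```python
def needleman_wunsch(x, y, score=None, d=1):
    """Global alignment with linear gap penalty d.

    Returns (final score, aligned x, aligned y).
    """
    if score is None:
        # Assumed toy scoring, not BLOSUM50: +1 match, -1 mismatch.
        score = lambda a, b: 1 if a == b else -1
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Boundary conditions: F(i, 0) = -i*d and F(0, j) = -j*d.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill the matrix and store a pointer to the cell each value came from.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom right cell to the top left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))  # prints (2, 'ACGT', 'A-GT')
```

When several candidates tie, this sketch prefers the diagonal move; other tie-breaking rules give different but equally scoring alignments.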

46 Three Possibilities: Align, Insert or Delete
[Figure: square of four cells. F(i, j) is reached from F(i-1, j-1) by adding s(i, j), from F(i-1, j) by subtracting d, or from F(i, j-1) by subtracting d.]
F(i, j) = largest of the 3 possible solutions

47 Dynamic Programming Matrix: Global Alignment
Substitution matrix: BLOSUM50
Gap opening and extension penalties: 8
Sequences: HEAGAWGHEE and PAWHEAE
[Figure: filled dynamic programming matrix with traceback path]
Optimal alignment:
HEAGAWGHE-E
--P-AW-HEAE
Final Score = 1

48 Complexity of Algorithm
Time and memory cost: O(nm), i.e. of the order of the product nm of the lengths n and m of the two sequences.

50 Local Alignment: Smith-Waterman Algorithm Often more important than global alignment, because related sequences show frequently only local similarities. Initial algorithm [Smith and Waterman 1981] Improved version [Gotoh 1982] Day 1, Sequence Alignments and Searching Pairwise Sequence Alignments Local Alignment Slide 50/80

51 Local Sequence Alignment Algorithm
The algorithm is closely related to the global alignment approach, with the following modifications:
1.1 Same as global alignment.
2.1 One more possible solution is added to F(i, j). If all other solutions are less than zero, then F(i, j) is set to 0:
$$ F(i,j) = \max \begin{cases} 0, \\ F(i-1,j-1) + s(x_i, y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases} $$
F(i, j) = additive score of the best sub-solution; s(x_i, y_j) = substitution score; d = gap score (gap penalty)
2.2 Boundary rows and columns are filled with zeros.
2.3 Taking the value zero corresponds to starting a new alignment.
2.4 Alignments can start and end anywhere in the matrix.
3.1 Best local alignment: traceback from the highest score in the matrix until the first cell with zero is reached. The value in the initial traceback cell is the alignment score.
Important requirement: random matches must receive negative values from the scoring system, otherwise long unrelated matches would mask significant local matches!
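A minimal Python sketch of these modifications is shown below. As before, the function name, the toy scoring (+1 match, -1 mismatch) and d = 1 are assumptions for illustration; the example on the following slide uses BLOSUM50, which is not included here.

```python
def smith_waterman(x, y, score=None, d=1):
    """Local alignment with linear gap penalty d.

    Returns (best score, aligned x segment, aligned y segment).
    """
    if score is None:
        # Assumed toy scoring, not BLOSUM50: +1 match, -1 mismatch.
        score = lambda a, b: 1 if a == b else -1
    n, m = len(x), len(y)
    # Boundary rows and columns stay at zero.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),  # start a new alignment here
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest scoring cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while F[i][j] > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # prints (3, 'ACG', 'ACG')
```

Only the shared segment ACG is reported; the unrelated flanking residues are ignored, which is the key difference from the global algorithm.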

52 Dynamic Programming Matrix: Local Alignment
Substitution matrix: BLOSUM50
Gap opening and extension penalties: 8
Sequences: HEAGAWGHEE and PAWHEAE
[Figure: filled dynamic programming matrix with traceback path]
Highest ranking alignment:
AWGHE
AW-HE
Score of highest ranking alignment = 28

53 Additional Algorithms for Pairwise Alignments
- Repeat alignment algorithm: obtains all non-overlapping local alignments with significant scores.
- Maximum overlap match algorithm: global alignment algorithm that does not penalize overhanging ends, as needed in the sequence assembly problem.
- Algorithms allowing long gaps: aligning cDNA (no introns) to genomic DNA (introns).
- Many more algorithms for specific alignment problems

55 Multiple Sequence Alignments
Example: Five Human P450 Hydroxylases
P10632    ETTSTTLRYGLLLLLKHPEVTAKVQEEIDHVIGRHRSPCM...QDRSHMPYTDAV 351
P11712    ETTSTTLRYALLLLLKHPEVTAKVQEEIERVIGRNRSPCM...QDRSHMPYTDAV 351
P08686    ETTANTLSWAVVFLLHHPEIQQRLQEELDHELGPGASSSRVPYKDRARLPLLNAT 348
O15528    DTVSNTLSWALYELSRHPEVQTALHSEITAALSPG.SSAYPSATVLSQLPLLKAV 373
P08684    ETTSSVLSFIMYELATHPDVQQKLQEEIDAVLP...NKAPPTYDTVLQMEYLDMV 359
consensus ETTS.TLs.al..Ll.HPEVq.klQEEId.vlg...S...drs.mPyldAV

P10632    VHEIQRYSDLVPTGVPHAVTTDTKFRNYLIPKGTTIMALLTSVLHDDKEFPNPNI 406
P11712    VHEVQRYIDLLPTSLPHAVTCDIKFRNYLIPKGTTILISLTSVLHDNKEFPNPEM 406
P08686    IAEVLRLRPVVPLALPHRTTRPSSISGYDIPEGTVIIPNLQGAHLDETVWERPHE 403
O15528    VKEVLRLYPVVP.GNSRVPDKDIHVGDYIIPKNTLVTLCHYATSRDPAQFPEPNS 427
P08684    VNETLRLFPIAM.RLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEK 413
consensus V.EvlRl.p.vP..lph..t.D...Y.IPKGT.i...l...D.k.fp.P..

P10632    FDPGHFLDKNGNFKKSDYFMPFSAGKRICAGEGLARMELFLFLTTILQNFNLKSV 461
P11712    FDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSL 461
P08686    FWPDRFLEPGKNSRA...LAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPS 454
O15528    FRPARWLGEGPTP.HPFASLPFGFGKRSCMGRRLAELELQMALAQILTHFEVQP. 480
P08684    FLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPC 468
consensus F.P.rFL..g.n...PFg.GkR.C.Ge.LA.mELfl.Lt.iLQnF.lkp.

PROSITE P450 signature: [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]

56 Why Multiple Sequence Alignments?
- Conserved residues can only be identified in the context of many sequences.
- Identification of functional residues, motifs and domains.
- Often the final proof that sequences belong to one family.
- Analysis of evolutionary relationships
- Phylogenetic analyses
- Threading and homology modeling
- Functional mapping of mutations
- Many more utilities

57 Typical Sequence Analysis Routine Sequence candidate Sequence & Domain Database Searches Multiple Sequence Alignments Phylogenetic Tree Homology Modeling Day 1, Sequence Alignments and Searching Multiple Sequence Alignments Slide 57/80

58 Feng-Doolittle Progressive Multiple Alignment
1 Calculate all pairwise alignments.
2 Use the scores to calculate a pairwise distance matrix.
3 Construct a guide tree based on the distances using a fast clustering algorithm.
4 Align sequences in the order defined by the tree, moving from the leaves to the root.
5 Alignments are constructed by the pairwise dynamic programming algorithm.
6 When sequences or alignments are aligned to existing alignments, the highest scoring pairwise alignment defines the global alignment. Existing gaps will never be removed: once a gap, always a gap.

59 Additional Examples CLUSTALW: Most widely used multiple alignment program developed by [Thompson et al 1994]. Multalin: uses iterative hierarchical clustering for guide tree formation [Corpet 1988]. T-Coffee: reevaluates sequence alignments in each iteration of progressive alignment approach [Notredame et al 2000]. Dialign: no gap penalty to identify sequence similarities with long gaps [Morgenstern et al 1998]. Day 1, Sequence Alignments and Searching Multiple Sequence Alignments Slide 59/80

60 Alignment Programs
Pairwise Alignment Programs
- Smith-Waterman local alignment: WATER (EMBOSS)
- BLAST-like local alignments: BLAST2
- Global alignments: NEEDLE (EMBOSS)
Multiple Alignment Programs
- ClustalW multiple alignment: EMMA
- Hierarchical clustering: MultAlin
- For diverse sequences: MSA from NCBI
- For diverse sequences: T-Coffee
- For diverse sequences: MUSCLE
- For diverse sequences: HMMER (hmmalign)
- For local similarities (long gaps): DIALIGN
- DNA alignment guided by protein alignment: TRANALIGN
- Align cDNAs to genome: EST2GENOME

62 Why Sequence Similarity Searching?
Essential tool for retrieving related sequences from databases by providing protein or DNA sequences as queries.
Applications
- Similarity-function principle to predict gene functions: If the function of a query sequence is unknown, and a sequence similarity search retrieves highly similar sequences of known function, then it is likely that the query sequence has a similar function.
- Principle of relatedness for evolutionary analyses: Sequence similarities can be used to reconstruct phylogenetic relationships, for example to identify sequences with a common ancestor, such as orthologs and paralogs.
- Discovery of new genes or proteins.
- Exploring gene and protein structures.

63 SSearch for Searching Sequence Databases
SSearch performs a rigorous Smith-Waterman sequence similarity search of a query sequence against a sequence database. It applies the Smith-Waterman algorithm for pairwise alignments iteratively to every database entry by performing the following computations:
A. Align and score the query sequence against all members of the database using the Smith-Waterman algorithm.
B. Rank the search results by score.
It is one of the most sensitive methods available for sequence similarity searching, but it is much slower than the BLAST and FASTA search methods.
Hardware solution: Smith-Waterman searches on FPGAs (field-programmable gate arrays), which accelerate the search compared to CPUs.
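The two computations A and B can be illustrated with a toy search. This sketch is not the SSearch implementation: the function names, the simplified scoring (+1 match, -1 mismatch, linear gap penalty 1) and the tiny in-memory database are assumptions for illustration.

```python
def sw_score(x, y, match=1, mismatch=-1, d=1):
    """Smith-Waterman score only (no traceback), keeping two matrix rows."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            curr.append(max(0, prev[j - 1] + s, prev[j] - d, curr[j - 1] - d))
        best = max(best, max(curr))
        prev = curr
    return best


def rank_database(query, database):
    """A: score the query against every entry. B: rank the hits by score."""
    hits = [(sw_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, key=lambda h: h[0], reverse=True)


# Hypothetical toy database of named sequences.
database = {"a": "TTTACGTTT", "b": "ACGTACGT", "c": "GGGG"}
for score, name in rank_database("ACGTAC", database):
    print(name, score)
```

Every database entry requires a full dynamic programming matrix, which is why this rigorous approach is slow on large databases.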

64 Speed Acceleration by Heuristic Approaches
Exhaustive/Rigorous Approaches
- Can guarantee the optimum solution to a problem.
- Often impossible to compute because of extreme computation time. For example, it is impossible to compute all possible pairwise alignments between two sequences, due to the following relationship:
$$ \text{Number of possible alignments} \approx \frac{2^{2n}}{\sqrt{\pi n}}, \quad n = \text{length of both sequences} $$
Heuristic Approaches
- Overcome computation time limitations by providing approximate rather than complete solutions.
- The optimum solution is not guaranteed.
- Approximate solutions are often of acceptable accuracy.

65 BLAST: Basic Local Alignment Search Tool
- Developed by [Altschul 1990]
- Most widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- Available as online, command-line, and network clients

66 How BLAST Works
Step 1: Generate lookup hash table of query words
Step 2: Scan database for hits
Step 3: Ungapped extensions of hits
Step 4: Gapped extensions of hits with traceback
Step 5: Rank hits by scoring system

67 Step 1: Generate Lookup Table for Query
Create a lookup word table of the query sequence with a moving window of size w.
Example of a lookup word table for a query sequence with w = 3:
  Query: GTQITVEDLFY
  Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, LFY
Word size w for proteins: 2 or 3 (3 is default).
Word size w for BLASTN: min 7 (11 is default).
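A moving window of size w can be sketched as a dictionary that maps each word to its start positions in the query. This is only an illustration of the idea; the function name `build_lookup` is an assumption, and the neighborhood words that protein BLAST adds to the table are left out.

```python
from collections import defaultdict


def build_lookup(query, w=3):
    """Map each word of length w to the list of its start positions in the query."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return dict(table)


lookup = build_lookup("GTQITVEDLFY", w=3)
print(list(lookup))
```

A query of length L yields L - w + 1 words, so the 11-residue example query produces 9 words for w = 3.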

68 Step 2: Scan Database for Hits
Match the query lookup table against a similar lookup table of the database:
  Query:    ... LTV, ITV, MTV, LSV, MSV, IAV, ...
  Database: LTV, ITV, MTV, LSV, MSV, IAV, ...
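The scanning step can be sketched as sliding the same window over a database sequence and looking each word up in the query table. The sketch below is a simplification under stated assumptions: it reports exact word matches only, whereas the slide's example also lists similar (neighborhood) words such as LTV and MTV, and the function name `find_hits` is hypothetical.

```python
def find_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    lookup = {}
    for i in range(len(query) - w + 1):
        lookup.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        # One dictionary lookup per database word keeps the scan fast.
        for i in lookup.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


hits = find_hits("GTQITVEDLFY", "WHKLCGTQITVEDLAQFY")
print(hits)
```

All hits in this example lie on the same diagonal (subject position minus query position equals 5), which is the signal that the later extension steps build on.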

69 Step 3: Ungapped Extensions of Hits
Once a hit of a query word is found in the database, extend the hit along both sequences in either direction:
  Query:    ...  GTQITVEDLFY ...
  Database: WHKLCGTQITVEDLAQFY
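An ungapped extension can be sketched as walking left and right from the word hit while keeping the best running score. The scoring values, the drop-off threshold `x_drop`, and the function name `extend_ungapped` are assumptions made for this example and do not reproduce BLAST's actual parameters.

```python
def extend_ungapped(query, subject, q_pos, s_pos, w=3, match=2, mismatch=-1, x_drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, q_start, q_end, s_start), with q_end exclusive.
    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    # Extend to the right of the seed word.
    best_right = run = right_len = k = 0
    while q_pos + w + k < len(query) and s_pos + w + k < len(subject):
        run += score(query[q_pos + w + k], subject[s_pos + w + k])
        k += 1
        if run > best_right:
            best_right, right_len = run, k
        elif best_right - run > x_drop:
            break

    # Extend to the left of the seed word.
    best_left = run = left_len = k = 0
    while q_pos - 1 - k >= 0 and s_pos - 1 - k >= 0:
        run += score(query[q_pos - 1 - k], subject[s_pos - 1 - k])
        k += 1
        if run > best_left:
            best_left, left_len = run, k
        elif best_left - run > x_drop:
            break

    total = seed + best_left + best_right
    return total, q_pos - left_len, q_pos + w + right_len, s_pos - left_len


# Extend the ITV hit found at query position 3 and database position 8.
print(extend_ungapped("GTQITVEDLFY", "WHKLCGTQITVEDLAQFY", 3, 8))
```

In this example the extension covers GTQITVEDL and stops before FY, because without gaps F and Y face A and Q in the database sequence.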

70 Step 4: Gapped Extensions of Hits
Further extend the alignment by gapped alignment with a traceback method (dynamic programming):
  Query:    ...  GTQITVEDL--FY ...
  Database: WHKLCGTQITVEDLAQFY
Keep track of the score by using a substitution matrix (e.g. BLOSUM50). Stop when the score drops below some cutoff.
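The gapped alignment shown on the slide can be reproduced with a small dynamic programming routine with traceback. Note the substitutions made for brevity: the sketch runs a full Smith-Waterman alignment over both sequences rather than a score-limited extension from a hit, and it uses assumed match/mismatch/gap scores (+2, -1, -1) instead of a substitution matrix such as BLOSUM50. The function name `local_align` is hypothetical.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with traceback.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, query_row, db_row = local_align("GTQITVEDLFY", "WHKLCGTQITVEDLAQFY")
print(score)
print(query_row)
print(db_row)
```

With these scores the two-residue gap costs less than it gains by also aligning FY, so the gapped alignment scores higher than the ungapped match of GTQITVEDL alone.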

71 Step 5: Rank Hits by Scoring System
A BLAST search results in local alignments, which are called HSPs (high-scoring segment pairs). Their scores follow an extreme value distribution (EVD).
E values are the most relevant scores of a BLAST search result. They are derived from the analysis of the distribution of alignment scores.
Equation for calculating the E value of an HSP:
  $E = K m n e^{-\lambda S}$
  E = number of hits expected from the search with scores of at least S
  K = scale for search space (constant)
  m = size of query sequence
  n = size of database
  S = score
  λ = scale for the specific scoring matrix
Searches against larger databases give less significant E values than searches against smaller databases because of the n dependency in the above formula!
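The database size dependency is easy to see numerically. The sketch below evaluates E = K·m·n·exp(-λS) for the same score against two database sizes; the values chosen for K and λ are hypothetical placeholders for illustration, not the parameters of any real scoring matrix, and the function name `e_value` is an assumption.

```python
import math


def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S): expected number of chance hits scoring at least S."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical parameter values, chosen only to illustrate the n dependency.
K, lam = 0.1, 0.3
for n in (1_000_000, 100_000_000):
    print(n, e_value(score=60, m=250, n=n, K=K, lam=lam))
```

The same alignment score yields an E value 100 times larger in the database that is 100 times bigger, while each additional score unit reduces E exponentially.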

72 Meaning of E Value in BLAST Searches
The E value (expectation value) expresses the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
The lower the E value, the more significant the score.
Low E values are almost identical to P values:
  $E \approx P = 1 - e^{-E}$ for small $E$
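The relationship between E values and P values can be checked with a few numbers. This small sketch converts E values to P values using P = 1 - exp(-E); the function name `p_value` is an assumption made for the example.

```python
import math


def p_value(e):
    """Probability of at least one chance hit, given E value e: P = 1 - exp(-E)."""
    # expm1 keeps precision for very small E values.
    return -math.expm1(-e)


for e in (10.0, 1.0, 0.1, 0.001):
    print(e, round(p_value(e), 6))
```

For E values well below 1 the two numbers agree closely, whereas for large E values the P value saturates near 1.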

73 Some More Details
Nucleotide BLAST looks for exact matches:
  CGTAGCTACGTAGCTACTACTACGTAC
        TACGTAGCTAC
Protein BLAST requires two neighborhood matches:
  GTQITVEDLFYNI
    QIT    FYN
SEG and DUST programs are used to mask low-complexity regions in query sequences.

74 One Important Limitation of BLAST
Alignments that BLAST cannot find because of the word size limitation of 7 nucleotides for DNA sequences:
  Query:   ATCTACTACTACTTAGATCGAGCGTACGTGTTGACACACTATCTAC
  Subject: ATCTACCACTACTGAGATCGTGCGTACATGTTGAAACACTAGCTAC
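The limitation can be verified directly on the two sequences from the slide. The sketch below, with the hypothetical helper `shared_words`, collects all words of a given length from both sequences and intersects them, and also reports the percent identity of the position-by-position comparison.

```python
def shared_words(query, subject, w):
    """Return the set of words of length w that occur in both sequences."""
    q_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    s_words = {subject[j:j + w] for j in range(len(subject) - w + 1)}
    return q_words & s_words


query = "ATCTACTACTACTTAGATCGAGCGTACGTGTTGACACACTATCTAC"
subject = "ATCTACCACTACTGAGATCGTGCGTACATGTTGAAACACTAGCTAC"

identical = sum(q == s for q, s in zip(query, subject))
print(f"identity: {identical}/{len(query)}")
print("shared 7-mers:", len(shared_words(query, subject, 7)))
print("shared 6-mers:", len(shared_words(query, subject, 6)) > 0)
```

Although 40 of 46 positions are identical, the mismatches are spaced so that no exact word of length 7 is shared, so a search that seeds on exact 7-mers never starts an extension here.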

75 Untranslated and Translated BLAST Tools
Untranslated BLAST Tools
- BLASTN: query DNA vs DNA database
- BLASTP: query protein vs protein database
Translated BLAST Tools
- BLASTX: translated query DNA vs protein database
- TBLASTN: query protein vs translated DNA database
- TBLASTX: translated query DNA vs translated DNA database
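The translated tools compare at the protein level by conceptually translating DNA in all six reading frames (three per strand). The following sketch shows such a six-frame translation with the standard genetic code; it is an illustration only, not BLAST's own code, and the names `revcomp`, `translate`, and `six_frames` are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(dna):
    """Reverse complement of a DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", revcomp(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translations is a candidate protein sequence, which is why translated searches cost more computation than BLASTN or BLASTP.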

76 BLAST Flavors
Different BLAST Programs:
- blastall: command-line collection of BLAST tools
- BLAST: BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX
- Psi-BLAST: Position-Specific Iterated BLAST
- RPS-BLAST: Reverse Position-Specific BLAST
- Phi-BLAST: Pattern Hit Initiated BLAST
- Mega-BLAST: 10x faster than BLASTN
- BLAST2: pairwise comparisons
- WU-BLAST: Washington University BLAST

77 Sequence Formats
Format Definitions: help/help/sequence formats.htm
- FASTA format
- GenBank format
- Other formats: EMBL, SwissProt, GCG, ...
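As an illustration of the simplest of these formats: a FASTA record consists of a header line starting with ">" followed by one or more sequence lines. The minimal parser below is a sketch under that assumption only; the function name `parse_fasta` and the example records are hypothetical, and established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines of one record are joined into a single string.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


example = ">seq1 example protein\nGTQITVEDL\nFY\n>seq2\nWHKLC\n"
print(parse_fasta(example))
```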

79 Online Exercise
Continue on Exercise Page Link

81 References and Books
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215.
- Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16.
- Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, Vol 5.
- Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162.
- Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. PNAS 89.
- Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14.
- Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48.
- Notredame, Higgins, Heringa (2000) T-Coffee: A novel method for multiple sequence alignments. J Mol Biol 302.
- Pevsner J (2009) Bioinformatics and Functional Genomics. Wiley-Blackwell, 2nd edition.
- Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147.
- Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22.
