Bioinformatics Workshop - NM-AIST


1 Bioinformatics Workshop - NM-AIST Day 1 Sequence Alignments and Searching Thomas Girke July 23, 2012 Day 1, Sequence Alignments and Searching Slide 1/80

2 Outline
- Introduction to Bioinformatics and Genome Biology
- DNA Sequencing Basics
  - Traditional DNA Sequencing
  - Next Generation Sequencing
- Pairwise Sequence Alignments
  - Global Alignment
  - Local Alignment
- Multiple Sequence Alignments
- Sequence Similarity Searching
- Online Exercise
- References and Books

3 Introduction to Bioinformatics and Genome Biology

4 What is Bioinformatics?
Genomics: The study of entire genomes with modern large-scale analytical techniques. The term is often extended to other genome-wide omics areas, such as transcriptomics, proteomics, metabolomics and structural genomics.
Bioinformatics: The analysis of biological information using computational and statistical techniques. This includes data sets from all genomics and other genome-wide study areas.
Focus of this Workshop: A broad overview of important bioinformatics approaches, including genome sequencing, database techniques, structural, comparative and evolutionary genomics, and microarray, next generation sequence and small molecule analysis.

5 Bioinformatics = Inferring Knowledge from Data

6 Organization and Structure of Genomes
Figure labels: Genomes; Chromosomes; Genes; mRNA; Proteins

7 From Genomes to Genes

8 Genome Sizes
Columns: Organism, Genome Size (bp), Genes
Phage Phi-X 174 (virus)
Escherichia coli (bacterium) ,000
Saccharomyces cerevisiae (yeast) ,000
Arabidopsis thaliana (plant) ,000
Fritillaria assyriaca (plant) NA
Caenorhabditis elegans (nematode) ,500
Homo sapiens ,000
Only approximate numbers are given in this table. The numbers usually vary with the quality of the genome annotations. The most accurate numbers can be found in genome databases (e.g. NCBI).

9 Chromatin
Chromatin: complex of DNA and protein that makes up chromosomes. Histones are the major protein component.
Functions: packaging of DNA during mitosis and meiosis, and control of gene expression and DNA replication.

10 Nucleosome Structure
Figure: nucleosome crystal structure [Luger et al 1997]
Nucleosome core particle: 147 base pairs of DNA wrapped around a histone octamer consisting of two copies each of the histones H2A, H2B, H3 and H4. Histone H1 links different nucleosomes together to compact the chromatin structure.

11 From Genes to mRNAs to Proteins to Compounds

12 Genetic Code and Amino Acids
Name | Abbrev | Codons | Properties
Alanine | Ala A | GCT GCC GCA GCG | nonpolar
Arginine | Arg R | CGT CGC CGA CGG AGA AGG | pos charged
Asparagine | Asn N | AAT AAC | polar -CONH2
Aspartic acid | Asp D | GAT GAC | neg charged
Cysteine | Cys C | TGT TGC | polar -SH
Glutamine | Gln Q | CAA CAG | polar -CONH2
Glutamic acid | Glu E | GAA GAG | neg charged
Glycine | Gly G | GGT GGC GGA GGG | nonpolar
Histidine | His H | CAT CAC | pos charged
Isoleucine | Ile I | ATT ATC ATA | nonpolar
Leucine | Leu L | TTA TTG CTT CTC CTA CTG | nonpolar
Lysine | Lys K | AAA AAG | pos charged
Methionine | Met M | ATG (START codon) | nonpolar
Phenylalanine | Phe F | TTT TTC | nonpolar
Proline | Pro P | CCT CCC CCA CCG | nonpolar
Serine | Ser S | TCT TCC TCA TCG AGT AGC | polar -OH
Threonine | Thr T | ACT ACC ACA ACG | polar -OH
Tryptophan | Trp W | TGG | nonpolar
Tyrosine | Tyr Y | TAT TAC | polar -OH
Valine | Val V | GTT GTC GTA GTG | nonpolar
STOP | | TAA TAG TGA |
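The codon table can be used directly as a lookup table for translation. The following is a minimal Python sketch, not part of the original slides: the helper name `translate` is hypothetical, the input is assumed to be a DNA coding sequence read in frame from its first base, and translation is assumed to stop at the first STOP codon.

```python
# Codon table transcribed from the slide: one-letter amino acid -> DNA codons.
CODONS_BY_AA = {
    "A": "GCT GCC GCA GCG", "R": "CGT CGC CGA CGG AGA AGG", "N": "AAT AAC",
    "D": "GAT GAC", "C": "TGT TGC", "Q": "CAA CAG", "E": "GAA GAG",
    "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "L": "TTA TTG CTT CTC CTA CTG", "K": "AAA AAG", "M": "ATG",
    "F": "TTT TTC", "P": "CCT CCC CCA CCG", "S": "TCT TCC TCA TCG AGT AGC",
    "T": "ACT ACC ACA ACG", "W": "TGG", "Y": "TAT TAC", "V": "GTT GTC GTA GTG",
    "*": "TAA TAG TGA",  # STOP codons
}

# Invert the table so that each codon maps to its amino acid.
CODON_TO_AA = {
    codon: aa for aa, codons in CODONS_BY_AA.items() for codon in codons.split()
}


def translate(dna):
    """Translate an in-frame DNA sequence until the first STOP codon (hypothetical helper)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TO_AA[dna[i:i + 3].upper()]
        if aa == "*":  # STOP codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTTGGTAA"))
```

Inverting the table once keeps each codon lookup a single dictionary access.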

13 Terminology of Gene Elements and Important Processes
- Genome: hereditary information of an organism encoded in DNA (RNA); contains genes and intergenic regions
- Gene: transcribed region; coding and non-coding genes (e.g. miRNA)
- mRNA: messenger RNA
- TU: transcriptional unit
- UTR: untranslated region
- ORF: open reading frame reaching from START to STOP codon
- CDS: coding sequence
- Promoter: regulatory region controlling gene expression
- Transcription: formation of mRNA
- Splicing: removal of introns converts pre-mRNA to mRNA
- Translation: conversion of mRNA to protein

14 Important Regulatory Processes
- Transcriptional control: promoter elements (DNA), regulatory RNAs, transcription factors (proteins)
- Post-transcriptional control: mRNA turnover and availability
- Translational control: translational control factors (proteins and RNAs)
- Post-translational control: protein activity modulations by modifications (e.g. phosphorylation by kinases)
- Control by small RNAs: miRNAs and siRNAs controlling gene expression on transcriptional and translational levels

15 What is Measured by Omics Technologies?

16 DNA Sequencing Basics

17 What are we sequencing?
DNA sequencing is more efficient, because:
- DNA cloning and amplification are easy.
- Efficient enzymatic sequencing reactions are available.
Protein sequencing is much harder, because:
- No cloning or amplification techniques are available.
- Enzymatic sequencing techniques are of limited availability.
- The chemical nature of proteins makes sequencing difficult.

18 DNA Libraries
A DNA library consists of cloned DNA fragments that can represent the entire genome of an organism (genomic DNA library) or its mRNA sequences only (cDNA library).
Genomic library
- Often contains the entire DNA content of an organism.
- Suitable for determining the genomic DNA sequence.
- Requires chromosomal DNA isolation.
cDNA library
- Contains the mRNAs that are expressed in a tissue sample.
- mRNA is used as starting material.
- mRNA needs to be reverse transcribed into cDNA.
- Requires mRNA isolation.
Challenges: cDNA libraries tend to be incomplete with regard to:
- 5' sequences
- Representation of all genes in the genome

19 DNA Sequencing Basics: Traditional DNA Sequencing

20 Chemical Sequencing by Maxam & Gilbert
1 Uses radioactively labeled DNA fragments of 500 bp.
2 Four separate chemical treatments generate DNA breaks at the positions G, A+G, C and C+T.
3 The fragments are size-separated by gel electrophoresis in four separate lanes.
4 The fragments are visualized by autoradiography on an X-ray film.
Figure panels: Chemical DNA Degradation; Gel Electrophoresis

21 Sanger Dideoxy or Chain-Termination Sequencing
Based on an enzymatic DNA elongation reaction that is randomly terminated with dideoxynucleotides.
Reaction mix:
- Single-stranded DNA template, DNA primer, DNA polymerase, radioactively or fluorescently labeled nucleotides
- Dideoxynucleotides to terminate the DNA strand elongation

22 Required Sequencing Reactions for Human Genome
- 3,000,000,000 / 800 = 3,750,000 dye-terminator sequencing reactions with a read length of 800 bp are necessary to obtain a sequence set with the same number of bases as the human genome (3,000,000,000 bp).
- The 3,750,000 sequences of 800 bp provide less than a one-time coverage of the genome.
- At least 10-times coverage is necessary to complete a genome sequence, which totals 37,500,000 sequences of 800 bp.
- With a 384-capillary sequencer one can obtain the required sequences with 37,500,000 / 384 ≈ 97,656 runs.
- Assuming one sequencing run per day, a single capillary sequencer would need 97,656 / 365 ≈ 268 years to finish the human genome, or 268 sequencers would be needed to finish it in one year.
- The cost of a capillary sequencer is around $500,000. Including reagent and labor cost, the human genome sequence was a $1.5-3 billion investment.
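The estimate can be reproduced with a few lines of arithmetic. This Python sketch only restates the numbers from the slide; the variable names are illustrative.

```python
GENOME_BP = 3_000_000_000   # human genome size used on the slide
READ_BP = 800               # read length per dye-terminator reaction
COVERAGE = 10               # minimum coverage quoted on the slide
CAPILLARIES = 384           # reads per sequencing run
RUNS_PER_YEAR = 365         # one run per day

reads_1x = GENOME_BP // READ_BP        # reads for one genome equivalent
reads_total = reads_1x * COVERAGE      # reads for 10-times coverage
runs = reads_total / CAPILLARIES       # sequencing runs required
years = runs / RUNS_PER_YEAR           # years on a single sequencer

print(f"reads for 1x coverage : {reads_1x:,}")
print(f"reads for 10x coverage: {reads_total:,}")
print(f"runs on one sequencer : {runs:,.0f}")
print(f"years on one sequencer: {years:.0f}")
```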

23 DNA Sequencing Basics: Next Generation Sequencing

24 Example: Illumina/Solexa Technology
Figure panels: Illumina Sequencer; Flow Cell

25 Basic Steps of Illumina/Solexa Sequencing Technology
Compare with illustration on next three slides!
Flow Cell Loading
1 Generate DNA library (genomic- or cDNA-based) with insert length of 200 bp.
2 Load library onto flow cell (nano device for liquid handling).
3 PCR-based bridge amplification of loaded fragments to obtain DNA clusters (serves as signal amplification).
Sequencing Cycles
4 Start reversible dye-terminator reaction containing primer and labeled dNTPs among other components.
5 Image scan to detect the identity of the first base of each cluster via the characteristic fluorescence signal of each labeled nucleotide.
6 De-protection step removes the blocking group and fluorescence group of the incorporated nucleotide.
7 Repeat steps 4-6 about … times.

26 Loading of Flow Cell

27 Sequencing Cycles

28 Processing of Sequencing Raw Data
- Assign a quality score to image intensity values or peak areas.
- The frequently used Phred scores are log10-transformed error probabilities, $Q = -10 \log_{10} P$:
  - score = 20 corresponds to a 1% error rate
  - score = 30 corresponds to a 0.1% error rate
  - score = 40 corresponds to a 0.01% error rate
- The base calling (A, T, G or C) is performed based on Phred scores.
- Ambiguous positions with low Phred scores (around 20 or below) are often assigned N.
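The relationship between Phred scores and error probabilities can be checked with a short Python sketch. The function names are hypothetical, and the masking threshold of 20 is an assumption chosen to mirror the slide, not a fixed rule of any base caller.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mask_low_quality(bases, scores, threshold=20):
    """Replace bases whose Phred score is below the (assumed) threshold with N."""
    return "".join(b if q >= threshold else "N" for b, q in zip(bases, scores))


for q in (20, 30, 40):
    print(q, f"{phred_to_error(q):.2%}")

print(mask_low_quality("ACGTAC", [35, 12, 40, 19, 20, 38]))
```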

29 Sequence Assembly
Figure labels: Genomic Library Sequencing; Shotgun Sequences; Assembly by Alignment; Contigs; Scaffolds; Chromosome; Contig Sequences (gaps shown as NNN); Additional Data; Finishing Steps

30 Pairwise Sequence Alignments

31 Illustration of Sequence Alignment Process
Goal: maximize the number of identical and similar residues in the columns of the alignment.
Unaligned Sequences
P10632 LKNLNTTAVFMPFSAGKRICAGEGLARMELFGGLFLTTILQNFNLKSVDD
P08686 LAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPSGD
Steps: identify alignment start; insert gaps if necessary.
Aligned Sequences
P10632    LKNLNTTAVFMPFSAGKRICAGEGLARMELFGGLFLTTILQNFNLKSVDD
P08686    ..........LAFGCGARVCLGEPLARLELF..VVLTRLLQAFTLLPSGD
consensus ............f..g.r.c.ge.lar.elf....lt..lq.f.l....d
Figure: sequence logo of the alignment

32 Why Compare Sequences?
Background
The evolution of biological sequences is mainly driven by gene duplications, point mutations, insertions and deletions. Alignment algorithms are the central tool to detect these events and to perform sequence similarity analyses in general.
Utilities
- Functional analyses: conserved sequence regions are functionally important.
- Evolutionary analyses: sequence divergence patterns can be used to reconstruct phylogenetic relationships.
- Mutation and SNP analyses
- Comparative genomics
- Sequence similarity searching is based on alignment methods.
- Many other utilities

33 Why Gapped Alignments?
Sequences evolve by complex mutation processes:
- Gene duplications*
- Gene deletions*
- Point mutations
  - Substitutions
  - Insertions*
  - Deletions*
*require gaps

34 Sequence Alignment Concepts
String Matching (lacks gaps of alignment approach)
HEAGAWGHEE
AWGHE
Global Pairwise Alignment
HEAGAWGHE-E
--P-AW-HEAE
Local Pairwise Alignment
AWGHE
AW-HE
Multiple Alignment
HEAGAWGHE-E
HDACAWGHE-E
HDAC-WGHE-E
HD-CSTGHE-E
--P-AW-HEAE

35 Important Steps in Pairwise Alignment Process
1 Type of alignment
2 Scoring system
3 Algorithm to find best scoring alignment
4 Statistics to evaluate significance of alignment score

36 Best Sequence Type for Alignments
Divergent Sequences: If available, use protein sequences, because of higher information content, better scoring system, reliability of alignment, functional constraints, etc.
Similar Sequences: Protein or DNA sequence depending on analysis needs.

37 Scoring Parameters for Alignments
Substitution matrix: Empirically determined rates at which one residue in a sequence changes to another residue over time. These substitution rates are typically expressed as log-odds scores:
$s_{i,j} = \log \frac{p_i M_{i,j}}{p_i p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}}$
$M_{i,j}$ = probability of amino acid i transforming into amino acid j; $p_i$ = frequency of amino acid i.
Gap Opening Penalty: Penalty score for gap insertion. Often a severe value to minimize the number of gaps.
Gap Extension Penalty: Penalty score for gap extension. Often a severe value to minimize the length of gaps.
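A small numerical illustration of the log-odds score may help. The Python sketch below assumes base-2 logarithms and uses made-up frequencies; real matrices such as BLOSUM or PAM additionally scale and round the scores, which is not shown here. The function name `log_odds` is hypothetical.

```python
import math


def log_odds(p_i, p_j, m_ij, base=2):
    """Log-odds score: log of observed (p_i * M_ij) over expected (p_i * p_j) frequency."""
    observed = p_i * m_ij
    expected = p_i * p_j
    return math.log(observed / expected, base)


# Made-up numbers for illustration only.
# Substitution seen twice as often as expected by chance -> positive score.
print(log_odds(p_i=0.1, p_j=0.1, m_ij=0.2))
# Substitution seen half as often as expected by chance -> negative score.
print(log_odds(p_i=0.1, p_j=0.1, m_ij=0.05))
```

Positive scores therefore mark substitutions that occur more often than chance predicts, and negative scores mark those that occur less often.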

38 Scoring or Substitution Matrices
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix)
Based on a functional model for analyzing divergent protein sequences [Henikoff & Henikoff 1992]. The log-odds scores were obtained from the substitution probabilities in conserved and gap-less regions of protein families in the BLOCKS database, one for each of the possible substitutions of the 20 standard amino acids. Matrices with low values (e.g. BLOSUM50) are for divergent sequences, and matrices with high values (e.g. BLOSUM80) are for more closely related sequences.
PAM (Point Accepted Mutation Matrix)
Based on an evolutionary model for analyzing protein sequences [Dayhoff et al 1978]. The mutations are considered throughout the global alignment in conserved and unconserved regions of many well studied protein families. PAM matrices with higher numbers are for studying evolutionarily distant sequences, while PAMs with smaller numbers are for more closely related sequences (the opposite of BLOSUM matrices).
Matrices for DNA and RNA alignments
Often simple scoring matrices are used where matches have a positive match score, mismatches a negative mismatch score, and gaps a negative gap penalty.

39 Substitution Matrix BLOSUM50 (NCBI Matrix Download)
Matrix with rows and columns: A R N D C Q E G H I L K M F P S T W Y V B Z X
B (Asx): aspartic acid, asparagine; Z (Glx): glutamic acid, glutamine; X (Xaa): other amino acid

40 Gap Penalties
To build high-quality alignments, it is important to control the number and the length of the gaps that are introduced during the alignment building process. This is achieved by selecting gap penalties for the alignment scoring process, which are usually negative values.
Constant gap penalty
Every gap receives the same penalty independent of its size.
Linear gap penalty
Linear gap penalties have only one parameter (d); the penalty grows linearly with the length of the gap. Disadvantage: the overall penalty for one large gap is the same as for many small gaps that add up to the same length.
Affine gap open and extension penalties [most commonly used!]
Attempts to overcome the problem of the linear gap penalty by using a gap opening penalty (o) and a gap extension penalty (e). Their values are often set so that gap insertions are discouraged and longer gaps are favored over many short gaps.
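The three penalty schemes can be compared numerically. In this Python sketch the function names and parameter values are illustrative assumptions, and the affine form charges o for the first gap position and e for each further position, which is one common convention among several.

```python
def constant_gap(length, penalty):
    """Same penalty for every gap, independent of its length."""
    return -penalty if length > 0 else 0


def linear_gap(length, d):
    """Penalty proportional to gap length."""
    return -d * length


def affine_gap(length, o, e):
    """Opening penalty o for the first position, extension penalty e for each further one."""
    return -(o + e * (length - 1)) if length > 0 else 0


# One gap of length 6 versus three gaps of length 2 (same total length).
one_long_linear = linear_gap(6, d=8)
three_short_linear = 3 * linear_gap(2, d=8)
one_long_affine = affine_gap(6, o=12, e=2)
three_short_affine = 3 * affine_gap(2, o=12, e=2)

print("linear:", one_long_linear, three_short_linear)   # identical penalties
print("affine:", one_long_affine, three_short_affine)   # long gap is cheaper
```

With the linear scheme both scenarios cost the same, whereas the affine scheme favors the single long gap.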

41 Pairwise Sequence Alignments: Global Alignment

42 Global Alignment: Needleman-Wunsch Algorithm
- Initial algorithm [Needleman and Wunsch 1970]
- Improved version [Gotoh 1982]

43 Dynamic Programming Alignment Algorithm
Impossible to calculate all possible alignments
The number of possible alignments:
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
n = length of both sequences
Solution: dynamic programming algorithm
Algorithm for finding an optimal alignment between two sequences with an additive scoring system.
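To get a feeling for how fast this number grows, the exact count and the approximation can be evaluated side by side. This Python sketch simply evaluates the formula above; the function names are illustrative.

```python
import math


def exact_alignments(n):
    """Exact count from the formula: (2n)! / (n!)^2."""
    return math.comb(2 * n, n)


def approx_alignments(n):
    """Approximation from the formula: 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (10, 50, 100):
    exact = exact_alignments(n)
    approx = approx_alignments(n)
    print(f"n={n:4d}  exact={exact:.3e}  approx={approx:.3e}  ratio={approx / exact:.4f}")
```

Even for two sequences of length 100 the count is far too large to enumerate, which is the motivation for dynamic programming.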

44 Main Steps in Dynamic Programming Alignment Algorithms
- Recurrence rules: dynamic programming matrix
- Boundary conditions: gaps, termination and extensions
- Traceback step: optimal alignment

45 Global Sequence Alignment Algorithm
The dynamic programming approach builds an optimal alignment stepwise by adding solutions of optimal sub-alignments. This is achieved with the following steps:
1.1 Construct dynamic programming matrix $F(i, j)$ where the rows i and columns j represent the residues of the two sequences.
2.1 Fill the matrix from the top left to bottom right with the largest score of three possible solutions. The first cell is initialized with $F(0, 0) = 0$.
$F(i, j) = \max \{\, F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \,\}$
$F(i, j)$ = additive score of the best sub-solution; $s(x_i, y_j)$ = substitution score; d = gap penalty
2.2 Boundary rows and columns are filled with $F(i, 0) = -id$ and $F(0, j) = -jd$.
2.3 Apply the operation repeatedly to the bottom right corner of each square of four cells (see next slide).
2.4 In each step store a pointer for each cell back to the cell from which $F(i, j)$ was derived.
2.5 The value in the final bottom right cell is the final score of the alignment.
3.1 Generate the alignment by the traceback method: starting from the final cell, move back along the stored pointers and align residues according to the movement directions.
- Diagonal movement: align corresponding residues
- Up or left movement: insert gaps accordingly
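The steps above can be sketched in Python as follows. This is an illustrative implementation with a linear gap penalty, not the code of any particular alignment program. To stay self-contained it hard-codes only the BLOSUM50 entries needed for the example sequences HEAGAWGHEE and PAWHEAE; the names `B50_SUBSET`, `sub_score` and `needleman_wunsch` are hypothetical.

```python
# BLOSUM50 entries for the residue pairs occurring between the example
# sequences only (keys are the two residues in alphabetical order).
B50_SUBSET = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}


def sub_score(a, b):
    """Substitution score s(a, b); the matrix is symmetric."""
    return B50_SUBSET["".join(sorted((a, b)))]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Step 2.2: boundary conditions F(i, 0) = -i*d and F(0, j) = -j*d.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Steps 2.1, 2.3 and 2.4: fill the matrix and store a pointer per cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])
    # Step 3.1: traceback from the bottom right cell.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, aligned_x, aligned_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", d=8)
print(score)
print(aligned_x)
print(aligned_y)
```

Ties between equally good moves are broken in favor of the diagonal here, so the printed alignment is one of possibly several optimal alignments with the same score.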

46 Three Possibilities: Align, Insert or Delete
Figure: cell $F(i, j)$ is derived from $F(i-1, j-1)$ by adding $s(i, j)$, or from $F(i-1, j)$ or $F(i, j-1)$ by subtracting d.
$F(i, j)$ = largest of 3 possible solutions

47 Dynamic Programming Matrix: Global Alignment
Substitution matrix: BLOSUM50
Gap opening and extension penalties: 8
Sequences: HEAGAWGHEE and PAWHEAE
Resulting alignment:
HEAGAWGHE-E
--P-AW-HEAE
Final Score = 1

48 Complexity of Algorithm
Time and memory cost: O(nm), where n and m are the lengths of the two sequences. The cost therefore grows with the product of the two sequence lengths.

49 Pairwise Sequence Alignments: Local Alignment

50 Local Alignment: Smith-Waterman Algorithm
Often more important than global alignment, because related sequences frequently show only local similarities.
- Initial algorithm [Smith and Waterman 1981]
- Improved version [Gotoh 1982]

51 Local Sequence Alignment Algorithm
The algorithm is closely related to the global alignment approach, with the following modifications:
1.1 Same as global alignment.
2.1 One more possible solution is added to $F(i, j)$. If all other solutions are less than zero, then $F(i, j)$ is set to 0:
$F(i, j) = \max \{\, 0,\; F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \,\}$
$F(i, j)$ = additive score of the best sub-solution; $s(x_i, y_j)$ = substitution score; d = gap penalty
2.2 Boundary rows and columns are filled with zeros.
2.3 Taking the value zero corresponds to starting a new alignment.*
2.4 Alignments can start and end anywhere in the matrix.
3.1 Best local alignment: traceback from the highest score in the matrix until the first cell with zero is reached. The value in the initial traceback cell is the alignment score.
*Important requirement: random matches must receive negative values from the scoring system, otherwise long unrelated matches would mask significant local matches!
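The modifications listed above can be sketched in Python as follows. This is an illustrative implementation with a linear gap penalty. It hard-codes only the BLOSUM50 entries needed for the example sequences HEAGAWGHEE and PAWHEAE, and the names `B50_SUBSET`, `sub_score` and `smith_waterman` are hypothetical.

```python
# BLOSUM50 entries for the residue pairs occurring between the example
# sequences only (keys are the two residues in alphabetical order).
B50_SUBSET = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}


def sub_score(a, b):
    """Substitution score s(a, b); the matrix is symmetric."""
    return B50_SUBSET["".join(sorted((a, b)))]


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    # Step 2.2: boundary rows and columns stay at zero.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [["stop"] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step 2.1: zero is a fourth option and starts a new alignment.
            options = [
                (0, "stop"),
                (F[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Step 3.1: traceback from the highest score until a zero cell is reached.
    ax, ay = [], []
    i, j = best_cell
    while ptr[i][j] != "stop":
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, aligned_x, aligned_y = smith_waterman("HEAGAWGHEE", "PAWHEAE", d=8)
print(score)
print(aligned_x)
print(aligned_y)
```

Compared with the global algorithm, only the zero option, the zero boundaries and the start and stop points of the traceback differ.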

52 Dynamic Programming Matrix: Local Alignment
Substitution matrix: BLOSUM50
Gap opening and extension penalties: 8
Sequences: HEAGAWGHEE and PAWHEAE
Highest ranking alignment:
AWGHE
AW-HE
Score of highest ranking alignment = 28

53 Additional Algorithms for Pairwise Alignments
- Repeat alignment algorithm: algorithm for obtaining all non-overlapping local alignments with significant scores.
- Maximum overlap match algorithm: global alignment algorithm that does not penalize overhanging ends, as in the sequence assembly problem.
- Algorithms for allowing long gaps: aligning cDNA (no introns) to genomic DNA (introns).
- Many more algorithms for specific alignment problems

54 Multiple Sequence Alignments

55 Multiple Sequence Alignments
Example: Five Human P450 Hydroxylases
P10632 ETTSTTLRYGLLLLLKHPEVTAKVQEEIDHVIGRHRSPCM...QDRSHMPYTDAV 351
P11712 ETTSTTLRYALLLLLKHPEVTAKVQEEIERVIGRNRSPCM...QDRSHMPYTDAV 351
P08686 ETTANTLSWAVVFLLHHPEIQQRLQEELDHELGPGASSSRVPYKDRARLPLLNAT 348
O15528 DTVSNTLSWALYELSRHPEVQTALHSEITAALSPG.SSAYPSATVLSQLPLLKAV 373
P08684 ETTSSVLSFIMYELATHPDVQQKLQEEIDAVLP...NKAPPTYDTVLQMEYLDMV 359
consensus ETTS.TLs.al..Ll.HPEVq.klQEEId.vlg...S...drs.mPyldAV

P10632 VHEIQRYSDLVPTGVPHAVTTDTKFRNYLIPKGTTIMALLTSVLHDDKEFPNPNI 406
P11712 VHEVQRYIDLLPTSLPHAVTCDIKFRNYLIPKGTTILISLTSVLHDNKEFPNPEM 406
P08686 IAEVLRLRPVVPLALPHRTTRPSSISGYDIPEGTVIIPNLQGAHLDETVWERPHE 403
O15528 VKEVLRLYPVVP.GNSRVPDKDIHVGDYIIPKNTLVTLCHYATSRDPAQFPEPNS 427
P08684 VNETLRLFPIAM.RLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEK 413
consensus V.EvlRl.p.vP..lph..t.D...Y.IPKGT.i...l...D.k.fp.P..

P10632 FDPGHFLDKNGNFKKSDYFMPFSAGKRICAGEGLARMELFLFLTTILQNFNLKSV 461
P11712 FDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSL 461
P08686 FWPDRFLEPGKNSRA...LAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPS 454
O15528 FRPARWLGEGPTP.HPFASLPFGFGKRSCMGRRLAELELQMALAQILTHFEVQP. 480
P08684 FLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPC 468
consensus F.P.rFL..g.n...PFg.GkR.C.Ge.LA.mELfl.Lt.iLQnF.lkp.

PROSITE P450 signature: [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]
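The signature at the bottom of the slide can be checked against the aligned sequences with a regular expression. The Python sketch below assumes the usual PROSITE conventions (square brackets list allowed residues, curly braces list excluded residues, x matches any residue) and hand-translates the pattern into Python `re` syntax. The sequences are the last alignment block with gap characters removed.

```python
import re

# Hand translation of the PROSITE-style pattern:
# [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]
P450_SIGNATURE = re.compile(r"[FW][SGNH].[GD][^F][RKHPT][^P]C[LIVMFAP][GAD]")

# Last block of the alignment from the slide (gaps are shown as dots).
ALIGNED_BLOCK = {
    "P10632": "FDPGHFLDKNGNFKKSDYFMPFSAGKRICAGEGLARMELFLFLTTILQNFNLKSV",
    "P11712": "FDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSL",
    "P08686": "FWPDRFLEPGKNSRA...LAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPS",
    "O15528": "FRPARWLGEGPTP.HPFASLPFGFGKRSCMGRRLAELELQMALAQILTHFEVQP.",
    "P08684": "FLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPC",
}

matches = {}
for accession, aligned in ALIGNED_BLOCK.items():
    ungapped = aligned.replace(".", "")   # remove alignment gaps before searching
    hit = P450_SIGNATURE.search(ungapped)
    matches[accession] = hit.group(0) if hit else None
    print(accession, matches[accession])
```

All five hydroxylases carry the motif around the conserved cysteine, which is the kind of shared feature that only becomes visible in a multiple alignment.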

56 Why Multiple Sequence Alignments
- Conserved residues can only be identified in the context of many sequences.
- Identification of functional residues, motifs and domains.
- Often the final proof that sequences belong to one family.
- Analysis of evolutionary relationships
- Phylogenetic analyses
- Threading and homology modeling
- Functional mapping of mutations
- Many more utilities

57 Typical Sequence Analysis Routine
Figure labels: Sequence candidate; Sequence & Domain Database Searches; Multiple Sequence Alignments; Phylogenetic Tree; Homology Modeling

58 Feng-Doolittle Progressive Multiple Alignment
1 Calculate all pairwise alignments.
2 Use scores to calculate pairwise distance matrix.
3 Construct guide tree based on distances using fast clustering algorithm.
4 Align sequences in order defined by tree moving from leaves to root.
5 Alignments are constructed by pairwise dynamic programming algorithm.
6 When sequences or alignments are aligned to existing alignments, the highest scoring pairwise alignment defines the global alignment. Existing gaps will never be removed: once a gap, always a gap.

59 Additional Examples
- CLUSTALW: most widely used multiple alignment program, developed by [Thompson et al 1994].
- Multalin: uses iterative hierarchical clustering for guide tree formation [Corpet 1988].
- T-Coffee: reevaluates sequence alignments in each iteration of the progressive alignment approach [Notredame et al 2000].
- Dialign: no gap penalty, to identify sequence similarities with long gaps [Morgenstern et al 1998].

60 Alignment Programs
Pairwise Alignment Programs
- Smith-Waterman local alignment: WATER (EMBOSS)
- BLAST-like local alignments: BLAST2
- Global alignments: NEEDLE (EMBOSS)
Multiple Alignment Programs
- ClustalW multiple alignment: EMMA
- Hierarchical clustering: MultAlin
- For diverse sequences: MSA from NCBI
- For diverse sequences: T-Coffee
- For diverse sequences: MUSCLE
- For diverse sequences: HMMER (hmmalign)
- For local similarities (long gaps): DIALIGN
- DNA alignment guided by protein alignment: TRANALIGN
- Align cDNAs to genome: EST2GENOME

61 Sequence Similarity Searching

62 Why Sequence Similarity Searching?
Essential tool for retrieving related sequences from databases by providing protein or DNA sequences as queries.
Applications
- Similarity-function principle to predict gene functions: if the function of a query sequence is unknown, and a sequence similarity search retrieves highly similar sequences of known function, then it is likely that the query sequence has a similar function.
- Principle of relatedness for evolutionary analyses: sequence similarities can be used to reconstruct phylogenetic relationships. For example: identification of sequences with a common ancestor, such as orthologs and paralogs.
- Discovery of new genes or proteins.
- Exploring gene and protein structures.
- ...

63 SSearch for Searching Sequence Databases
list.html
SSearch performs a rigorous Smith-Waterman sequence similarity search of a query sequence against a sequence database. It applies the Smith-Waterman algorithm for pairwise alignments (see previous lecture LA3) iteratively by performing the following computations:
A. Align and score a query sequence against all members of the database using the Smith-Waterman algorithm.
B. Rank search results by score.
It is one of the most sensitive methods available for sequence similarity searching. It is much slower than the BLAST and FASTA search methods.
Hardware solution: Smith-Waterman searches on FPGAs (field-programmable gate arrays) provide acceleration compared to CPUs.

64 Speed Acceleration by Heuristic Approaches
Exhaustive/Rigorous Approaches
- Can guarantee the optimum solution to a problem.
- Often impossible to compute because of extreme computation time. For example, it is impossible to compute all possible pairwise alignments between two sequences, due to the following relationship:
Number of possible alignments $\approx \frac{2^{2n}}{\sqrt{\pi n}}$, with n = length of both sequences
Heuristic Approaches
- Overcome computation time limitations by providing approximate rather than complete solutions.
- The optimum solution is not guaranteed.
- Approximate solutions are often of acceptable accuracy.

65 BLAST: Basic Local Alignment Search Tool
- Developed by [Altschul et al 1990]
- Most widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- Online, command-line, and network clients

66 How BLAST Works
Step 1: Generate lookup hash table of query words
Step 2: Scan database for hits
Step 3: Ungapped extensions of hits
Step 4: Gapped extensions of hits with traceback
Step 5: Rank hits by scoring system

67 Step 1: Generate Lookup Table for Query Create a lookup word table of the query sequence with a moving window of size w. Example of a lookup word table of a query sequence with w = 3:
Query: GTQITVEDLFY
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, LFY
Word size w for proteins: 2 or 3 (3 is default). Word size w for BLASTN: min 7 (11 is default).
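
The moving-window construction can be sketched in a few lines. The function name `build_lookup_table` is hypothetical; the table maps each word to the query positions where it starts.

```python
from collections import defaultdict


def build_lookup_table(query, w=3):
    """Map every word of length w to its start positions in the query."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return dict(table)


table = build_lookup_table("GTQITVEDLFY", w=3)
print(list(table))
```

A query of length L yields L - w + 1 words, here 11 - 3 + 1 = 9.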

68 Step 2: Scan Database for Hits Match the query lookup table against a similar lookup table of the database:
Query: ... LTV, ITV, MTV, LSV, MSV, IAV, ...
Database: LTV, ITV, MTV, LSV, MSV, IAV, ...
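
A minimal sketch of the scanning step is shown below. It is simplified on purpose: only exact word matches are reported, whereas the slide's example lists similar (neighborhood) words such as LTV or MTV for ITV, which would require a substitution matrix and a score threshold. The function name `find_hits` is an assumption.

```python
def find_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    # Index the query words first (Step 1).
    table = {}
    for i in range(len(query) - w + 1):
        table.setdefault(query[i:i + w], []).append(i)
    # Slide over the database sequence and look up each word.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


hits = find_hits("GTQITVEDLFY", "WHKLCGTQITVEDLAQFY")
print(hits)
```

All hits in this example lie on the same diagonal (subject position minus query position equals 5), which is the signal that the extension steps build on.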

69 Step 3: Ungapped Extensions of Hits Once a hit of a query word is found in the database, extend the hit along both sequences in either direction:
Query:    ...  GTQITVEDLFY ...
Database: WHKLCGTQITVEDLAQFY
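
The ungapped extension can be sketched as follows. The scores (match 2, mismatch -1) and the drop-off threshold `x_drop` are illustrative assumptions, not BLAST defaults; the extension in each direction stops once the running score falls more than `x_drop` below the best score seen so far.

```python
def extend_ungapped(query, subject, q_pos, s_pos, w=3,
                    match=2, mismatch=-1, x_drop=3):
    """Extend a word hit without gaps; return (score, q_start, q_end), end exclusive."""
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    q_start, q_end = q_pos, q_pos + w

    # Extend to the right of the seed.
    running, i, j = best, q_pos + w, s_pos + w
    while i < len(query) and j < len(subject) and best - running <= x_drop:
        running += score(query[i], subject[j])
        i += 1
        j += 1
        if running > best:
            best, q_end = running, i

    # Extend to the left of the seed.
    running, i, j = best, q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0 and best - running <= x_drop:
        running += score(query[i], subject[j])
        if running > best:
            best, q_start = running, i
        i -= 1
        j -= 1
    return best, q_start, q_end


query, subject = "GTQITVEDLFY", "WHKLCGTQITVEDLAQFY"
score, start, end = extend_ungapped(query, subject, q_pos=3, s_pos=8)
print(score, query[start:end])
```

Starting from the ITV hit, the extension covers GTQITVEDL and stops before the mismatching F/A and Y/Q positions, since bridging the inserted AQ would need a gap.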

70 Step 4: Gapped Extensions of Hits Further extend the alignment by gapped alignment with the traceback method (dynamic programming):
Query:    ...  GTQITVEDL--FY ...
Database: WHKLCGTQITVEDLAQFY
Keep track of the score by using a substitution matrix (e.g. BLOSUM50). Stop when the score drops below some cutoff.
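
The gapped alignment with traceback can be illustrated with a small Smith-Waterman implementation. This is a sketch under simplifying assumptions: it fills the full matrix rather than extending from a seed with a score cutoff, and it uses match 2, mismatch -1, and a linear gap penalty of -1 in place of a substitution matrix such as BLOSUM50.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with traceback; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, row_q, row_d = local_align("GTQITVEDLFY", "WHKLCGTQITVEDLAQFY")
print(score)
print(row_q)
print(row_d)
```

With these scores the two gap positions cost less than the two extra matches gain, so the alignment now spans the whole query, as in the slide's example.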

71 Step 5: Rank Hits by Scoring System A BLAST search results in local alignments, which are called HSPs (high-scoring segment pairs). Their scores follow an extreme value distribution (EVD). E values are the most relevant scores of a BLAST search result. They are derived from the analysis of the distribution of alignment scores. Equation for calculating the E value of an HSP:
$E = K m n e^{-\lambda S}$
E = number of hits expected from the search with scores of at least S
K = scale for search space (constant)
m = size of query sequence
n = size of database
S = score
λ = scale for the specific scoring matrix
Searches against larger databases give less significant E values than searches against smaller databases because of the dependency on n in the above formula!
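
The dependency of the E value on database size can be made concrete with a direct evaluation of the formula. The values of K and λ below are placeholders chosen only for illustration; real values depend on the scoring matrix and gap penalties used in the search.

```python
import math


def e_value(score, m, n, K=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); K and lam are placeholder constants."""
    return K * m * n * math.exp(-lam * score)


small_db = e_value(score=50, m=100, n=10 ** 6)
large_db = e_value(score=50, m=100, n=10 ** 9)
print(small_db, large_db)
```

The same score in a database that is 1000 times larger gives an E value that is 1000 times larger, while each additional score unit lowers E by the factor exp(-λ).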

72 Meaning of E Value in BLAST Searches The E value (expectation value) expresses the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. Low E values are almost identical to P values:
E value ≈ P value, where $P = 1 - e^{-E}$
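
A short numerical check shows where E and P values agree and where they diverge. The helper name `p_value` is illustrative; `math.expm1` is used so that the result stays accurate for very small E.

```python
import math


def p_value(e):
    """P = 1 - exp(-E), the probability of at least one chance hit."""
    return -math.expm1(-e)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, p_value(e))
```

For E below about 0.01 the two values are practically the same, whereas for large E the P value saturates just below 1.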

73 Some More Details Nucleotide BLAST looks for exact matches:
CGTAGCTACGTAGCTACTACTACGTAC
      TACGTAGCTAC
Protein BLAST requires two neighborhood matches:
GTQITVEDLFYNI
  QIT    FYN
SEG and DUST programs are used to mask low-complexity regions in query sequences.

74 One Important Limitation of BLAST Alignments that BLAST can't find because of the minimum word size of 7 nucleotides for DNA sequences:
Query:   ATCTACTACTACTTAGATCGAGCGTACGTGTTGACACACTATCTAC
Subject: ATCTACCACTACTGAGATCGTGCGTACATGTTGAAACACTAGCTAC
The two sequences are highly similar, but they share no exact match of 7 or more consecutive nucleotides, so no initial word hit is generated.
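
The limitation can be checked directly on the two sequences from the slide. The sketch below counts identical positions and collects the words of a given size that both sequences share; the function name `shared_words` is an assumption.

```python
def shared_words(query, subject, w):
    """Return the set of words of length w that occur in both sequences."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    subject_words = {subject[j:j + w] for j in range(len(subject) - w + 1)}
    return query_words & subject_words


query = "ATCTACTACTACTTAGATCGAGCGTACGTGTTGACACACTATCTAC"
subject = "ATCTACCACTACTGAGATCGTGCGTACATGTTGAAACACTAGCTAC"

identical = sum(q == s for q, s in zip(query, subject))
print(identical, "of", len(query), "positions identical")
print("shared 7-mers:", len(shared_words(query, subject, 7)))
print("shared 6-mers:", len(shared_words(query, subject, 6)))
```

Although 40 of 46 positions are identical, the mismatches are spaced so that no common word of length 7 exists; with no seed, the extension steps never start.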

75 Untranslated and Translated BLAST Tools
Untranslated BLAST Tools
BLASTN: query DNA vs DNA database
BLASTP: query protein vs protein database
Translated BLAST Tools
BLASTX: translated query DNA vs protein database
TBLASTN: query protein vs translated DNA database
TBLASTX: translated query DNA vs translated DNA database
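
The translated tools compare sequences at the protein level after translating DNA in all six reading frames. A minimal sketch of such a six-frame translation is given below; it assumes the standard genetic code and uppercase input containing only A, C, G, and T, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the translations of the three forward and three reverse frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

In a BLASTX-like setting each of these six translations would be searched against the protein database, which is why translated searches cost more than untranslated ones.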

76 BLAST Flavors Different BLAST programs:
blastall: command-line collection of BLAST tools
BLAST: BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX
PSI-BLAST: Position-Specific Iterated BLAST
RPS-BLAST: Reverse Position-Specific BLAST
PHI-BLAST: Pattern-Hit Initiated BLAST
Mega-BLAST: 10 times faster than BLASTN
BLAST2: pairwise comparisons
WU-BLAST: Washington University BLAST

77 Sequence Formats Format Definitions: help/help/sequence formats.htm FASTA format GenBank format Other formats: EMBL, SwissProt, GCG,...


79 Online Exercise Continue on Exercise Page Link


81 References and Books
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215.
Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16.
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, Vol 5.
Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162.
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. PNAS 89.

82 Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14.
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48.
Notredame, Higgins, Heringa (2000) T-Coffee: A novel method for multiple sequence alignments. J Mol Biol 302.
Pevsner J (2009) Bioinformatics and Functional Genomics. Wiley-Blackwell, 2nd edition.
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147.
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22.


Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the JOURNAL OF BACTERIOLOGY, Sept. 1987, p. 4355-4360 0021-9193/87/094355-06$02.00/0 Copyright X) 1987, American Society for Microbiology Vol. 169, No. 9 Biosynthesis of Bacterial Glycogen: Primary Structure

More information

Supplemental Figure 1.

Supplemental Figure 1. A wt spoiiiaδ spoiiiahδ bofaδ B C D E spoiiiaδ, bofaδ Supplemental Figure 1. GFP-SpoIVFA is more mislocalized in the absence of both BofA and SpoIIIAH. Sporulation was induced by resuspension in wild-type

More information

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R AAC MGG ATT AGA TAC CCK G GGY TAC CTT GTT ACG ACT T Detection of Candidatus

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods Cell communication channel Bioinformatics Methods Iosif Vaisman Email: ivaisman@gmu.edu SEQUENCE STRUCTURE DNA Sequence Protein Sequence Protein Structure Protein structure ATGAAATTTGGAAACTTCCTTCTCACTTATCAGCCACCT...

More information

Viewing and Analyzing Proteins, Ligands and their Complexes 2

Viewing and Analyzing Proteins, Ligands and their Complexes 2 2 Viewing and Analyzing Proteins, Ligands and their Complexes 2 Overview Viewing the accessible surface Analyzing the properties of proteins containing thousands of atoms is best accomplished by representing

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

TM1 TM2 TM3 TM4 TM5 TM6 TM bp a 467 bp 1 482 2 93 3 321 4 7 281 6 21 7 66 8 176 19 12 13 212 113 16 8 b ATG TCA GGA CAT GTA ATG GAG GAA TGT GTA GTT CAC GGT ACG TTA GCG GCA GTA TTG CGT TTA ATG GGC GTA GTG M S G H V M E E C V V H G T

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION DOI:.8/NCHEM. Conditionally Fluorescent Molecular Probes for Detecting Single Base Changes in Double-stranded DNA Sherry Xi Chen, David Yu Zhang, Georg Seelig. Analytic framework and probe design.. Design

More information

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval Evolvable Neural Networs for Time Series Prediction with Adaptive Learning Interval Dong-Woo Lee *, Seong G. Kong *, and Kwee-Bo Sim ** *Department of Electrical and Computer Engineering, The University

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI:.38/NCHEM.246 Optimizing the specificity of nucleic acid hyridization David Yu Zhang, Sherry Xi Chen, and Peng Yin. Analytic framework and proe design 3.. Concentration-adjusted

More information

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel ) Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

More information

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective Jacobs University Bremen Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective Semester Project II By: Dawit Nigatu Supervisor: Prof. Dr. Werner Henkel Transmission

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi Bioinformatics Sequence Analysis An introduction Part 8 Mahdi Vasighi Sequence analysis Some of the earliest problems in genomics concerned how to measure similarity of DNA and protein sequences, either

More information

Why do more divergent sequences produce smaller nonsynonymous/synonymous

Why do more divergent sequences produce smaller nonsynonymous/synonymous Genetics: Early Online, published on June 21, 2013 as 10.1534/genetics.113.152025 Why do more divergent sequences produce smaller nonsynonymous/synonymous rate ratios in pairwise sequence comparisons?

More information