G4120: Introduction to Computational Biology

ICB Fall 2003 G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2003 Oliver Jovanovic, All Rights Reserved.

Bioinformatics and Computational Biology Internet Resources

National Center for Biotechnology Information (NCBI)
    PubMed, PubMed Central, Books and other reference material
    GenBank, RefSeq, CDD, MMDB and other sequence and structure databases
    Prokaryotic genome data and browsers (over 100 microbial, 1,000 virus and 300 plasmids)
    Eukaryotic genome data and browsers (9 complete genomes, maps and partial sequences)
    BLAST, PSI-BLAST and VAST search tools, Cn3D visualization tool
    http://www.ncbi.nlm.nih.gov/

Ensembl (EMBL-EBI/Sanger Institute)
    Eukaryotic genome data and browsers (human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae)
    http://www.ensembl.org/

UCSC Genome Bioinformatics
    Eukaryotic genome data and browsers (human, mouse, rat)
    http://genome.ucsc.edu/

European Bioinformatics Institute
    Sequence analysis tools and databases
    http://www.ebi.ac.uk/

Expert Protein Analysis System (ExPASy)
    Protein analysis and biochemical information, links to useful tools, software and references
    http://us.expasy.org/

Protein Data Bank
    Worldwide repository for 3D protein structure data and tools
    http://www.rcsb.org/pdb/

Bioinformatics and Computational Biology Software Resources

IU Bio-Archive (Macintosh, Unix and Java Molecular Biology Software)
    http://iubio.bio.indiana.edu/

Pasteur Institute Macintosh Bioinformatics Archive
    ftp://ftp.pasteur.fr/pub/gensoft/macintosh/

European Bioinformatics Institute Biology Software Directory
    http://www.ebi.ac.uk/biocat/

Apple Computer Bioinformatics Ports to Mac OS X
    http://www.apple.com/scitech/stories/osxporting/index2.html

European Molecular Biology Open Software Suite (EMBOSS)
    http://www.emboss.org/

BioTeam, Inc. Bioinformatics Tools Ports to Mac OS X
    http://bioteam.net/macosx/biotools-1/

Fink Scientific Tool Ports to Mac OS X
    http://fink.sourceforge.net/pdb/section.php/sci

SourceForge
    http://sourceforge.net/

VersionTracker
    http://www.versiontracker.com

Databases

Flat File Database (FFDB)
    A collection of similar files made useful by ordering and indexing. All the information about one sequence is stored in one structured text file, and you generally examine one file at a time.
    Examples: GenBank, FileMaker Pro

Relational Database (RDB)
    All data is stored in one or more tables of rows and columns, with all operations performed on the tables themselves or producing other tables as the result. All the information about one sequence is stored in a collection of tables alongside other data, so you can easily look at just the information relating to that sequence, or at how it relates to the database as a whole. Structured Query Language (SQL) is used to access data in a relational database.
    Examples: mSQL, MySQL, PostgreSQL, Microsoft SQL Server, Oracle

Object Oriented Database (OODB)
    Data is stored and retrieved in a fashion consistent with object oriented programming principles (based on languages such as Smalltalk, C++ or Java). These databases generally handle complex structures and concurrent interaction by multiple clients well. Many relational databases have or are acquiring object oriented database features.
    Examples: PDB, Versant VDB, Gemstone GemFire

Searching Sequence Databases

Needleman-Wunsch
    Needleman-Wunsch gives you the optimal global alignment of two sequences. This is best for comparing closely related sequences of similar lengths.
    Examples: GCG Gap, EMBOSS Needle

Smith-Waterman
    Smith-Waterman gives you the optimal local alignment of two sequences. This is better for comparing distantly related sequences (where non-functional regions may have diverged).
    Examples: GCG BestFit, EMBOSS Water

BLAST
    BLAST gives a fast approximation of Smith-Waterman, 100 to 1,000 times faster, but will not necessarily find optimal local alignments.
    Examples: NCBI BLAST, WU-BLAST
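The difference between global and local alignment can be sketched with two short dynamic programming routines. This is a minimal illustration that returns only the optimal score, not the alignment itself. The scoring scheme (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for the example, not the settings used by GCG or EMBOSS.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score: cell values never drop below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two made-up sequences that share only a short internal region.
print(needleman_wunsch_score("AAAACGTAAAA", "CCCCGTCCCC"))
print(smith_waterman_score("AAAACGTAAAA", "CCCCGTCCCC"))
```

For sequences that share only a short region, the local score stays positive while the global score is dragged down by the unrelated flanks, which is why local alignment suits distantly related sequences.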

Rules of Thumb for BLAST

The shortest possible word size (2 for proteins, 7 for nucleotides) gives the most sensitivity, though the search may take more time.
    Note: A larger word size (3 for proteins, 11 for nucleotides) is the default setting for NCBI BLAST. You will have to change it manually.

At least initially, run your search with the Low Complexity filter off. Then, if you appear to be getting spurious hits, or for comparison purposes, run it again with the filter on. Although it can be helpful, the filter can also remove a significant match.
    Note: The filter is on by default in NCBI BLAST. You will have to turn it off manually.

Keep in mind that BLAST is a heuristic version of Smith-Waterman, and may miss a significant alignment.

The default BLOSUM 62 substitution scoring matrix is best for comparing moderately distant and relatively closely related proteins. When searching for distantly related proteins, try the PAM 250 and BLOSUM 45 matrices. If comparing closely related proteins, try the PAM 1 and BLOSUM 80 matrices.

PSI-BLAST can be useful for searching for very weak protein homologies.

If searching with short DNA or protein sequences, make sure you use the appropriate "Search for short nearly exact matches" BLAST page, or apply those settings yourself. BLAST is not the best tool to use for very short sequences.

The "Limit by entrez query" option allows BLAST searches to be limited to the results of an Entrez query against the database chosen, typically one or more organisms. Common organisms are provided in a popup menu. This can yield more relevant results.
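The effect of word size on sensitivity can be illustrated by counting the exact words two sequences have in common. This is a simplified sketch: the function name and the two sequences are made up for the example, and it uses exact word matches only, whereas a real BLAST search does considerably more work after seeding.

```python
def shared_words(query, subject, word_size):
    """Return the set of exact words of length word_size found in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


# Hypothetical nucleotide sequences that differ at two positions.
query = "ACGTTGCAAGCT"
subject = "ACGATGCATGCT"
for word_size in (7, 4, 3):
    print(word_size, sorted(shared_words(query, subject, word_size)))
```

With these sequences no 7-letter word is shared, so a search seeded with 7-letter words would never examine the pair, while shorter words still produce seeds. The price is that shorter words also match unrelated sequences more often, which is where the extra search time comes from.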

BLAST vs. Smith-Waterman

BLOSUM 62 vs. PAM 250

Rules of Thumb for Significance of Protein Alignments

Protein Identity    Significance
Under 20%           Unlikely to be significant
20% to 30%          Gray zone: may or may not be significant
Over 30%            Likely to be significant

Keep in mind that when searching GenBank with a protein sequence it is possible to get results with a stretch of 20-40 amino acids with over 50% identity by chance alone.

Identity throughout an entire protein is more likely to be significant; however, homologous proteins with a very low level of identity exist. Such distant relatives can be identified through comparison to other homologous proteins.

Identity within known functional domains is more likely to be significant, and may suggest functional homology.

Definitions

Identity
    The extent to which two sequences are invariant.

Similarity
    The extent to which sequences are related, based on sequence identity and/or conservation.

Conservation
    Changes in an amino acid sequence that preserve the biochemical properties of the original residue. This is measured in most sequence comparison algorithms by substitution matrices in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.

Homology
    Similarity attributed to descent from a common ancestor. It may or may not result in similar function.

Orthologous
    Homologous sequences in different species that arose from a common ancestral gene.

Paralogous
    Homologous sequences within a single species that arose by gene duplication.
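Identity and similarity, as defined above, can be computed from a pair of already aligned sequences. The sketch below is an illustration under stated assumptions: the groups of "similar" amino acids are a hypothetical simplification (real tools score conservation with a substitution matrix such as BLOSUM 62), and percentages are taken over columns where neither sequence has a gap, which is only one of several conventions in use.

```python
# Hypothetical groups of biochemically similar amino acids, for illustration only.
CONSERVATIVE_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                       set("DE"), set("NQ"), set("ST")]


def identity_and_similarity(seq1, seq2):
    """Percent identity and percent similarity over gap-free aligned columns."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in "-." or y in "-.":
            continue  # skip columns that contain a gap character
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in CONSERVATIVE_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


print(identity_and_similarity("MKVL", "MRIL"))
```

Similarity is always at least as large as identity, because every identical column is also counted as similar.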

Multiple Sequence Alignment (MSA)

A multiple sequence alignment is an alignment of a set of sequences with structurally similar and evolutionarily homologous residues aligned in columns.

In an ideal alignment, columns of aligned amino acid residues would have similar locations in the 3D structure of a protein and would have diverged from a common ancestral residue.

In theory, an unambiguously correct evolutionary alignment exists, but it can be difficult to infer and computationally intensive to calculate.

Where structural data is lacking or limited, as is generally the case, it is not possible to unambiguously identify structurally similar positions. Thus, defining a single unambiguous ideal alignment can be very difficult.

Multiple Sequence Alignment Algorithms

Dynamic Programming vs. Heuristic Alignment
    Using dynamic programming algorithms (such as Smith-Waterman or Needleman-Wunsch) to perform an optimal alignment of more than a few sequences is computationally intensive, and generally impractical for large sets of sequences or lengthy sequences. As a result, most commonly used multiple sequence alignment algorithms take a heuristic approach. One common heuristic approach is progressive alignment, in which the problem is broken down into a series of pairwise alignments. The details of how to choose the initial pair to align, how to score alignments, how to align subsequent sequences, and whether subfamilies of alignments should be created can all vary.

MSA (Dynamic)
    This algorithm uses a technique that reduces the complexity of dynamic programming when applied to multiple sequences, and can give an optimal alignment for seven short (200-300 aa) protein sequences in a reasonable amount of time. For alignments with more or longer sequences, a heuristic approach is more practical.

Feng-Doolittle (Heuristic)
    One of the first progressive alignment algorithms. It does not take advantage of profiles, which can increase the accuracy of the alignment.

ClustalW (Heuristic)
    This profile based progressive alignment algorithm uses a number of heuristics to generate multiple sequence alignments, including phylogeny and scalable gap penalties.
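A rough calculation shows why unrestricted dynamic programming does not scale to many sequences. The sketch below counts the cells in a full dynamic programming lattice, one axis per sequence. It assumes no pruning at all, so it is an upper bound on the table size rather than a description of how the MSA program works; the function name is made up for the example.

```python
def dp_cells(lengths):
    """Number of cells in a full dynamic programming lattice for the given sequence lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # each axis has one extra row or column for the empty prefix
    return cells


# Table size for 2, 3 and 7 sequences of 300 residues each.
for n_seqs in (2, 3, 7):
    print(n_seqs, dp_cells([300] * n_seqs))
```

Each added sequence multiplies the table size by roughly the sequence length, which is why progressive methods replace the single large problem with a series of pairwise ones.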

Multiple Sequence Alignment with Text

Use Fixed Width or Monospaced Fonts
    Each character in the font takes up the same amount of horizontal space, allowing multiple sequence alignments to properly align.
    Examples: Andale Mono, Courier, Courier New, Monaco, V100

Fixed Width Font Alignment (Courier):
    . . . m s h N q f q f i G n L t r D
    M A s R G v N K V I L V G n L G q D
    M A v R G I N K V I L V G R L G k D

Variable Width Font Alignment (Times):
    . . . m s h N q f q f i G n L t r D
    M A s R G v N K V I L V G n L G q D
    M A v R G I N K V I L V G R L G k D
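When an alignment is prepared as plain text, the columns only line up if every sequence starts at the same character offset. A small helper can take care of the padding and of breaking long alignments into blocks. This is a sketch, with a made-up function name and block width; the sequences are the three fragments from the example above, and the names come from the alignment on the following slide.

```python
def format_alignment(rows, block=60):
    """Lay out (name, aligned_sequence) pairs in blocks of `block` columns.

    Names are padded to a common width, so columns line up in any fixed width font.
    """
    width = max(len(name) for name, _ in rows) + 2
    length = max(len(seq) for _, seq in rows)
    lines = []
    for start in range(0, length, block):
        for name, seq in rows:
            lines.append(name.ljust(width) + seq[start:start + block])
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip()


rows = [
    ("RK2", "...mshNqfqfiGnLtrD"),
    ("E. coli", "MAsRGvNKVILVGnLGqD"),
    ("F", "MAvRGINKVILVGRLGkD"),
]
print(format_alignment(rows, block=10))
```

The output still has to be displayed in a fixed width font. In a variable width font the padding no longer corresponds to a fixed horizontal distance and the columns drift apart.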

Multiple Sequence Alignment with Excel

1 50
RK2       ... m s h N q f q f i G n L t r D t E V R h g n s n k p q A i f d i A v n E e W R n d a. G d k
E. coli   M A s R G v N K V I L V G n L G q D P E V R Y m P N G G A V A N i t l A T S E S W R D K a T G E M
F         M A v R G I N K V I L V G R L G k D P E V R Y I P N G G A V A N L Q V A T S E S W R D K Q T G E M
ColIb-P9  M s a R G I N K V I L V G R L G n D P E V R Y I P N G G A V A N L Q V A T S E S W R D K Q T G E M
R64       M s a R G I N K V I L V G R L G n D P E V R Y I P N G G A V A N L Q V A T S E S W R D K Q T G E M
pip71a    M A v R G I N K V I L V G R L G k D P E V R Y I P N G G A V A N L Q V A T S E S W R D K Q T G E i
pip231a   M A v R G I N K V I L V G R L G k D P E V R Y I P N G G A V A N L Q V A T S E t W R D K Q T G K M

51 100
RK2       q E r T d f f R i k c F G s q A E a h G k Y L g K G s l V f v q G k i R n t k y E k d. G q T v Y
E. coli   k E Q T E W H R V V L F G K L A E V A s E Y L R K G s Q V Y I E G Q L R T R k W t D q s G q d R Y
F         R E Q T E W H R V V L F G K L A E V A G E c L R K G A Q V Y I E G Q L R T R S W E D N. G I T R Y
ColIb-P9  R E Q T E W H R V V L F G K L A E V A G E Y L R K G A Q V Y I E G Q L R T R S W d D N. G I T R Y
R64       R E Q T E W H R V V L F G K L A E V A G E Y L R K G A Q V Y I E G Q L R T R S W d D N. G I T R Y
pip71a    R E Q T E W H R V V L F G K L A E V A G E Y L R K G A Q V Y I E G Q L R T R S W E D N. G I T R Y
pip231a   R E Q T E W H R V V L F G K L A E V A G E Y L R K G A Q V Y I E G Q L R T R S W E D N. G I T R Y

101 150
RK2       g T d f.. i a d k v d y l d t k A p G g s n Q e........................
E. coli   t T E v v V n v g G T M Q M L G g r q G g g a p a g g n i g g. G Q P Q s g w g q p q q p q g G n
F         v T E I L V K T T G T M Q M L v r A a G a q t Q p e e g q Q f s G Q P Q p e p q a E a g t K K G G
ColIb-P9  i T E I L V K T T G T M Q M L G s A p q q n a Q a q p k p Q q n G Q P Q s a d a t.... K K G G
R64       i T E I L V K T T G T M Q M L G s A p q q n a Q a q p k p Q q n G Q P Q s a d a t.... K K G G
pip71a    v T E I L V K T T G T M Q M L G r A a G t q t Q p e e a q Q f s G Q P Q p e s q p E p.. K K G G
pip231a   v T E I L V K T T G T M Q M L G r A a G a q t Q p e e g q Q s a. Q P Q p e p q s E a g t K K G G

151 181   % Identity   % Similarity
RK2       .................................   100.0   100.0
E. coli   q f s g G a q s r p q Q s a P a a p s n E p p m d f d. D D I P F   32.8   56.0
F         A K T K G R g R K A A Q P E P Q p Q p P E G d D Y G F S D D I P F   28.5   54.3
ColIb-P9  A K T K G R g R K A A Q P E P Q p Q t P E G e D Y G F S D D I P F   30.2   52.6
R64       A K T K G R e R K A A Q P E P Q p Q t P E G e D Y G F S D D I P F   30.2   52.6
pip71a    A K T K G R e R K A A Q P E P r q p s e p a.. Y D F d D D I P F   29.3   55.2
pip231a   A K T K G R g R K V A Q P E P Q l Q p P E G d D Y G F S D D I P F   29.3   54.3

This approach can be used with any font, as Excel allows you to manually adjust the alignment.

ClustalW and ClustalX

ClustalW
    ClustalW first generates a pairwise distance matrix for all the sequences by pairwise dynamic programming alignment. It then estimates evolutionary distance from similarity scores and constructs a guide tree using the neighbor joining distance matrix method. Dynamic programming is then used to align the most closely related pairs of sequences. A sequence profile is constructed from these alignments, and the remaining sequences are progressively aligned to each other in order of decreasing similarity by profile-profile, profile-sequence or sequence-sequence alignment, until a complete multiple sequence alignment has been generated.

    ClustalW automatically chooses the optimal scoring matrix for protein alignments based on whether the sequences are close or distant neighbors in the tree. Thus it might use BLOSUM 62 (optimal for close relationships) for close neighbors, and BLOSUM 45 (optimal for distant neighbors) for distant neighbors.

    ClustalW also allows for scalable gap penalties in protein profile alignments. A gap opening next to a highly conserved residue can be more heavily penalized than a gap opening next to an unconserved residue, for example.

ClustalX
    This is a version of ClustalW with a graphical user interface, which is more intuitive to use, though the formatting requirements for input files need to be followed closely. It can display multiple sequence alignments onscreen, or output them as PostScript, which can easily be converted to PDF format by ESP GhostScript with GStill.
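The first stage described above, building a pairwise distance matrix and finding the most closely related pair, can be sketched in a few lines. Note the substitution: Python's `difflib.SequenceMatcher` stands in for pairwise dynamic programming alignment, so the distances are only a rough proxy, and the guide tree (neighbor joining) and profile alignment stages are not shown. The function names and toy sequences are assumptions for illustration, not ClustalW's implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance_matrix(seqs):
    """Symmetric matrix of 1 - similarity for every pair of sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Stand-in similarity measure; ClustalW uses pairwise alignment scores here.
        similarity = SequenceMatcher(None, seqs[i], seqs[j], autojunk=False).ratio()
        dist[i][j] = dist[j][i] = 1.0 - similarity
    return dist


def closest_pair(dist):
    """Indices of the two sequences with the smallest distance."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda pair: dist[pair[0]][pair[1]])


# Toy sequences, made up for the example.
seqs = ["AAAA", "AAAT", "CCCC"]
dist = distance_matrix(seqs)
print(closest_pair(dist))
```

In a progressive aligner the closest pair is aligned first, and the remaining sequences are added in the order dictated by the guide tree.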

Multiple Sequence Alignment with ClustalX

CLUSTAL X (1.82) MULTIPLE SEQUENCE ALIGNMENT
File: tadafasta.ps    Date: Wed Apr 2 12:19:01 2003    Page 1 of 2

             ::.. : * :. : ::: * **:. *:
V_fisch1     ----------------MDQNKSIYIEIRAQIFDVLD--AETVN---------------------SLSKE--QLHNQLSN--------------------------------AIDLLIERHEWPVSTIVRAEYVTSLVNELQGLGPLQVLM 77
V_fisch2     ----------------MNNNKALYIQLRTQIFNALE--PEALN---------------------KLTKQ--ELTQQLSN--------------------------------AVDLLIDREQLPVSLIMKNEYVESLVNELVGLGPLQNLM 77
V_vulnII1_6  ----------------MNQLKQIYLDLRDEIFDAID--ASTLS---------------------EISNE--ELAEQLSE--------------------------------SVNILIDKKQLQVSSLKRAELVKALYDELKGLGPLQKLV 77
Y_pes        ----------------MIVPLKIQELMRERMLANID--INKVE---------------------LLVGDRNKLIGLLSQ--------------------------------TFDDLFNNNEYNLTTQAQKYIIEMIADEITGFGPLRELM 79
Y_ent        ------------------------------MLASID--IDQVQ---------------------YLVDDYSKLSELLSQ--------------------------------TLDELFNNNDYKLTTQDQKKIITMIADEITGFGPLRELM 65
A_act        -----------------MLTKQQKILLRSEVLSNLD--IEKID---------------------ELQSERSSLVNELVQ--------------------------------IVNRVANKSGAYLTSADTLVMAEIVADEIEGYGPLRDLM 78
H_aph        -----------------MLTKEQQIFLRSEVLSNLD--IEKID---------------------ALQSERNLLVNELVQ--------------------------------IVNRVASKSGTYLTSADTLVMAEIVADEIEGYGPLRDLM 78
P_mul        -----------------MLTKEQQVFFRNELLSNLD--IEKID---------------------EIQSERDKLVDELVQ--------------------------------VVYKVAGKGNIYITSADALFMAECIADEIDGYGPIRELM 78
H_duc        -----------------MLTKDQQVFFRNALLSNLN--VDTLD---------------------EIENERSKLVTELTQ--------------------------------SLYRVANTNNIYITPYDATDMAEIVADEIGGYGPIRELM 78
A_pleur      -----------------MLTKEQQIFFRTELLSNLD--VEKLD---------------------EIQNERNKLIDELTQ--------------------------------SLYRISNLHSIYLTPADAAYMAGLVADEIGGYGPIRELM 78
V_vulnI8_11  MFGN--------KTQMVNVSRGNPLVMPEAAQTAFEKLIEPSE---------------------AVKLTRKQLQQEIKK-------------------------------AVAQLSAQ-QLLPYNQSELAILVEQLCDDMLGVGPIQCLV 89
V_vulnI6_11  MFFKRKNINPEFQEKAAALEAQPSSTISDEVISDIESNVQPIDSNRVEPMQQDKKLLERQAKDKAVEEARKQLEQELAIKHYYHQRLLETLDLGLLSSLEKERAKKDLHDAIVQLMAEDQTHPMSSEGRKRVIKQIEDEVFGLGPLEPLL 150
ruler        1...10...20...30...40...50...60...70...80...90...100...110...120...130...140...150

             : :.**::** :::* * : *.. :* :.*:..**:*:. * *:** ****:* *:*::*. :***:*. : :.::: : :.. **:::****:**** ** :* * :*::
V_fisch1     EDESISDIMINGYDKIFIERAGLVEVAPVSFIDEEQLLHIAKRVASQVGRRVDDSSPTCDARLADGSRVNIVIPPIAIDGTSMSIRKFKKDSIGLEKLTEFGALSQEMAQLLMIASRCRLNILISGGTGSGKTTMLNALSQYISEKERIV 227
V_fisch2     DDETITDIMINGHENVFIERDGLVEKVSVNFIDEQQLIDIAKRIASRVGRRVDESSPTCDARLEDGSRVNIVIPPIAIDGTSISIRKFKKQSIAFSDLVEFGAMSKEMAQILMVASRCRLNILISGGTGSGKTTMLNALSQFISEGERIV 227
V_vulnII1_6  ENDDISDIMINGPYDVFIEIGGKVEKSPIQFVNEKQLNTIAKRIASNVGRRIDESSPLCDARLKDGSRVNIVIPPLAIDGTSISIRKFKEQKIKLENLVEFGAMSIEMAKLLSIASHCKCNILISGGTGSGKTTLLNALSGFIGEGERVV 227
Y_pes        EDDSISDIMVNGPERIFIERYGLLKLTDRRFVNNTQLTDIAKRLMQKVNRRIDEGRPLADARLIDGSRINVAISPIALDGTALSIRKFSKNKRRLEDLVDMGAMSSDMANFLIIAASCRVNIIISGGTGSGKTTLLNALSKYISEDERVI 229
Y_ent        EDDSISDIMVNGPEKIFIERFGMITLTSRRFINNAQLTDIAKRLMQRANRRIDEGRPLADARLIDGSRINVAISPIALDGTVLSIRKFSNNKRKLEDLVEMGAMSSDMANFLIIAASCRVNIIISGGTGSGKTTLLNALSMYISENERVI 215
A_act        ADDTINDILVNGPNDIWVERAGILEKTDKEFVSNEQLTDIAKRLVARVGRRIDDGSPLVDSRLPDGSRLNAVIAPIALDGTSISIRKFSKNKKTLQELVNFGSMTRNGE-FLNYCCRSRVNIIVSGGTGSGKTTLLNALSNYISHTERVI 227
H_aph        ADDTINDILVNGPDDVWIERAGILEKTSKEFVSNEQLTDIAKRLVARVGRRIDDGSPLVDSRLPDGSRLNVVIAPIALDGTSVSIRKFSKNKKTLQELVNFGSMTREMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSNYISHSERVI 228
P_mul        EDETVNDILVNGPDDVWVERAGILEKTDKKFISNEQLTDIAKRLVAKVGRRIDDGSPLVDSRLPDGSRLNVVIAPIALDGTSISIRKFSKSKKSLQELVNFGSMTREMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSNYISPKERVI 228
H_duc        EDDTVNDILVNGPDNIWIERAGVLEKTNKTFINNEQLTDIAKRLVARVGRRIDEGMPLVDSRLPDGSRLNVVIQPIALDGTSISIRKFSKSKKSLQELVNFGSMTLDMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSSYISPTERVL 228
A_pleur      EDEGVNDILVNGPDNIWVERAGILEKTDKKFINNEQLTDIAKRLVARVGRRIDEGMPLVDSRLPDGSRLNVVIQPIALDGTSISIRKFSKSKKSLQDLVNYGSMTLDMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSHYISHTERVL 228
V_vulnI8_11  EDPSVSDILVNGPEQIYIERQGKLLKTDIRFRDKKHLLNVAQRIVNAVGRRLDESTPLVDARLEDGSRVNIIAPPLALNGVCISIRKFPERQYDLPGLVAFGSLSEEMAQCLALAARCRLNILVSGGTGAGKTTLLNAMSTPISDDERII 239
V_vulnI6_11  HDKTVSDILVNGPKNIFVERRGKLEKTPYTFLDDRHLRNIIDRIVSQVGRRIDEASPMVDARLLDGSRVNAIIPPLALDGASVSIRRFAVDKLTMDNMLGYNSLSPQMAKFVEAAVKGELNILIAGGTGSGKTTTLNIFSGFIPSDDRII 300
ruler        ...160...170...180...190...200...210...220...230...240...250...260...270...280...290...300

             *:**:*** * :** :::***.. *.* :: :*** *:*****:**::** ** *:.:** ******:**:.*:***:* ** * * *. *. :. :* * **:.:::* *.**.*:: * : *:* : : ::
V_fisch1     TIEDAAELKLLQPHVVRLETRNSGIEGNGAITQQDLVINALRMRPDRIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANTPRDAMARVEAMVMMASNNLPLEAIRRTIVSAVDIVIQISRLHDGSRKVMSITEVIGLEGNNVVLEELYKF 377
V_fisch2     TIEDAAELKLQQPHVVRLETRTSGIEGTGVVSQRDLVINSLRMRPDRIIVGECRGGEAFEMLQAMNTGHDGSMSTLHANSPRDALSRVEAMVMMATNNLPLEAVRRTIVSAVDIVIQISRLHDGTRKVMSISEVVGLEGNNVVLEEIFAF 377
V_vulnII1_6  TIEDAAELQLQKPHIVRLETRQASVEGTGQITARDLVINALRMRPDRIIVGECRGAEAFEMLQAMNTGHDGSMSTLHANTPRDAIARTESMVMMATASLPLEAIRRTIVSAVDLIVQVRRLHDGSRKVMYISEIVGLEGNNVVMEDIFRF 377
Y_pes        TLEDAAELNLEQPHVVRMETRLAGLENTGQITMRDLVINSLRMRPDRIIIGECRGEETFEMLQAMNTGHNGSMSTLHANTPRDAVARLESMIMMGPVNMPLITIRRNIASAINLIVQVSRMNDGSRKIRNISEIMGMEGEHVVLQDIFTF 379
Y_ent        TLEDAAELNLEQPHVVRMETRLAGLENTGQITMRDLVINSLRMRPDRIIIGECRGEETFEMLQAMNTGHNGSMSTLHANTPRDAVARLESMIMMGPVNMPILTIRRNIASAINLIVQVSRMNDGSRKLSHISEIMGMEGDNVILQDIFSF 365
A_act        TLEDTAELRLEQPHVVRLETRLAGVEHTGEVTMQDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATSRLESMVMMSNASLPLEAIRRNISSAVNIIVQASRLNDGSRKIMNITEVMGMENGQIVLQDMFSY 377
H_aph        TLEDTAELRLEQPHVVRLETRLAGVEHTGEVTMKDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATSRLESMVMMSNATLPLEAIRRNIASAVNIIVQASRLNDGSRKIVNITEIMGMENGQIVLQDIFSY 378
P_mul        TLEDTAELRLEQPHVVRLETRLAGVERTGEITMQDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATARLESMVMMSNASLPLEAIRRNIASAVNIIVQASRLNDGSRKIMNITELMGMENGQIVMQDIFSY 378
H_duc        TLEDTAELRLEQPHVVRLETRLAGVERTGEITMQDLVINALRMRPERIIVGECRGAEAFQMLQAMNTGHDGSMSTLHANTPRDATARLESMVMMSNASLPLEAIRRNIASAVNIIIQASRLNDGSRKVMNITEVMGMENGQIVLQDIFSF 378
A_pleur      TLEDTAELRLEQPHVVRLETRLAGVERTGEISMQDLVINALRMRPERIIVGECRGAEAFQMLQAMNTGHDGSMSTLHANSPRDALARLESMVMMSNASLPLEAIRRNIASAVNIIIQASRLNDGSRKVTNITEVMGMENGQIVLQDIFSY 378
V_vulnI8_11  TIEDAAELSLTQPHWIQLETRTASSEGTGAVTVRDLVKNALRMRPDRIILGEVRGAEAFDMLQAMNTGHDGSLCTLHANSPADAMLRLENMLMMGAEQIPSAVLRQQISSALDLVVQLERSHDGKRRVTAISAVGGIEQGQIVVHPLFEC 389
V_vulnI6_11  TIEDSAELQLQQPHVVRLETRPPNLEGKGEITQRDLVKNALRMRPDRIVLGEVRGAEAVDMLAAMNTGHDGSLATIHANTPRDALSRVENMFAMAGWNISTKNLRAQIASAIHLVVQMERQEDGKRRMVSIQEINGMEGEIITMSEIFHF 450
ruler        ...310...320...330...340...350...360...370...380...390...400...410...420...430...440...450

Displaying Sequence Data

Displaying Information
    Take care with your choice of fixed or variable width fonts. Use fonts carefully and consistently. Avoid using fonts arbitrarily.
    Use black or dark text against a white or very light background (no more than 20% color) to maximize comprehension. Avoid text that blends with a background, and be cautious in using light text on a dark background.
    Use shading, case, bold, italic or color when appropriate, to add emphasis, contrast, or draw attention to a feature. Avoid displays in which everything blends together or lacks contrast.
    Align items to each other to establish a visual connection. Related items should be grouped in close proximity. Avoid simply placing items arbitrarily.
    Use color logically and aesthetically. Avoid the overuse of color.

References
    The Mac is Not a Typewriter by Robin Williams
    The Non-Designer's Design Book by Robin Williams
    The Visual Display of Quantitative Information by Edward R. Tufte
    Type & Layout by Colin Wheildon