G4120: Introduction to Computational Biology

Bioinformatics and Computational Biology Internet Resources

National Center for Biotechnology Information (NCBI)
  PubMed, PubMed Central, Books and other reference material
  GenBank, RefSeq, CDD, MMDB and other sequence and structure databases
  Prokaryotic genome data and browsers (over 100 microbial genomes, 1,000 viruses and 300 plasmids)
  Eukaryotic genome data and browsers (9 complete genomes, maps and partial sequences)
  BLAST, PSI-BLAST and VAST search tools, Cn3D visualization tool
  http://www.ncbi.nlm.nih.gov/

Ensembl (EMBL-EBI/Sanger Institute)
  Eukaryotic genome data and browsers (human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae)
  http://www.ensembl.org/

UCSC Genome Bioinformatics
  Eukaryotic genome data and browsers (human, mouse, rat)
  http://genome.ucsc.edu/

European Bioinformatics Institute
  Sequence analysis tools and databases
  http://www.ebi.ac.uk/

Expert Protein Analysis System (Expasy)
  Protein analysis and biochemical information, links to useful tools, software and references
  http://us.expasy.org/

Protein Data Bank
  Worldwide repository for 3D protein structure data and tools
  http://www.rcsb.org/pdb/

Bioinformatics and Computational Biology Software Resources

IU Bio-Archive (Macintosh, Unix and Java Molecular Biology Software)
  http://iubio.bio.indiana.edu/

Pasteur Institute Macintosh Bioinformatics Archive
  ftp://ftp.pasteur.fr/pub/gensoft/macintosh/

European Bioinformatics Institute Biology Software Directory
  http://www.ebi.ac.uk/biocat/

Apple Computer Bioinformatics Ports to Mac OS X
  http://www.apple.com/scitech/stories/osxporting/index2.html

European Molecular Biology Open Software Suite (EMBOSS)
  http://www.emboss.org/

BioTeam, Inc. Bioinformatics Tools Ports to Mac OS X
  http://bioteam.net/macosx/biotools-1/

Fink Scientific Tool Ports to Mac OS X
  http://fink.sourceforge.net/pdb/section.php/sci

SourceForge
  http://sourceforge.net/

VersionTracker
  http://www.versiontracker.com

Databases

Flat File Database (FFDB)
A collection of similar files made useful by ordering and indexing. All the information about one sequence is stored in one structured text file, and you generally examine one file at a time.
Examples: GenBank, FileMaker Pro

Relational Database (RDB)
All data is stored in one or more tables of rows and columns, and every operation either acts on the tables themselves or produces other tables as its result. The information about one sequence is spread across a collection of tables that also hold other data, so you can easily look at just the information relating to that sequence, or at how it relates to the database as a whole. Structured Query Language (SQL) is used to access data in a relational database.
Examples: msql, MySQL, PostgreSQL, Microsoft SQL Server, Oracle

Object Oriented Database (OODB)
Data is stored and retrieved in a fashion consistent with object oriented programming principles (based on languages such as Smalltalk, C++ or Java). These databases generally handle complex structures and concurrent interaction by multiple clients well. Many relational databases have, or are acquiring, object oriented database features.
Examples: PDB, Versant VDB, Gemstone GemFire
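To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema, the accession values and the feature names are all hypothetical, chosen only to show how information about one sequence is spread over tables and pulled back together with SQL.

```python
import sqlite3

# Hypothetical schema: one row per sequence, any number of features per sequence.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        residues  TEXT NOT NULL
    );
    CREATE TABLE feature (
        accession TEXT NOT NULL REFERENCES sequence(accession),
        name      TEXT NOT NULL,
        start_pos INTEGER NOT NULL,
        end_pos   INTEGER NOT NULL
    );
""")

# Made-up records, for illustration only.
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("DEMO001", "organism A", "MASRGVNKVILVGNLGQD"),
    ("DEMO002", "organism B", "MAVRGINKVILVGRLGKD"),
])
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)", [
    ("DEMO001", "example_domain", 5, 15),
    ("DEMO002", "example_domain", 5, 15),
    ("DEMO002", "example_motif", 16, 18),
])


def features_for(accession):
    """Look at just the information relating to one sequence."""
    return conn.execute(
        "SELECT name, start_pos, end_pos FROM feature "
        "WHERE accession = ? ORDER BY start_pos",
        (accession,),
    ).fetchall()


def feature_counts():
    """Relate every sequence to the database as a whole with a join."""
    return conn.execute(
        "SELECT s.organism, COUNT(f.name) FROM sequence s "
        "LEFT JOIN feature f ON f.accession = s.accession "
        "GROUP BY s.accession ORDER BY s.accession"
    ).fetchall()


if __name__ == "__main__":
    print(features_for("DEMO002"))
    print(feature_counts())
```

Both queries return tables (lists of rows), which is the point of the relational model: the result of an operation on tables is itself a table.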

Searching Sequence Databases

Needleman-Wunsch
Needleman-Wunsch gives you the optimal global alignment of two sequences. This is best for comparing closely related sequences of similar lengths.
Examples: GCG Gap, EMBOSS Needle

Smith-Waterman
Smith-Waterman gives you the optimal local alignment of two sequences. This is better for comparing distantly related sequences (where non-functional regions may have diverged).
Examples: GCG BestFit, EMBOSS Water

BLAST
BLAST is a fast heuristic approximation of Smith-Waterman, roughly 100 to 1,000 times faster, but it will not necessarily find the optimal local alignment.
Examples: NCBI BLAST, WU-BLAST
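The difference between global and local alignment is easiest to see in the dynamic programming recurrence itself. The sketch below computes only the optimal score (no traceback) and assumes a simple match/mismatch score with a linear gap penalty. These parameter values are illustrative and are not the defaults of GCG or EMBOSS, which use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b with a linear gap penalty.

    local=False gives the Needleman-Wunsch (global) score.
    local=True gives the Smith-Waterman (local) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both sequences end to end.
    # Local: best-scoring cell anywhere in the matrix.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))                          # global
    print(alignment_score("TTTACGTTTT", "GGACGGG", local=True))    # local
```

The only differences between the two algorithms are the initialization of the first row and column, the floor at zero, and where the final score is read from.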

Rules of Thumb for BLAST

The shortest possible word size (2 for proteins, 7 for nucleotides) gives the most sensitivity, though the search may take more time.
  Note: A larger word size (3 for proteins, 11 for nucleotides) is the default setting for NCBI BLAST. You will have to change it manually.

At least initially, run your search with the Low Complexity filter off. Then, if you appear to be getting spurious hits, or for comparison purposes, run it again with the filter on. Although the filter can be helpful, it can also filter out a significant match.
  Note: The filter is on by default in NCBI BLAST. You will have to turn it off manually.

Keep in mind that BLAST is a heuristic version of Smith-Waterman, and may miss a significant alignment.

The default BLOSUM 62 substitution scoring matrix is best for comparing moderately distant and relatively closely related proteins. When searching for distantly related proteins, try the PAM 250 and BLOSUM 45 matrices. If comparing closely related proteins, try the PAM 1 and BLOSUM 80 matrices.

PSI-BLAST can be useful for searching for very weak protein homologies.

If searching with short DNA or protein sequences, make sure you use the appropriate "Search for short nearly exact matches" BLAST page, or apply those settings yourself. BLAST is not the best tool to use for very short sequences.

The "Limit by Entrez query" option allows BLAST searches to be limited to the results of an Entrez query against the chosen database, typically one or more organisms. Common organisms are provided in a popup menu. This can yield more relevant results.
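The effect of word size on sensitivity can be illustrated with a toy seeding step. This sketch only counts exact word matches between two sequences; that is a deliberate simplification, since protein BLAST also accepts similar "neighborhood" words scored with a substitution matrix. The function name is hypothetical, and the two sequences are the first 18 residues of two rows from the Excel alignment later in these notes, written in uppercase.

```python
def seed_hits(query, subject, size):
    """List (query_pos, subject_pos) pairs where an exact word of `size` matches."""
    # Index every overlapping word of the query by its start position.
    index = {}
    for i in range(len(query) - size + 1):
        index.setdefault(query[i:i + size], []).append(i)

    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - size + 1):
        for i in index.get(subject[j:j + size], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query = "MASRGVNKVILVGNLGQD"
    subject = "MAVRGINKVILVGRLGKD"
    for size in (2, 3):
        print(size, len(seed_hits(query, subject, size)))
```

With these two sequences the shorter word size produces more seeds, which is why it is more sensitive and also why the search has more candidate alignments to extend.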

BLAST vs. Smith-Waterman

BLOSUM 62 vs. PAM 250

Rules of Thumb for Significance of Protein Alignments

Protein Identity    Significance
Under 20%           Unlikely to be significant
20% to 30%          Gray zone; may or may not be significant
Over 30%            Likely to be significant

Keep in mind that when searching GenBank with a protein sequence it is possible to get results with a stretch of 20-40 amino acids with over 50% identity by chance alone.

Identity throughout an entire protein is more likely to be significant. However, homologous proteins with a very low level of identity exist. Such distant relatives can be identified through comparison to other homologous proteins.

Identity within known functional domains is more likely to be significant, and may suggest functional homology.
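The table above can be applied mechanically once percent identity has been computed from an alignment. The sketch below is one possible reading of it. It assumes that identity is counted over aligned columns where neither sequence has a gap; other conventions exist and give different percentages. The function names and zone labels are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator.
    Returns (percent, number_of_columns_compared).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x.upper() == y.upper():
            identical += 1
    percent = 100.0 * identical / compared if compared else 0.0
    return percent, compared


def identity_zone(percent):
    """Map a percent identity onto the rule-of-thumb table."""
    if percent < 20:
        return "unlikely"
    if percent <= 30:
        return "gray zone"
    return "likely"


if __name__ == "__main__":
    pid, length = percent_identity("MASRGVNKVILVGNLGQD", "MAVRGINKVILVGRLGKD")
    print(round(pid, 1), length, identity_zone(pid))
```

The number of compared columns is returned alongside the percentage because, as noted above, a high identity over only 20-40 residues can arise by chance, so the zone should not be read without the length.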

Definitions

Identity - the extent to which two sequences are invariant.

Similarity - the extent to which sequences are related, based on sequence identity and/or conservation.

Conservation - changes in an amino acid sequence that preserve the biochemical properties of the original residue. This is measured in most sequence comparison algorithms by substitution matrices in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.

Homology - similarity attributed to descent from a common ancestor. It may or may not result in similar function.

Orthologous - homologous sequences in different species that arose from a common ancestral gene.

Paralogous - homologous sequences within a single species that arose by gene duplication.
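As a rough illustration of how similarity extends identity, the sketch below counts a column as similar when the two residues are identical or fall in the same biochemical group. The grouping is a hypothetical, coarse stand-in for conservation; as the definition above says, real tools score conservation with a substitution matrix such as BLOSUM or PAM rather than with fixed groups.

```python
# Hypothetical coarse residue groups, for illustration only.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "C", "G", "P"]
GROUP_OF = {residue: index for index, group in enumerate(GROUPS) for residue in group}


def percent_similarity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped columns that are identical or conservative."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        same_group = x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y)
        if x == y or same_group:
            similar += 1
    return 100.0 * similar / compared if compared else 0.0


if __name__ == "__main__":
    print(round(percent_similarity("MASRGVNKVILVGNLGQD", "MAVRGINKVILVGRLGKD"), 1))
```

In this example the V/I column counts toward similarity but not identity, because both residues are in the same assumed group.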

Multiple Sequence Alignment (MSA)

A multiple sequence alignment is an alignment of a set of sequences with structurally similar and evolutionarily homologous residues aligned in columns.

In an ideal alignment, columns of aligned amino acid residues would have similar locations in the 3D structure of a protein and would diverge from a common ancestral residue.

In theory, an unambiguously correct evolutionary alignment exists, but it can be difficult to infer and computationally intensive to calculate.

Where structural data is lacking or limited, as is generally the case, it is not possible to unambiguously identify structurally similar positions. Thus, defining a single unambiguous ideal alignment can be very difficult.

Multiple Sequence Alignment Algorithms

Dynamic Programming vs. Heuristic Alignment
Using dynamic programming algorithms (such as Smith-Waterman or Needleman-Wunsch) to perform an optimal alignment of more than a few sequences is computationally intensive, and generally impractical for large sets of sequences or lengthy sequences. As a result, most commonly used multiple sequence alignment algorithms take a heuristic approach.

One common heuristic approach is progressive alignment, in which the problem is broken down into a series of pairwise alignments. The details of how to choose the initial pair to align, how to score alignments, how to align subsequent sequences, and whether subfamilies of alignments should be created can all vary.

MSA (Dynamic)
This algorithm uses a technique that reduces the complexity of dynamic programming when applied to multiple sequences, and can give an optimal alignment for seven short (200-300 aa) protein sequences in a reasonable amount of time. For alignments with more or longer sequences, a heuristic approach is more practical.

Feng-Doolittle (Heuristic)
One of the first progressive alignment algorithms. It does not take advantage of profiles, which can increase the accuracy of the alignment.

ClustalW (Heuristic)
This profile-based progressive alignment algorithm uses a number of heuristics to generate multiple sequence alignments, including phylogeny and scalable gap penalties.
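A back-of-the-envelope calculation shows why full dynamic programming does not scale. Aligning N sequences at once fills an N-dimensional matrix with one cell for every combination of positions, whereas a progressive method starts from N(N-1)/2 ordinary two-dimensional pairwise alignments. The sketch below counts cells only. It is an upper bound for exhaustive dynamic programming and says nothing about the pruning that the MSA program uses to reduce this cost. The function names are hypothetical.

```python
from math import prod


def full_dp_cells(lengths):
    """Cells in the N-dimensional matrix for simultaneous alignment."""
    return prod(n + 1 for n in lengths)


def pairwise_dp_cells(lengths):
    """Total cells over all pairwise alignments of the same sequences."""
    total = 0
    for i in range(len(lengths)):
        for j in range(i + 1, len(lengths)):
            total += (lengths[i] + 1) * (lengths[j] + 1)
    return total


if __name__ == "__main__":
    lengths = [250] * 7   # seven short protein sequences
    print("simultaneous:", full_dp_cells(lengths))
    print("all pairs:   ", pairwise_dp_cells(lengths))
```

For seven sequences of 250 residues the simultaneous matrix has more than 10^16 cells, while all 21 pairwise matrices together have about 1.3 million.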

Multiple Sequence Alignment with Text

Use Fixed Width or Monospaced Fonts
Each character in a fixed width font takes up the same amount of horizontal space, which lets the columns of a multiple sequence alignment line up properly.
Examples: Andale Mono, Courier, Courier New, Monaco, V100

Fixed Width Font Alignment (Courier):
...mshNqfqfiGnLtrD
MAsRGvNKVILVGnLGqD
MAvRGINKVILVGRLGkD

Variable Width Font Alignment (Times):
...mshNqfqfiGnLtrD
MAsRGvNKVILVGnLGqD
MAvRGINKVILVGRLGkD
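When an alignment is written out as plain text, the sequence names also need to be padded to a common width, or the residue columns will not line up even in a monospaced font. The following is a minimal sketch of such a formatter. The function name and the block size are hypothetical choices, and the input is assumed to be already aligned (all rows the same length).

```python
def format_alignment(named_seqs, block=50):
    """Wrap aligned sequences into blocks with names padded to a common width."""
    length = len(named_seqs[0][1])
    if any(len(seq) != length for _, seq in named_seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Pad every name to the longest name plus two spaces.
    width = max(len(name) for name, _ in named_seqs) + 2

    lines = []
    for start in range(0, length, block):
        for name, seq in named_seqs:
            lines.append(name.ljust(width) + seq[start:start + block])
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip("\n")


if __name__ == "__main__":
    rows = [
        ("RK2", "...mshNqfqfiGnLtrD"),
        ("E. coli", "MAsRGvNKVILVGnLGqD"),
        ("F", "MAvRGINKVILVGRLGkD"),
    ]
    print(format_alignment(rows, block=10))
```

The output only looks aligned when it is displayed in a fixed width font, which is the point of the advice above.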

Multiple Sequence Alignment with Excel

Positions 1-50
RK2       ...mshNqfqfiGnLtrDtEVRhgnsnkpqAifdiAvnEeWRnda.Gdk
E. coli   MAsRGvNKVILVGnLGqDPEVRYmPNGGAVANitlATSESWRDKaTGEM
F         MAvRGINKVILVGRLGkDPEVRYIPNGGAVANLQVATSESWRDKQTGEM
ColIb-P9  MsaRGINKVILVGRLGnDPEVRYIPNGGAVANLQVATSESWRDKQTGEM
R64       MsaRGINKVILVGRLGnDPEVRYIPNGGAVANLQVATSESWRDKQTGEM
pip71a    MAvRGINKVILVGRLGkDPEVRYIPNGGAVANLQVATSESWRDKQTGEi
pip231a   MAvRGINKVILVGRLGkDPEVRYIPNGGAVANLQVATSEtWRDKQTGKM

Positions 51-100
RK2       qErTdffRikcFGsqAEahGkYLgKGslVfvqGkiRntkyEkd.GqTvY
E. coli   kEQTEWHRVVLFGKLAEVAsEYLRKGsQVYIEGQLRTRkWtDqsGqdRY
F         REQTEWHRVVLFGKLAEVAGEcLRKGAQVYIEGQLRTRSWEDN.GITRY
ColIb-P9  REQTEWHRVVLFGKLAEVAGEYLRKGAQVYIEGQLRTRSWdDN.GITRY
R64       REQTEWHRVVLFGKLAEVAGEYLRKGAQVYIEGQLRTRSWdDN.GITRY
pip71a    REQTEWHRVVLFGKLAEVAGEYLRKGAQVYIEGQLRTRSWEDN.GITRY
pip231a   REQTEWHRVVLFGKLAEVAGEYLRKGAQVYIEGQLRTRSWEDN.GITRY

Positions 101-150
RK2       gTdf..iadkvdyldtkApGgsnQe........................
E. coli   tTEvvVnvgGTMQMLGgrqGggapaggnigg.GQPQsgwgqpqqpqgGn
F         vTEILVKTTGTMQMLvrAaGaqtQpeegqQfsGQPQpepqaEagtKKGG
ColIb-P9  iTEILVKTTGTMQMLGsApqqnaQaqpkpQqnGQPQsadat....KKGG
R64       iTEILVKTTGTMQMLGsApqqnaQaqpkpQqnGQPQsadat....KKGG
pip71a    vTEILVKTTGTMQMLGrAaGtqtQpeeaqQfsGQPQpesqpEp..KKGG
pip231a   vTEILVKTTGTMQMLGrAaGaqtQpeegqQsa.QPQpepqsEagtKKGG

Positions 151-181                            % Identity  % Similarity
RK2       .................................  100.0       100.0
E. coli   qfsgGaqsrpqQsaPaapsnEppmdfd.DDIPF   32.8        56.0
F         AKTKGRgRKAAQPEPQpQpPEGdDYGFSDDIPF   28.5        54.3
ColIb-P9  AKTKGRgRKAAQPEPQpQtPEGeDYGFSDDIPF   30.2        52.6
R64       AKTKGReRKAAQPEPQpQtPEGeDYGFSDDIPF   30.2        52.6
pip71a    AKTKGReRKAAQPEPrqpsepa..YDFdDDIPF   29.3        55.2
pip231a   AKTKGRgRKVAQPEPQlQpPEGdDYGFSDDIPF   29.3        54.3

Excel can be used with any font, because it allows you to adjust the alignment manually.

ClustalW and ClustalX

ClustalW
ClustalW first generates a pairwise distance matrix for all the sequences by pairwise dynamic programming alignment. It then estimates evolutionary distance from similarity scores and constructs a guide tree using the neighbor joining distance matrix method.

Dynamic programming is then used to align the most closely related pairs of sequences. A sequence profile is constructed from these alignments, and the remaining sequences are progressively aligned to each other in order of decreasing similarity by profile-profile, profile-sequence or sequence-sequence alignment, until a complete multiple sequence alignment has been generated.

ClustalW automatically chooses the optimal scoring matrix for protein alignments based on whether the sequences are close or distant neighbors in the tree. Thus it might use BLOSUM 62 (optimal for close relationships) for close neighbors, and BLOSUM 45 (optimal for distant neighbors) for distant neighbors.

ClustalW also allows for scalable gap penalties in protein profile alignments. A gap opening next to a highly conserved residue can be more heavily penalized than a gap opening next to an unconserved residue, for example.

ClustalX
This is a version of ClustalW with a graphical user interface, which is more intuitive to use, though the formatting requirements for input files need to be followed closely. It can display multiple sequence alignments onscreen, or output them as Postscript, which can easily be converted to PDF format by ESP GhostScript with GStill.
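The first two ClustalW steps, building a distance matrix and deriving an order in which to align, can be sketched in a few lines. Two simplifications are swapped in here and should not be mistaken for ClustalW's actual method: distance is taken as one minus fractional identity between equal-length sequences instead of coming from a pairwise dynamic programming alignment, and clusters are joined by simple average linkage instead of neighbor joining. The function names are hypothetical, and the sequences are the first 18 residues of three rows of the Excel alignment earlier in these notes, in uppercase.

```python
from itertools import combinations


def distance(a, b):
    """1 - fractional identity (toy stand-in for an alignment-based distance)."""
    if len(a) != len(b):
        raise ValueError("toy distance needs equal-length sequences")
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / len(a)


def guide_order(seqs):
    """Return the order in which clusters of sequence names are joined.

    Uses average linkage, not neighbor joining (a deliberate simplification).
    """
    dist = {
        frozenset((p, q)): distance(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)
    }

    def linkage(pair):
        # Mean distance between all members of the two clusters.
        c1, c2 = pair
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    seqs = {
        "E. coli": "MASRGVNKVILVGNLGQD",
        "F": "MAVRGINKVILVGRLGKD",
        "ColIb-P9": "MSARGINKVILVGRLGND",
    }
    for step in guide_order(seqs):
        print(step)
```

The most similar pair is joined first and the remaining sequence is added to that cluster afterwards, which mirrors the "most closely related pairs first" order described above.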

Multiple Sequence Alignment with ClustalX

CLUSTAL X (1.82) MULTIPLE SEQUENCE ALIGNMENT
File: tadafasta.ps
Date: Wed Apr 2 12:19:01 2003
Page 1 of 2

::.. : * :. : ::: * **:. *:
V_fisch1     ----------------MDQNKSIYIEIRAQIFDVLD--AETVN---------------------SLSKE--QLHNQLSN--------------------------------AIDLLIERHEWPVSTIVRAEYVTSLVNELQGLGPLQVLM 77
V_fisch2     ----------------MNNNKALYIQLRTQIFNALE--PEALN---------------------KLTKQ--ELTQQLSN--------------------------------AVDLLIDREQLPVSLIMKNEYVESLVNELVGLGPLQNLM 77
V_vulnII1_6  ----------------MNQLKQIYLDLRDEIFDAID--ASTLS---------------------EISNE--ELAEQLSE--------------------------------SVNILIDKKQLQVSSLKRAELVKALYDELKGLGPLQKLV 77
Y_pes        ----------------MIVPLKIQELMRERMLANID--INKVE---------------------LLVGDRNKLIGLLSQ--------------------------------TFDDLFNNNEYNLTTQAQKYIIEMIADEITGFGPLRELM 79
Y_ent        ------------------------------MLASID--IDQVQ---------------------YLVDDYSKLSELLSQ--------------------------------TLDELFNNNDYKLTTQDQKKIITMIADEITGFGPLRELM 65
A_act        -----------------MLTKQQKILLRSEVLSNLD--IEKID---------------------ELQSERSSLVNELVQ--------------------------------IVNRVANKSGAYLTSADTLVMAEIVADEIEGYGPLRDLM 78
H_aph        -----------------MLTKEQQIFLRSEVLSNLD--IEKID---------------------ALQSERNLLVNELVQ--------------------------------IVNRVASKSGTYLTSADTLVMAEIVADEIEGYGPLRDLM 78
P_mul        -----------------MLTKEQQVFFRNELLSNLD--IEKID---------------------EIQSERDKLVDELVQ--------------------------------VVYKVAGKGNIYITSADALFMAECIADEIDGYGPIRELM 78
H_duc        -----------------MLTKDQQVFFRNALLSNLN--VDTLD---------------------EIENERSKLVTELTQ--------------------------------SLYRVANTNNIYITPYDATDMAEIVADEIGGYGPIRELM 78
A_pleur      -----------------MLTKEQQIFFRTELLSNLD--VEKLD---------------------EIQNERNKLIDELTQ--------------------------------SLYRISNLHSIYLTPADAAYMAGLVADEIGGYGPIRELM 78
V_vulnI8_11  MFGN--------KTQMVNVSRGNPLVMPEAAQTAFEKLIEPSE---------------------AVKLTRKQLQQEIKK-------------------------------AVAQLSAQ-QLLPYNQSELAILVEQLCDDMLGVGPIQCLV 89
V_vulnI6_11  MFFKRKNINPEFQEKAAALEAQPSSTISDEVISDIESNVQPIDSNRVEPMQQDKKLLERQAKDKAVEEARKQLEQELAIKHYYHQRLLETLDLGLLSSLEKERAKKDLHDAIVQLMAEDQTHPMSSEGRKRVIKQIEDEVFGLGPLEPLL 150
ruler        1...10...20...30...40...50...60...70...80...90...100...110...120...130...140...150

: :.**::** :::* * : *.. :* :.*:..**:*:. * *:** ****:* *:*::*. :***:*. : :.::: : :.. **:::****:**** ** :* * :*::
V_fisch1     EDESISDIMINGYDKIFIERAGLVEVAPVSFIDEEQLLHIAKRVASQVGRRVDDSSPTCDARLADGSRVNIVIPPIAIDGTSMSIRKFKKDSIGLEKLTEFGALSQEMAQLLMIASRCRLNILISGGTGSGKTTMLNALSQYISEKERIV 227
V_fisch2     DDETITDIMINGHENVFIERDGLVEKVSVNFIDEQQLIDIAKRIASRVGRRVDESSPTCDARLEDGSRVNIVIPPIAIDGTSISIRKFKKQSIAFSDLVEFGAMSKEMAQILMVASRCRLNILISGGTGSGKTTMLNALSQFISEGERIV 227
V_vulnII1_6  ENDDISDIMINGPYDVFIEIGGKVEKSPIQFVNEKQLNTIAKRIASNVGRRIDESSPLCDARLKDGSRVNIVIPPLAIDGTSISIRKFKEQKIKLENLVEFGAMSIEMAKLLSIASHCKCNILISGGTGSGKTTLLNALSGFIGEGERVV 227
Y_pes        EDDSISDIMVNGPERIFIERYGLLKLTDRRFVNNTQLTDIAKRLMQKVNRRIDEGRPLADARLIDGSRINVAISPIALDGTALSIRKFSKNKRRLEDLVDMGAMSSDMANFLIIAASCRVNIIISGGTGSGKTTLLNALSKYISEDERVI 229
Y_ent        EDDSISDIMVNGPEKIFIERFGMITLTSRRFINNAQLTDIAKRLMQRANRRIDEGRPLADARLIDGSRINVAISPIALDGTVLSIRKFSNNKRKLEDLVEMGAMSSDMANFLIIAASCRVNIIISGGTGSGKTTLLNALSMYISENERVI 215
A_act        ADDTINDILVNGPNDIWVERAGILEKTDKEFVSNEQLTDIAKRLVARVGRRIDDGSPLVDSRLPDGSRLNAVIAPIALDGTSISIRKFSKNKKTLQELVNFGSMTRNGE-FLNYCCRSRVNIIVSGGTGSGKTTLLNALSNYISHTERVI 227
H_aph        ADDTINDILVNGPDDVWIERAGILEKTSKEFVSNEQLTDIAKRLVARVGRRIDDGSPLVDSRLPDGSRLNVVIAPIALDGTSVSIRKFSKNKKTLQELVNFGSMTREMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSNYISHSERVI 228
P_mul        EDETVNDILVNGPDDVWVERAGILEKTDKKFISNEQLTDIAKRLVAKVGRRIDDGSPLVDSRLPDGSRLNVVIAPIALDGTSISIRKFSKSKKSLQELVNFGSMTREMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSNYISPKERVI 228
H_duc        EDDTVNDILVNGPDNIWIERAGVLEKTNKTFINNEQLTDIAKRLVARVGRRIDEGMPLVDSRLPDGSRLNVVIQPIALDGTSISIRKFSKSKKSLQELVNFGSMTLDMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSSYISPTERVL 228
A_pleur      EDEGVNDILVNGPDNIWVERAGILEKTDKKFINNEQLTDIAKRLVARVGRRIDEGMPLVDSRLPDGSRLNVVIQPIALDGTSISIRKFSKSKKSLQDLVNYGSMTLDMANFLIIAARSRVNIIVSGGTGSGKTTLLNALSHYISHTERVL 228
V_vulnI8_11  EDPSVSDILVNGPEQIYIERQGKLLKTDIRFRDKKHLLNVAQRIVNAVGRRLDESTPLVDARLEDGSRVNIIAPPLALNGVCISIRKFPERQYDLPGLVAFGSLSEEMAQCLALAARCRLNILVSGGTGAGKTTLLNAMSTPISDDERII 239
V_vulnI6_11  HDKTVSDILVNGPKNIFVERRGKLEKTPYTFLDDRHLRNIIDRIVSQVGRRIDEASPMVDARLLDGSRVNAIIPPLALDGASVSIRRFAVDKLTMDNMLGYNSLSPQMAKFVEAAVKGELNILIAGGTGSGKTTTLNIFSGFIPSDDRII 300
ruler        ...160...170...180...190...200...210...220...230...240...250...260...270...280...290...300

*:**:*** * :** :::***.. *.* :: :*** *:*****:**::** ** *:.:** ******:**:.*:***:* ** * * *. *. :. :* * **:.:::* *.**.*:: * : *:* : : ::
V_fisch1     TIEDAAELKLLQPHVVRLETRNSGIEGNGAITQQDLVINALRMRPDRIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANTPRDAMARVEAMVMMASNNLPLEAIRRTIVSAVDIVIQISRLHDGSRKVMSITEVIGLEGNNVVLEELYKF 377
V_fisch2     TIEDAAELKLQQPHVVRLETRTSGIEGTGVVSQRDLVINSLRMRPDRIIVGECRGGEAFEMLQAMNTGHDGSMSTLHANSPRDALSRVEAMVMMATNNLPLEAVRRTIVSAVDIVIQISRLHDGTRKVMSISEVVGLEGNNVVLEEIFAF 377
V_vulnII1_6  TIEDAAELQLQKPHIVRLETRQASVEGTGQITARDLVINALRMRPDRIIVGECRGAEAFEMLQAMNTGHDGSMSTLHANTPRDAIARTESMVMMATASLPLEAIRRTIVSAVDLIVQVRRLHDGSRKVMYISEIVGLEGNNVVMEDIFRF 377
Y_pes        TLEDAAELNLEQPHVVRMETRLAGLENTGQITMRDLVINSLRMRPDRIIIGECRGEETFEMLQAMNTGHNGSMSTLHANTPRDAVARLESMIMMGPVNMPLITIRRNIASAINLIVQVSRMNDGSRKIRNISEIMGMEGEHVVLQDIFTF 379
Y_ent        TLEDAAELNLEQPHVVRMETRLAGLENTGQITMRDLVINSLRMRPDRIIIGECRGEETFEMLQAMNTGHNGSMSTLHANTPRDAVARLESMIMMGPVNMPILTIRRNIASAINLIVQVSRMNDGSRKLSHISEIMGMEGDNVILQDIFSF 365
A_act        TLEDTAELRLEQPHVVRLETRLAGVEHTGEVTMQDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATSRLESMVMMSNASLPLEAIRRNISSAVNIIVQASRLNDGSRKIMNITEVMGMENGQIVLQDMFSY 377
H_aph        TLEDTAELRLEQPHVVRLETRLAGVEHTGEVTMKDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATSRLESMVMMSNATLPLEAIRRNIASAVNIIVQASRLNDGSRKIVNITEIMGMENGQIVLQDIFSY 378
P_mul        TLEDTAELRLEQPHVVRLETRLAGVERTGEITMQDLVINALRMRPERIIVGECRGGEAFQMLQAMNTGHDGSMSTLHANSPRDATARLESMVMMSNASLPLEAIRRNIASAVNIIVQASRLNDGSRKIMNITELMGMENGQIVMQDIFSY 378
H_duc        TLEDTAELRLEQPHVVRLETRLAGVERTGEITMQDLVINALRMRPERIIVGECRGAEAFQMLQAMNTGHDGSMSTLHANTPRDATARLESMVMMSNASLPLEAIRRNIASAVNIIIQASRLNDGSRKVMNITEVMGMENGQIVLQDIFSF 378
A_pleur      TLEDTAELRLEQPHVVRLETRLAGVERTGEISMQDLVINALRMRPERIIVGECRGAEAFQMLQAMNTGHDGSMSTLHANSPRDALARLESMVMMSNASLPLEAIRRNIASAVNIIIQASRLNDGSRKVTNITEVMGMENGQIVLQDIFSY 378
V_vulnI8_11  TIEDAAELSLTQPHWIQLETRTASSEGTGAVTVRDLVKNALRMRPDRIILGEVRGAEAFDMLQAMNTGHDGSLCTLHANSPADAMLRLENMLMMGAEQIPSAVLRQQISSALDLVVQLERSHDGKRRVTAISAVGGIEQGQIVVHPLFEC 389
V_vulnI6_11  TIEDSAELQLQQPHVVRLETRPPNLEGKGEITQRDLVKNALRMRPDRIVLGEVRGAEAVDMLAAMNTGHDGSLATIHANTPRDALSRVENMFAMAGWNISTKNLRAQIASAIHLVVQMERQEDGKRRMVSIQEINGMEGEIITMSEIFHF 450
ruler        ...310...320...330...340...350...360...370...380...390...400...410...420...430...440...450

Displaying Sequence Data

Displaying Information
Take care with your choice of fixed or variable width fonts.
Use fonts carefully and consistently. Avoid using fonts arbitrarily.
Use black or dark text against a white or very light background (no more than 20% color) to maximize comprehension. Avoid text that blends with a background, and be cautious in using light text on a dark background.
Use shading, case, bold, italic or color when appropriate, to add emphasis, contrast, or draw attention to a feature. Avoid displays in which everything blends together or lacks contrast.
Align items to each other to establish a visual connection. Related items should be grouped in close proximity. Avoid simply placing items arbitrarily.
Use color logically and aesthetically. Avoid the overuse of color.

References
The Mac is Not a Typewriter by Robin Williams
The Non-Designer's Design Book by Robin Williams
The Visual Display of Quantitative Information by Edward R. Tufte
Type & Layout by Colin Wheildon