Download the files protein1.txt and protein2.txt from the course website.

Question 1 Dot plots

Download the files protein1.txt and protein2.txt from the course website. Using the dot plot alignment tool http://athena.bioc.uvic.ca/workbench.php?tool=dotter&db=poxviridae, align the protein sequences contained in these two files using a window size of 20. Use the protein sequence from protein1.txt as the horizontal sequence and the protein sequence from protein2.txt as the vertical sequence.

a) Describe the main features of the plot produced (i.e. are there common domains, insertions, deletions etc.?)

b) The proteins that you used for part a) have GI numbers 13124807 and 45383828. Go to NCBI's protein database and use these GI numbers to look up what the actual proteins are. Use the data from NCBI to explain your observations from part a). [Hint: the page for each protein contains a link to Conserved Domains data that may be helpful.]

Question 2 Nucleotide substitution models

In class, we saw various models for nucleotide substitution, like the Jukes-Cantor and the Kimura models. These models make different assumptions about the probability of a given nucleotide mutating into another nucleotide. To get an idea of the impact of these assumptions, let's work with models that have the following transition probability matrices:

Model 1:

        A     G     C     T
   A  0.91  0.03  0.03  0.03
   G  0.03  0.91  0.03  0.03
   C  0.03  0.03  0.91  0.03
   T  0.03  0.03  0.03  0.91

Model 2:

        A     G     C     T
   A  0.88  0.06  0.03  0.03
   G  0.06  0.88  0.03  0.03
   C  0.03  0.03  0.88  0.06
   T  0.03  0.03  0.06  0.88
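A one-step matrix like the two above gives multi-step probabilities when it is multiplied by itself: entry (x, y) of the t-th power is the probability of going from x to y in t timesteps. The sketch below is an illustration only and not part of the assignment. It is written in Python 3 (the provided code in this handout is Python 2), and the helper names `mat_mul` and `mat_power` are made up for this example.

```python
# Illustrative sketch (Python 3). Helper names are hypothetical.
NUC = ["A", "G", "C", "T"]  # row/column order used in the tables above

MODEL1 = [
    [0.91, 0.03, 0.03, 0.03],
    [0.03, 0.91, 0.03, 0.03],
    [0.03, 0.03, 0.91, 0.03],
    [0.03, 0.03, 0.03, 0.91],
]


def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(P, t):
    """Return the t-step transition matrix P^t for an integer t >= 0."""
    n = len(P)
    # Start from the identity matrix (zero timesteps: nothing changes).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result


P2 = mat_power(MODEL1, 2)
a, g = NUC.index("A"), NUC.index("G")
print("P(A -> A in 2 steps) = %.4f" % P2[a][a])
print("P(A -> G in 2 steps) = %.4f" % P2[a][g])
```

When positions are assumed to evolve independently, the probability for a stretch of several nucleotides is the product of the per-position entries.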

In other words, for model 1, the probability that an A mutates into a G after a single timestep is 0.03, whereas the probability that it stays the same is 0.91.

a) What assumptions about nucleotide mutation rates are being made for these two models? Which of these models do you think is based on more realistic assumptions?

b) Suppose that at time t = 0, a certain stretch of DNA encodes the amino acid methionine with the nucleotides ATG. For each of the models above, calculate the probability that, at time t = 3, that stretch of DNA has evolved to encode the amino acid glycine, encoded by GGC. Assume each position evolves independently.

c) Explain the large discrepancy in the probabilities you calculated in part b) in terms of the relative probabilities of transitions versus transversions in the two models.

d) In the Jukes-Cantor model, what is the probability that in the limit (i.e. after a very large number of time steps), a nucleotide stays the same, e.g. a nucleotide that is C at t = 0 is still C at t = infinity? What is the probability in the Kimura model? How would you figure this out for any given substitution probability matrix?

Question 3 Multiple sequence alignments and phylogenetics

You will be using ClustalW to create a multiple sequence alignment. First, go to the ClustalW webpage http://www.ebi.ac.uk/clustalw/# and set the Output Format to aln w/numbers.

a) Paste the file proteinseqs.fa from the course website into the window and run alignments with the following settings:

   i) Using default settings, with the PAM matrix
   ii) Using default settings, with the BLOSUM matrix
   iii) Using the BLOSUM matrix, with the GAP OPEN penalty set to 100 and GAP EXTENSION to 0.5
   iv) Using the BLOSUM matrix, with GAP OPEN penalty set to 1 and the GAP EXTENSION penalty set to 0.5

Look at the Alignment section of the results page and describe the differences between the alignments produced with settings i) and ii). In addition, describe the differences between the alignments produced with settings ii) through iv) and explain why the alignments are different.
For the rest of this question, use the alignment produced with settings ii).

b) PSSMs are used in database searches (as we saw in class, with the description of PSI-BLAST), finding motifs etc. However, some PSSMs contain more information than others, i.e. there is less uncertainty (entropy) associated with which amino acids are in a particular position in the PSSM, which can make them a better fit for certain applications.
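One common way to make the entropy idea concrete is to estimate the residue frequencies p_a in an alignment column, compute the Shannon entropy H = -Σ p_a log2 p_a, and take the information content of the column as log2(20) - H. The sketch below assumes exactly this definition (uniform background over 20 amino acids, no pseudocounts). The course may define information content differently, so treat it as an illustration. The function names and the toy columns are invented for the example, and the code is Python 3.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Assumes IC = log2(alphabet_size) - H, with H the Shannon entropy of the
    observed residue frequencies (no pseudocounts, uniform background).
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def block_information(rows):
    """Sum the per-column information over a block given as equal-length rows.

    Adding columns up relies on the positions being treated as independent.
    """
    return sum(column_information(col) for col in zip(*rows))


# Toy columns, not taken from any real alignment.
print("fully conserved column: %.3f bits" % column_information("LLLLL"))
print("five different residues: %.3f bits" % column_information("LIVMF"))
```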

Calculate the information content for the MSA segment blocks associated with positions 25-29 and 146-150 of NP_061825.2. (Assume the positions are independent, so the information content at each position can be added up to determine the information content of the entire block.) Which of these blocks has more information? If you had to pick one of these segment blocks to create a PSSM that would be used for a PSI-BLAST database search, which one would you pick, and why?

Label your proteins A-E in top-down order of the multiple sequence alignment and use these labels for parts c) and d).

c) ClustalW has tree-building features; however, we are going to do our own phylogenetic analysis. Use the scores table located on the results page above the MSA to generate a UPGMA matrix analysis. An alignment score S can be converted to a distance D by using the formula $D = -\log_2(0.001 \cdot S)$.

d) From part c), generate a rooted phylogenetic tree and write out the clustering in Newick notation.

e) Look up the organisms associated with your accession numbers. What would you use as an outgroup? Is this consistent with the tree you generated in part d)?

Question 4 Statistics of sequence alignments

This problem will give you a chance to think about the statistics of sequence alignments, when matches are significant and how this depends on the database and the query sequence.

a) You have been able to isolate a fragment of an ancient dinosaur gene from some well preserved fossils. File sequencex.fa posted on the course website contains the amino acid sequence corresponding to this fragment (fragment X). BLAST fragment X against the nr database (use default parameters: BLOSUM62 matrix, gap existence penalty of 11 and gap extension penalty of 1). What are the E-value and the bit score of the best hit? Is this a significant hit and why?

b) Now go back and BLAST the same sequence using the BLOSUM45 matrix (with gap existence and extension penalties of 15 and 2 respectively). You will notice that the top hit is the same as in part a), but the alignment is not. What is the qualitative difference between the two alignments (hint: refer to the graphic at the top of the BLAST results page)? Judging by the E-values of the hits, which matrix would you choose for searching with your sequence?
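The remaining parts of this question rest on the usual Karlin-Altschul relationships between a raw score S, a bit score S' and an E-value: S' = (λS - ln K) / ln 2 and E = m · n · 2^(-S'), where m and n are the (effective) query and database lengths. The sketch below is a minimal Python 3 illustration of those two formulas. The function names are invented, and the values of λ, K, m and n are placeholders rather than numbers from any real search; the real ones are reported at the bottom of your own BLAST results page.

```python
import math


def bit_score(raw, lam, K):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw - math.log(K)) / math.log(2)


def raw_score(bits, lam, K):
    """Invert bit_score: recover the raw score from a bit score."""
    return (bits * math.log(2) + math.log(K)) / lam


def e_value(bits, m, n):
    """Expected number of chance hits with at least this bit score."""
    return m * n * 2.0 ** (-bits)


def min_bit_score(e_threshold, m, n):
    """Smallest bit score whose E-value does not exceed e_threshold."""
    return math.log2(m * n / e_threshold)


# Placeholder parameters for illustration only (not from a real search).
lam, K = 0.267, 0.041
m, n = 100, 1.0e9

bits = bit_score(120, lam, K)
print("bit score for raw score 120: %.1f" % bits)
print("E-value at that bit score:   %.3g" % e_value(bits, m, n))
print("bits needed for E <= 1e-10:  %.1f" % min_bit_score(1e-10, m, n))
```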

c) What are the values of parameters λ and K for your search in parts a) and b) (scroll to the bottom of the BLAST results page)? Calculate the raw score for the top hit in each case.

d) Now BLAST the same fragment X against the refseq database (with the same parameters as in part a): BLOSUM62, gap existence penalty 11, gap extension penalty 1). As you can see, the top hit here is the same as hit 3 in part a). Compare the bit score and the E-value of these hits. Why is the bit score the same and the E-value different?

e) Suppose that you are looking for hits with E-values below $10^{-10}$. Assuming that you are using the same BLAST parameters as in part a), what is the lowest bit score for an alignment that would satisfy this condition if you searched against the nr database? What if you searched against refseq?

Question 5 Constructing substitution matrices

In this problem you will write a python script for deriving a substitution matrix from a multiple sequence alignment (MSA) of evolutionarily related proteins. You will follow the approach taken by Margaret Dayhoff for deriving PAM matrices. An MSA you can use for testing your program is posted on the course website in file evolution-msa.txt. You can assume that the alignment given to you contains no gaps and all sequences are of the same length (one sequence per line, no spaces, single-letter amino acid code). Also assume that your program will be called with exactly one parameter: the name of the file containing the MSA to derive substitutions from (see sample run below). The goal of your program will be to print the necessary values to the screen. Your program must produce output formatted exactly as shown in the sample run. In several cases, you are given functions that perform the appropriate formatting and output. In other cases, you have to code the output yourself. Points will be deducted for incorrectly formatted results. Call your script subm.py and submit it electronically on the course website.

a) First, you will need to count the number of occurrences of each amino acid in your alignment. Define a dictionary object occ, whose keys are amino acid 1-letter codes and whose values correspond to the number of occurrences of each amino acid. Print the contents of occ to the screen (as shown in the sample run).

b) Next, you will need to convert occurrences into frequencies. Define a dictionary object f with keys corresponding to amino acid codes. f[aa_i] should contain the floating point value corresponding to the frequency of occurrence of amino acid aa_i, i.e.

$$ f[aa_i] = \frac{occ[aa_i]}{\sum_{aa} occ[aa]} $$

Print the contents of f to the screen (exactly as shown in the sample run).

c) You will also need to count the number of substitutions (or changes) occurring for all pairs of amino acids. Define a dictionary object A with keys corresponding to all possible 1-letter code amino acid pair combinations. A[aa_i aa_j] should contain the number of times amino acid aa_i replaces amino acid aa_j in your alignment. Note that just by looking at the alignment we do not know in which direction the change happened (i.e. whether it was aa_i → aa_j or aa_j → aa_i), so we have to assume symmetry and count each substitution observed in the alignment as both aa_i → aa_j and aa_j → aa_i. Print the contents of A to the screen (use function print_table_int provided below).

d) Next, you have to calculate the mutability of each amino acid. Define a dictionary object m with keys corresponding to amino acid 1-letter codes. The value of m[aa_i] should correspond to the floating point mutability of amino acid aa_i, i.e.

$$ m[aa_i] = \frac{\#\ \text{times } aa_i \text{ changes}}{\#\ \text{occurrences of } aa_i} = \frac{\sum_{j \neq i} A[aa_i\,aa_j]}{occ[aa_i]} $$

Print the contents of m to the screen (exactly as shown in the sample run).

e) In order to build a PAM1 matrix (i.e. a substitution matrix corresponding to an evolutionary period of 1 PAM), you have to normalize your frequencies such that there is on average 1 mutation expected per 100 amino acids. For this, you have to find the appropriate value of the evolutionary scale factor λ. This value must satisfy the equation:

$$ \sum_{aa} f[aa]\, m[aa]\, \lambda = 0.01 \quad \text{or} \quad \lambda = \frac{0.01}{\sum_{aa} f[aa]\, m[aa]} $$

Find the appropriate value for λ and bind it to a floating point variable lamb. Print the contents of lamb to the screen (exactly as shown).

f) You now have all the necessary components to calculate the PAM1 matrix. Define the matrix as a dictionary object M with keys corresponding to pairs of amino acid single-letter codes. M[aa_i aa_j] will represent the probability of substitution of amino acid aa_i for amino acid aa_j (see equation 1) and M[aa_i aa_i] will correspond to the probability of amino acid aa_i not changing (see equation 2). Populate these with the appropriate floating point values.

$$ M[aa_i\,aa_j] = \lambda\, m[aa_j]\, \frac{A[aa_i\,aa_j]}{\sum_{k \neq j} A[aa_k\,aa_j]} \qquad \text{(equation 1)} $$

$$ M[aa_i\,aa_i] = 1 - \lambda\, m[aa_i] \qquad \text{(equation 2)} $$

Print the contents of M to the screen (use function print_table_float provided below).
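Parts a) through f) can be traced by hand on a toy alignment. The sketch below is one possible reading and not the reference solution. It is written in Python 3 rather than the Python 2 of the provided code, uses a made-up three-letter alphabet and three made-up sequences, and assumes that substitutions are counted by comparing every pair of sequences column by column, with each pair contributing to both A[xy] and A[yx]. The function name `pam1_from_msa` is invented for the example.

```python
from itertools import combinations


def pam1_from_msa(msa, alphabet):
    """Toy walk-through of parts a) to f).

    Assumes every letter of `alphabet` occurs at least once in `msa`.
    """
    # a) occurrences of each letter
    occ = {a: 0 for a in alphabet}
    for seq in msa:
        for x in seq:
            occ[x] += 1

    # b) frequencies
    total = sum(occ.values())
    f = {a: occ[a] / total for a in alphabet}

    # c) symmetric pair counts over all pairs of sequences (assumption)
    A = {a + b: 0 for a in alphabet for b in alphabet}
    for s, t in combinations(msa, 2):
        for x, y in zip(s, t):
            A[x + y] += 1
            A[y + x] += 1

    # d) mutabilities; A is symmetric, so the off-diagonal row sum used here
    #    equals the off-diagonal column sum that appears in equation 1
    changes = {a: sum(A[a + b] for b in alphabet if b != a) for a in alphabet}
    m = {a: changes[a] / occ[a] for a in alphabet}

    # e) scale factor giving 1 expected change per 100 residues
    lamb = 0.01 / sum(f[a] * m[a] for a in alphabet)

    # f) PAM1 probabilities (equations 1 and 2)
    M = {}
    for j in alphabet:
        for i in alphabet:
            if i == j:
                M[i + j] = 1.0 - lamb * m[j]
            elif changes[j] == 0:
                M[i + j] = 0.0  # aa_j never changes in this alignment
            else:
                M[i + j] = lamb * m[j] * A[i + j] / changes[j]
    return occ, f, A, m, lamb, M


# Made-up data: three sequences of length 2 over the alphabet A, C, D.
occ, f, A, m, lamb, M = pam1_from_msa(["AC", "AC", "AD"], "ACD")
print("occ  =", occ)
print("lamb =", lamb)
print("M[CD] = %.3f  M[DD] = %.3f" % (M["CD"], M["DD"]))
```

With these definitions each column of M sums to 1, and the expected fraction of residues that change, Σ f[aa] (1 - M[aa aa]), equals 0.01.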
g) To make it easier to score alignments, you need to convert your probability matrix into a log odds ratio matrix. Define this matrix as a dictionary object PAM with keys corresponding to pairs of amino acid single-letter codes. Populate this dictionary with floating point log odds ratios:

$$ PAM[aa_i\,aa_j] = PAM[aa_j\,aa_i] = \frac{\log_{10}\!\left(\dfrac{M[aa_i\,aa_j]}{f[aa_i]}\right) + \log_{10}\!\left(\dfrac{M[aa_j\,aa_i]}{f[aa_j]}\right)}{2} $$

Print the contents of PAM to the screen (use function print_table_float provided below).

Provided code:

Function print_table_float takes a dictionary object over all single-letter amino-acid pairs (D) and a list of single-letter amino-acid names (aas), and outputs the values in the dictionary corresponding to pairs of amino acids in aas. Values are multiplied by 100 before printing. The default format used for output is %5.2f (i.e. floating point format, 2 decimal places are shown and the length of the field is 5 characters; ex: 0.52, -1.42, 21.03); larger magnitudes are printed with fewer decimal places.

    def print_table_float(D, aas):
        # header row: one column label per amino acid
        for k in aas:
            print "%5s" % (k),
        print
        # one row per amino acid ki, one column per amino acid kj
        for ki in aas:
            print ki,
            for kj in aas:
                if (abs(D[ki+kj]) >= 0.1 and abs(D[ki+kj]) <= 1.0):
                    print "%5.1f" % (100*D[ki+kj]),
                elif (abs(D[ki+kj]) > 1.0):
                    print "%5.0f" % (100*D[ki+kj]),
                else:
                    print "%5.2f" % (100*D[ki+kj]),
            print

Function print_table_int does the same as function print_table_float except the output format for values is %5d (i.e. integer format, length of the field is 5 characters; ex: 1, 12, 492, 3492, or 31492).

    def print_table_int(D, aas):
        # header row: one column label per amino acid
        for k in aas:
            print "%5s" % (k),
        print
        # one row per amino acid ki, one column per amino acid kj
        for ki in aas:
            print ki,
            for kj in aas:
                print "%5d" % (D[ki+kj]),
            print

Note that in order to be able to call these functions from your code, you must put them at the top of your script right after all the relevant import statements and just before your own code (the main body of your script).

Hints:

Many of the objects you will use in this program are dictionaries with keys corresponding to amino acid names. It may be useful up front to define a list of all single-letter amino acid codes (especially because functions print_table_int and print_table_float require it). Moreover, it may be useful to use this list to generate dictionaries with appropriate keys. Here is a simple example of how that might work:

    AA = ['A','C','D','E','F','G','H','I','K','L',
          'M','N','P','Q','R','S','T','V','W','Y']
    S = dict()
    D = dict()
    for i in range(len(AA)):
        S[AA[i]] = 0  # populate dictionary
        print "S[%s] = %d" % (AA[i], S[AA[i]])
    for i in range(len(AA)):
        for j in range(len(AA)):
            D[AA[i]+AA[j]] = 0  # populate dictionary
            print "D[%s] = %d" % (AA[i]+AA[j], D[AA[i]+AA[j]])

Dictionary object S above can be used to store properties of single amino acids, whereas dictionary object D is useful for storing amino acid pair properties (such as substitution frequencies, for example). Notice how a set of two nested loops can be used to iterate over all combinations of amino acid pairs.

You will often use loops for this programming assignment. In some cases, you might find it useful to treat some of the iterations of the loop differently than others. For instance, when iterating over all pairs of amino acids, you might find it necessary to skip those iterations corresponding to the first and the second amino acids being the same. You can do this with a continue statement:

    AA = ['A','C','D','E','F','G','H','I','K','L',
          'M','N','P','Q','R','S','T','V','W','Y']
    for aai in AA:
        for aaj in AA:
            if aai == aaj: continue
            # do some stuff if aai is not the same as aaj

OR

    AA = ['A','C','D','E','F','G','H','I','K','L',
          'M','N','P','Q','R','S','T','V','W','Y']
    for aai in AA:
        for aaj in AA:
            if aai != aaj:
                pass  # do some stuff if aai is not the same as aaj

Your program cannot assume a priori what the name of the file containing the sequence alignment will be. Instead, you are told that the name will be specified on the command line as an argument, so your program has to fetch it. In python, functions for processing command line arguments reside in module sys. This snippet of code shows how you might read the name of a file from the command line and process it:

    import sys

    # Error check
    if (len(sys.argv) != 2):
        print "Expected 1 argument, but got", len(sys.argv)-1
        sys.exit(1)

    # Read all lines in file
    filename = sys.argv[1]
    infile = open(filename)
    lines = infile.readlines()
    infile.close()

List sys.argv contains the list of all command line arguments, including the name of your program. That is, if you invoke your program with python program.py first.txt second.txt 123, then sys.argv will look like ['program.py', 'first.txt', 'second.txt', '123']. Remember that list indices start with 0 in python. Thus, if you are looking for the first argument after the command name, it will be in sys.argv[1]. Also notice how the code above checks to make sure that there are as many command line arguments as expected (two: the name of our program plus the name of the file with the alignment). Error checking is generally not required in this course, but it is recommended (and it will make life easier when you forget to specify an argument during testing).

To find base 10 logarithms in python, import the math module (import math at the beginning of your program) and type math.log10(n), where n represents some number value. Remember that quotients will be rounded down to an integer value unless the numerator is defined as a floating point value. Any number can be converted to a float by typing float(n), where n is a number or a variable bound to a number.

The sample run shown below is also posted in file sample.run.txt on the course web site. It is recommended that you compare the output of your program on the sequence alignment provided with this output using the Unix command diff.
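The two numeric hints above (float conversion before dividing, and math.log10) can be checked against the first line of the sample run, where amino acid A occurs 774 times and the 20 occurrence counts add up to 11600. The snippet below is an illustration rather than part of the handout, and its variable names are made up. It uses `//` for the rounded-down quotient and `float(...)` for the true quotient, so it behaves the same under Python 2 and Python 3.

```python
import math

occ_A = 774     # occurrences of A in the sample run
total = 11600   # sum of all 20 occurrence counts in the sample run

# Integer (floor) division: what an int/int quotient gives in Python 2.
floor_quotient = occ_A // total

# Converting the numerator to float first gives the actual frequency.
freq_A = float(occ_A) / total

# Base 10 logarithm of the frequency.
log_freq_A = math.log10(freq_A)

print("floor quotient:     %d" % floor_quotient)
print("frequency of A:     %.12f" % freq_A)
print("log10 of frequency: %.4f" % log_freq_A)
```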
Sample run (also posted as sample.run.txt):

[gevorg@keating9 Spring 06]$ head evolution-msa.txt
VDLFAGIGGFHAALKILVGCAPCQDFSQYTQCEMILAFLSFADYFRPRYFLLENVRTFKYILTPVLWKYLYRYAKKHQARGNGFGYKQMGNGVNVGVVRRLSPRETARLQGLPEWF
ASLFAGVGGIDLGFDIILGGFPCQDFSMIWRGTLFWNIARIIEEREPTVLILENVRNLGYQVRFGILEAGAYGVSQSRKRAFIWAYKQLGNSVTVKVVRRLTVRECALIQSFPPDY
IDLFCGVGGLTHGLDMIVGGFPCQDYSVARRNQLYVPYFGFVEEFRPKAFLIENVVGLPAIFSPHLLPAWMGGTPQVRERVFITAFKTIGNGVPFLAARKLTPRECFNFQGYPEDF
GDTFCGGGGVSLGADFFTYSFPCQDISVAGRGQLYLWLKKVVEITKPKVFIAENVKGLGYAADVVTLDACEYGVPQHRRRVFIFGRRQIGNAVPPQGVRLFSELELKRLMGFPVDF
IDLFAGIGGIRRGFDLLTYSFPCQDLSQQGSGDLFFETLRLIVAKKPQVIFLENVKNLYVVLDAQVLNAKNYGVAQNRERVIFIGYKQFGNSVVINVLRNFTAREGARIQSFPDTY
LDIFAGCGGLSHGLKLWTMSPSCQPFTRIGRRNIIFDVLRILKKKQPKMFLLENVKGLGYTVYFKVLNTLDFGLPQKRERIYIVGYRQLGNAVPIGLGRRISAAEALAIQSLPKEF
LDVFSGCGGLSEGFDILHLSPPCQTFSRAHRGTLFFETALLAEEKKPKFVILENVKGLYHIKYQVLNAKDFGNIPQNRERIYIVGITRIGNQIEKIDSRAISLREAALLQTFPRSY
LDVFSGCGGLSEGFDVLLAGFPCQPFSLAGRNSLFVDFVRFVKFFSPKFFVMENVLGIGYGVSYLLLNSSTFGVPQNRVRIYILGIFVCGNSISVEVLRKLHPRECARVMGYPDSY
LDVFSGCGGLSEGFDVLTGGFPCQPFSKSGRANLTLDFAKIVLAIQPAWVIMENVERAGYSVFYEVMDAQNFGLPQRRERIVIVGYRQFGNSVVVPVFRLLTTNECKAIMGFPKDF
LDVFAGCGGLSEGFDVVMGGPPCQGFSTYGNGKLSQSYVDLICKNQPDFFVFENVKGLLNYGVYLILNSSNFQVPQNRLRVYIVGRKVAGNSVSVPVIPRLTVRMTARIQGFPDDW
[gevorg@keating9 ps2]$ python subm.py evolution-msa.txt

Occurrences:
A 774
C 256
D 440
E 496
F 912
G 1246
H 115
I 762
K 396
L 1150
M 177
N 488
P 702
Q 493
R 836
S 554
T 310
V 930
W 84
Y 479
Frequencies:
A 0.066724137931
C 0.0220689655172
D 0.0379310344828
E 0.0427586206897
F 0.0786206896552
G 0.107413793103
H 0.00991379310345
I 0.0656896551724
K 0.0341379310345
L 0.0991379310345
M 0.0152586206897
N 0.0420689655172
P 0.0605172413793
Q 0.0425
R 0.0720689655172
S 0.0477586206897
T 0.026724137931
V 0.0801724137931
W 0.00724137931034
Y 0.0412931034483
Number of substitutions:
    A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A 25062  2422  1179  1239  3236  7546   415  4663  1581  5160  1534  1160  3407   733  2034  7233  1206  4203   666  1947
C  2422 12348   212   167   702   804   163  1890   175  1229   279   554   301   341   389   687   452  1428    90   711
D  1179   212 17932  4473   880  1551   899   299  2190   852   404  1741  1399  1040  1903  3489  1081   678   254  1104
E  1239   167  4473 22946   706   927   806   609  2550  1540   358  2295   985  1767  2534  1985  1052   702   267  1196
F  3236   702   880   706 37742  1328   883  8268   874 10984  1569   601  2201   659   689  1023  1046  5715  1989  9193
G  7546   804  1551   927  1328 84578   446  1750  1662  2528   687  6049  1512   837  1464  4651  1453  1537   282  1762
H   415   163   899   806   883   446   582   439   451   778    70   677   220   408   753   734   517   325   212  1607
I  4663  1890   299   609  8268  1750   439 17098  1083 15614  2099   692   817   924  1030  1384  1380 11903   692  2804
K  1581   175  2190  2550   874  1662   451  1083  8872  1855   189  1861  1593  2328  7268  1342   964  1171   134  1061
L  5160  1229   852  1540 10984  2528   778 15614  1855 41500  4415  1195  1290  1788  2779  2006  2116 13205   790  2226
M  1534   279   404   358  1569   687    70  2099   189  4415  1350   203    48  1845   184   178   199  1506    57   349
N  1160   554  1741  2295   601  6049   677   692  1861  1195   203 21656   978  2258  2224  1261  1023   958   178   748
P  3407   301  1399   985  2201  1512   220   817  1593  1290    48   978 43932   845  1384  2008  1129  5060   119   270
Q   733   341  1040  1767   659   837   408   924  2328  1788  1845  2258   845 25790  2810  1774  1147   816    80   617
R  2034   389  1903  2534   689  1464   753  1030  7268  2779   184  2224  1384  2810 48610  2774  1389  1111    86  1349
S  7233   687  3489  1985  1023  4651   734  1384  1342  2006   178  1261  2008  1774  2774 15582  4527  1241   335   632
T  1206   452  1081  1052  1046  1453   517  1380   964  2116   199  1023  1129  1147  1389  4527  6632  2316   234   827
V  4203  1428   678   702  5715  1537   325 11903  1171 13205  1506   958  5060   816  1111  1241  2316 35736   561  1898
W   666    90   254   267  1989   282   212   692   134   790    57   178   119    80    86   335   234   561   548   742
Y  1947   711  1104  1196  9193  1762  1607  2804  1061  2226   349   748   270   617  1349   632   827  1898   742 16378
Mutabilities:
A 66.6201550388
C 50.765625
D 58.2454545455
E 52.7379032258
F 57.6162280702
G 31.1203852327
H 93.9391304348
I 76.56167979
K 76.595959596
L 62.9130434783
M 91.3728813559
N 54.6229508197
P 36.4188034188
Q 46.6876267748
R 40.8540669856
S 70.8736462094
T 77.6064516129
V 60.5741935484
W 92.4761904762
Y 64.8079331942
lambda = 0.000174823593951
Substitution probabilities:
    A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A  98.8  0.17  0.05  0.04  0.06  0.11  0.06  0.11  0.07  0.08  0.15  0.04  0.08  0.03  0.04  0.23  0.07  0.08  0.14  0.07
C  0.05  99.1  0.01  0.01  0.01  0.01  0.02  0.04  0.01  0.02  0.03  0.02  0.01  0.01  0.01  0.02  0.03  0.03  0.02  0.03
D  0.03  0.01  99.0  0.16  0.02  0.02  0.14  0.01  0.10  0.01  0.04  0.06  0.03  0.04  0.04  0.11  0.06  0.01  0.05  0.04
E  0.03  0.01  0.18  99.1  0.01  0.01  0.12  0.01  0.11  0.02  0.04  0.08  0.02  0.06  0.05  0.06  0.06  0.01  0.06  0.04
F  0.07  0.05  0.03  0.02  99.0  0.02  0.13  0.19  0.04  0.17  0.15  0.02  0.05  0.02  0.01  0.03  0.06  0.11  0.41  0.34
G  0.17  0.05  0.06  0.03  0.03  99.5  0.07  0.04  0.07  0.04  0.07  0.22  0.04  0.03  0.03  0.15  0.08  0.03  0.06  0.06
H  0.01  0.01  0.04  0.03  0.02  0.01  98.4  0.01  0.02  0.01  0.01  0.02  0.01  0.01  0.02  0.02  0.03  0.01  0.04  0.06
I  0.11  0.13  0.01  0.02  0.16  0.02  0.07  98.7  0.05  0.24  0.21  0.02  0.02  0.03  0.02  0.04  0.08  0.22  0.14  0.10
K  0.04  0.01  0.09  0.09  0.02  0.02  0.07  0.02  98.7  0.03  0.02  0.07  0.04  0.08  0.15  0.04  0.05  0.02  0.03  0.04
L  0.12  0.08  0.03  0.05  0.21  0.04  0.12  0.36  0.08  98.9  0.44  0.04  0.03  0.06  0.06  0.06  0.12  0.25  0.16  0.08
M  0.03  0.02  0.02  0.01  0.03  0.01  0.01  0.05  0.01  0.07  98.4  0.01  0.00  0.07  0.00  0.01  0.01  0.03  0.01  0.01
N  0.03  0.04  0.07  0.08  0.01  0.08  0.10  0.02  0.08  0.02  0.02  99.0  0.02  0.08  0.05  0.04  0.06  0.02  0.04  0.03
P  0.08  0.02  0.06  0.03  0.04  0.02  0.03  0.02  0.07  0.02  0.00  0.04  99.4  0.03  0.03  0.06  0.06  0.10  0.02  0.01
Q  0.02  0.02  0.04  0.06  0.01  0.01  0.06  0.02  0.10  0.03  0.18  0.08  0.02  99.2  0.06  0.06  0.06  0.02  0.02  0.02
R  0.05  0.03  0.08  0.09  0.01  0.02  0.11  0.02  0.32  0.04  0.02  0.08  0.03  0.10  99.3  0.09  0.08  0.02  0.02  0.05
S  0.16  0.05  0.14  0.07  0.02  0.07  0.11  0.03  0.06  0.03  0.02  0.05  0.05  0.06  0.06  98.8  0.26  0.02  0.07  0.02
T  0.03  0.03  0.04  0.04  0.02  0.02  0.08  0.03  0.04  0.03  0.02  0.04  0.03  0.04  0.03  0.14  98.6  0.04  0.05  0.03
V  0.09  0.10  0.03  0.02  0.11  0.02  0.05  0.27  0.05  0.20  0.15  0.03  0.13  0.03  0.02  0.04  0.13  98.9  0.12  0.07
W  0.02  0.01  0.01  0.01  0.04  0.00  0.03  0.02  0.01  0.01  0.01  0.01  0.00  0.00  0.00  0.01  0.01  0.01  98.4  0.03
Y  0.04  0.05  0.04  0.04  0.18  0.02  0.24  0.06  0.05  0.03  0.03  0.03  0.01  0.02  0.03  0.02  0.05  0.04  0.15  98.9
Log-odds scores:
    A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A   117  -161  -215  -218  -203  -180  -202  -179  -198  -193  -164  -221  -190  -241  -220  -147  -199  -193  -168  -197
C  -161   165  -242  -257  -221  -229  -195  -171  -246  -207  -190  -205  -247  -226  -243  -201  -194  -191  -207  -193
D  -215  -242   142  -138  -235  -224  -144  -274  -159  -247  -198  -178  -204  -201  -198  -154  -179  -247  -186  -197
E  -218  -257  -138   136  -250  -252  -154  -249  -158  -226  -208  -172  -224  -183  -191  -183  -186  -251  -189  -199
F  -203  -221  -235  -250   110  -263  -177  -162  -231  -167  -171  -256  -216  -253  -274  -239  -212  -186  -128  -137
G  -180  -229  -224  -252  -263  96.7  -220  -243  -217  -245  -220  -170  -246  -256  -255  -186  -212  -257  -226  -222
H  -202  -195  -144  -154  -177  -220   200  -199  -170  -192  -216  -161  -226  -184  -180  -163  -153  -221  -135  -123
I  -179  -171  -274  -249  -162  -243  -199   118  -214  -144  -150  -242  -251  -230  -248  -218  -193  -147  -166  -181
K  -198  -246  -159  -158  -231  -217  -170  -214   146  -208  -226  -171  -193  -162  -135  -191  -180  -219  -209  -195
L  -193  -207  -247  -226  -167  -245  -192  -144  -208  99.9  -136  -236  -249  -219  -223  -219  -192  -160  -178  -209
M  -164  -190  -198  -208  -171  -220  -216  -150  -226  -136   181  -232  -311  -137  -260  -243  -213  -173  -211  -208
N  -221  -205  -178  -172  -256  -170  -161  -242  -171  -236  -232   137  -224  -172  -196  -202  -186  -237  -206  -219
P  -190  -247  -204  -224  -216  -246  -226  -251  -193  -249  -311  -224   122  -231  -232  -198  -198  -180  -239  -279
Q  -241  -226  -201  -183  -253  -256  -184  -230  -162  -219  -137  -172  -231   137  -186  -188  -182  -244  -241  -228
R  -220  -243  -198  -191  -274  -255  -180  -248  -135  -223  -260  -196  -232  -186   114  -192  -196  -254  -260  -217
S  -147  -201  -154  -183  -239  -186  -163  -218  -191  -219  -243  -202  -198  -188  -192   132  -127  -231  -184  -232
T  -199  -194  -179  -186  -212  -212  -153  -193  -180  -192  -213  -186  -198  -182  -196  -127   157  -179  -174  -195
V  -193  -191  -247  -251  -186  -257  -221  -147  -219  -160  -173  -237  -180  -244  -254  -231  -179   109  -184  -206
W  -168  -207  -186  -189  -128  -226  -135  -166  -209  -178  -211  -206  -239  -241  -260  -184  -174  -184   213  -143
Y  -197  -193  -197  -199  -137  -222  -123  -181  -195  -209  -208  -219  -279  -228  -217  -232  -195  -206  -143   138
[gevorg@keating9 ps2]$
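Individual entries of the sample run can be cross-checked against equations 1 and 2 and the log-odds formula from part g). The sketch below does this for the A/A and A/C entries, using only numbers copied from the sample run above. It is an illustration in Python 3 with made-up variable names. It assumes that the sum in equation 1 runs over k ≠ j, in which case substituting the definition of the mutability reduces equation 1 to λ · A[aa_i aa_j] / occ[aa_j].

```python
import math

# Numbers copied from the sample run above.
lamb = 0.000174823593951
occ = {"A": 774, "C": 256}
f = {"A": 0.066724137931, "C": 0.0220689655172}
m_A = 66.6201550388
A_AC = 2422  # A["AC"], equal to A["CA"] by symmetry

# Equation 2: probability that A does not change.
M_AA = 1.0 - lamb * m_A

# Equation 1 after substituting m (assumes the sum excludes the diagonal).
M_AC = lamb * A_AC / occ["C"]  # C replaced by A
M_CA = lamb * A_AC / occ["A"]  # A replaced by C

# Log-odds scores from part g).
pam_AA = math.log10(M_AA / f["A"])
pam_AC = (math.log10(M_AC / f["A"]) + math.log10(M_CA / f["C"])) / 2

# Same scaling (x100) and formats as print_table_float uses for these sizes.
print("M[AA]   -> %5.1f" % (100 * M_AA))
print("M[AC]   -> %5.2f" % (100 * M_AC))
print("M[CA]   -> %5.2f" % (100 * M_CA))
print("PAM[AA] -> %5.0f" % (100 * pam_AA))
print("PAM[AC] -> %5.0f" % (100 * pam_AC))
```

Under these assumptions the printed values should agree with the A/A, A/C and C/A entries of the substitution probability table (98.8, 0.17, 0.05) and with the A/A and A/C entries of the log-odds table (117, -161).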