BIOINFORMATICS KT
Maastricht University
Faculty of Humanities and Science
Knowledge Engineering Study

TRIAL EXAMINATION MASTERS KT-OR

Examiner: R.L. Westra
Date: March 30, 2007
Time: 13:30 - 15:30
Place: TS06, room 1.014

Notes:
1. The exam is an open-book exam.
2. The textbooks and lecture slides can be used during the exam.
3. The exam consists of x pages (including this page).
4. The exam time is 3 hours (180 minutes).
5. The number of exam questions is 5.
6. The number of points for each question is given (in bold).
7. The maximum number of points is 10.
8. The final exam grade is the sum of the points of the questions answered correctly.
9. The final course grade is the sum of the final exam grade plus the bonus grade that you earned from the student lectures, mini-exams, and skills class hand-ins. The final course grade will be rounded to 10.
10. Before answering the questions, please first read all the exam questions, and then plan how to spend the three hours.
11. When answering the questions, please do not forget: to write your name and student number on each answer page; to number the answers; and to number the answer pages.
EXERCISE 1: Short questions: 2 points
i. What is a change point, and how could you detect it?
ii. What is the importance of the KA/KS ratio?
iii. How can you find the root of an unrooted phylogenetic tree?
iv. Which operations on a chromosome can change block synteny?
v. Why is average linkage better than single linkage?

EXERCISE 2: Statistical sequence analysis: 2 points
Suppose that we analyse a DNA sequence with the following observed multinomial distribution:

p(a) = 0.5
p(c) = 0.1
p(g) = 0.3
p(t) = 0.1

Moreover, suppose that we also observe the following di-nucleotide frequencies (rows: first nucleotide, columns: second nucleotide):

      *A     *C     *G     *T
A*   0.10   0.10   0.09   0.10
C*   0.02   0.08   0.05   0.08
G*   0.07   0.05   0.07   0.02
T*   0.05   0      0.08   0.04

Use this information to determine unusual dimers (= di-nucleotides).
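One common way to approach Exercise 2 is to compare each observed dimer frequency with the product of the two single-nucleotide frequencies, which is the value expected if neighbouring positions were independent. The Python sketch below does this under that assumption; the function names odds_ratios and unusual_dimers and the cut-offs 0.5 and 2.0 are illustrative choices, not part of the exam.

```python
# Single-nucleotide frequencies from the exercise.
p1 = {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1}

# Observed dinucleotide frequencies: outer key = first base, inner key = second base.
p2 = {
    "A": {"A": 0.10, "C": 0.10, "G": 0.09, "T": 0.10},
    "C": {"A": 0.02, "C": 0.08, "G": 0.05, "T": 0.08},
    "G": {"A": 0.07, "C": 0.05, "G": 0.07, "T": 0.02},
    "T": {"A": 0.05, "C": 0.00, "G": 0.08, "T": 0.04},
}


def odds_ratios(p1, p2):
    """Observed / expected frequency for every dimer, assuming independence."""
    return {x + y: p2[x][y] / (p1[x] * p1[y]) for x in p1 for y in p1}


def unusual_dimers(ratios, low=0.5, high=2.0):
    """Dimers whose ratio lies outside [low, high]; the cut-offs are arbitrary."""
    return {d: r for d, r in ratios.items() if r < low or r > high}


ratios = odds_ratios(p1, p2)
for dimer, ratio in sorted(unusual_dimers(ratios).items(), key=lambda kv: kv[1]):
    print(f"{dimer}: observed/expected = {ratio:.2f}")
```

A ratio near 1 means the dimer occurs about as often as independence predicts; what counts as "considerably" different from 1 is a judgment call that the fixed cut-offs only approximate.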
EXERCISE 3: Genetic distance: 2 points
As the time of divergence between two sequences grows, the count of differences, d, increases. The true genetic distance, K, between the sequences, however, is not equal to d.
a. Explain this phenomenon and describe a method for correcting d to estimate K.
b. Describe how accounting for the role of transitions and transversions can improve this estimate for K.
c. Suppose that two sequences of equal length, 1000 nucleotides each, differ at 150 positions. Estimate, following the Jukes-Cantor model, the true genetic distance K, including the magnitude of the error.
d. Determine when, for two sequences of equal length n, following the Jukes-Cantor model, determining the genetic distance becomes entirely inconclusive because the variance exceeds the estimate K.

EXERCISE 4: Sequence alignment: 2 points
In sequence alignment, sequences are compared in order to determine their degree of resemblance.
a. Describe the difference(s) between local and global alignment.

Suppose that in global alignment we use the following simple scoring function:

σ(-,a) = σ(a,-) = σ(a,b) = -1 for all a ≠ b
σ(a,b) = 1 if a = b

for all letters a,b in {A,C,G,T}, and with - representing an indel. Now, suppose we have the following two amino acid sequences:

s1 = FILM
s2 = FEIM

b. Write down the dynamic programming table resulting from the global alignment of the two strings s1 and s2 using the Needleman-Wunsch algorithm.
c. Write down the optimal global alignment of s1 and s2 according to the dynamic programming table obtained above.
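To check a hand-filled table for Exercise 4, the following Python sketch fills the Needleman-Wunsch table with the scoring function given above (match +1, mismatch -1, indel -1) and traces back one optimal alignment. The function name needleman_wunsch and the tie-breaking order in the traceback (diagonal, then up, then left) are choices made for this sketch; other tie-breaking rules can return a different alignment with the same score.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment. Returns (score table, aligned s1, aligned s2)."""
    n, m = len(s1), len(s2)
    # F[i][j] = best score for aligning the prefixes s1[:i] and s2[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s1[i-1] with s2[j-1]
                          F[i - 1][j] + gap,      # s1[i-1] against an indel
                          F[i][j - 1] + gap)      # s2[j-1] against an indel

    # Traceback from the bottom-right cell to the top-left cell.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return F, "".join(reversed(a1)), "".join(reversed(a2))


table, top, bottom = needleman_wunsch("FILM", "FEIM")
for row in table:
    print(row)
print(top)
print(bottom)
```

The bottom-right entry of the table is the score of the optimal global alignment.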
EXERCISE 5: Hidden Markov model: 2 points
Consider an odorant receptor, i.e. a protein molecule that sticks in the cell membrane, extending into both the interior and the exterior of the cell. This is visualised in the picture below, where the fat curved line represents the folded protein. The floating of this molecule in the cell membrane is caused by parts of the molecule that like to be in the membrane (hydrophobic) and parts that definitely do not (hydrophilic). Therefore, we consider that a part of the molecule can be in one of two states: H+ = hydrophilic, and H- = hydrophobic. We employ a Hidden Markov model to estimate the hydrophobic and hydrophilic parts of the molecule, represented as:

States and transition labels, in the order shown in the diagram:
H+   0.3
H-   0.2

Emission probabilities, in the order shown in the diagram:
A: 0.1   R: 0.2   N: 0     D: 0.5   C: 0.2
A: 0.3   R: 0.2   N: 0.4   D: 0     C: 0.1

Suppose that we have the following protein sequence:

NARNRDCCRN

Determine the most likely hidden sequence of hydrophobicity states over the molecule using the Viterbi algorithm.
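The Viterbi recursion asked for in Exercise 5 can be sketched in Python as follows. The model diagram is not fully recoverable from the text, so the sketch rests on several assumptions that should be checked against the original figure: the first emission list belongs to H+ and the second to H-; 0.3 is the probability of moving from H+ to H- and 0.2 the probability of moving from H- to H+ (so the self-transitions are 0.7 and 0.8); and both states are equally likely at the start. The function name viterbi is likewise just a label for this sketch.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, joint probability of that path)."""
    # V[t][s] = probability of the best path that ends in state s at position t.
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor r for state s at position t.
            prev, prob = max(
                ((r, V[t - 1][r] * trans[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            V[t][s] = prob * emit[s][obs[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]


states = ["H+", "H-"]
start = {"H+": 0.5, "H-": 0.5}                     # assumed, not given in the exam
trans = {"H+": {"H+": 0.7, "H-": 0.3},             # assumed direction of the 0.3 arrow
         "H-": {"H+": 0.2, "H-": 0.8}}             # assumed direction of the 0.2 arrow
emit = {"H+": {"A": 0.1, "R": 0.2, "N": 0.0, "D": 0.5, "C": 0.2},   # assumed: first list
        "H-": {"A": 0.3, "R": 0.2, "N": 0.4, "D": 0.0, "C": 0.1}}   # assumed: second list

path, prob = viterbi("NARNRDCCRN", states, start, trans, emit)
print(" ".join(path), prob)
```

Under these assumptions every N must be emitted from H- and every D from H+, because the other state emits them with probability 0; this gives a quick sanity check on any hand computation. For long sequences the products underflow, and a log-probability version would be the usual remedy.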
ANSWERS

ANSWER to EXERCISE 1
Look in book.

ANSWER to EXERCISE 2
Divide each entry p2[i,j] by p1[i]*p1[j] and look where this ratio deviates considerably from 1.

ANSWER to EXERCISE 3: The Jukes-Cantor Correction
As the time of divergence between two sequences increases, the probability of a second substitution at any one nucleotide site increases, and the increase in the count of differences slows down. This makes these counts an undesirable measure of distance; in some way, this slowdown must be accounted for. The solution to this problem was first noted by Jukes and Cantor (1969; Evol. of Protein Molecules, Academic Press). Instead of calculating distance as a simple count, take the distance as

$$K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$$

where D is the observed proportion of sites that differ, with variance

$$\mathrm{Var}(K) = \frac{D(1-D)}{n\left(1 - \frac{4}{3} D\right)^{2}}$$

for sequences of length n (Kimura and Ohta 1972; J. Mol. Evol. 2:87-90). A plot of this function for the same range of parameters as in Figure 1 is given in Figure 2. This figure shows that this distance measure increases linearly with time (this is one property that is desirable for a distance measure). This is termed the Jukes & Cantor correction to distance; it makes the corrected distance a logarithmic function of the observed proportion of differences D.

Observe the large increase in the variance as time increases. As D gets closer and closer over time to 0.75, the variance increases. In the limit as D approaches 0.75, the variance approaches infinity. This is an indication that the measure of distance becomes increasingly less reliable as time increases.

Note that in expectation D is less than 0.75, but in reality D can be greater than or equal to 0.75. If this is the case, then a Jukes-Cantor correction cannot be done: it is undefined because the argument of the logarithm will be zero or negative. In this case you can apply a method developed by Tajima (1993, MBE 10:677-688). He suggests using the modified estimator
[estimator equation missing] where [definition missing] and [definition missing], with variance [equation missing].

This is actually just a different formulation of the same quantity, using a Taylor series expansion to avoid the logarithm. This estimator of distance is defined for all parameter values and actually has less bias than Jukes and Cantor's original correction for small levels of divergence.
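The Jukes-Cantor correction and its large-sample variance can be evaluated numerically as in the Python sketch below. It assumes the standard forms K = -(3/4) ln(1 - 4D/3) and Var(K) = D(1-D) / (n (1 - 4D/3)^2), with D the observed proportion of differing sites and n the sequence length; the function name jukes_cantor is a label chosen for this sketch. The example call uses the numbers from Exercise 3c.

```python
import math


def jukes_cantor(differences, length):
    """Return (K, Var(K)) in substitutions per site under the Jukes-Cantor model."""
    d = differences / length          # observed proportion of differing sites
    if d >= 0.75:
        # The argument of the logarithm would be zero or negative.
        raise ValueError("Jukes-Cantor correction is undefined for d >= 0.75")
    arg = 1.0 - 4.0 * d / 3.0
    k = -0.75 * math.log(arg)
    var = d * (1.0 - d) / (length * arg ** 2)
    return k, var


k, var = jukes_cantor(150, 1000)
print(f"K = {k:.4f} +/- {math.sqrt(var):.4f} substitutions per site")
# prints: K = 0.1674 +/- 0.0141 substitutions per site
```

Multiplying both numbers by the sequence length (1000) expresses the same estimate as a count of substitutions rather than a per-site rate.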
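a. Jukes-Cantor
b. Kimura 2-parameter model
c. $d = 150/1000 = 0.15$ => $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right) = 0.1674$ substitutions per site, i.e. $nK = 167.3577$ over $n = 1000$ sites; $\mathrm{Var}(K) \approx 1.9922 \times 10^{-4}$ => error $= n\sqrt{\mathrm{Var}(K)} = 1000\sqrt{1.9922 \times 10^{-4}} \approx 14.1145$ => $K = 167 \pm 14$ substitutions per 1000 sites (0.167 +/- 0.014 per site)
d. Setting $\mathrm{Var}(K) = K$ gives $n = -\frac{4d(1-d)}{3\left(1 - \frac{4}{3}d\right)^{2}\ln\left(1 - \frac{4}{3}d\right)}$; the variance exceeds the estimate K when n is smaller than this value.

ANSWER to EXERCISE 4
a. See book.
b. See book pages 55-58, esp. p. 58.
c.
F - I L M
|   |   |
F E I - M

ANSWER to EXERCISE 5
See book page 75.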