Incorporating dependence into models for DNA motifs
1 Incorporating dependence into models for DNA motifs. Terry Speed & Xiaoyue Zhao, University of California at Berkeley and Department of Human Genetics, UCLA. May 17,
2 The objects of our study DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units. DNA: units are the nucleotide residues A, C, G and T RNA: units are the nucleotide residues A, C, G and U Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure. 2
4 Motifs - Sites - Signals - Domains For this talk, I'll use these terms interchangeably to describe recurring elements of interest to us. In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants,... In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres,... 3
4 Why (probability) models for biomolecular motifs? to characterize them to help identify them for incorporation into larger models, e.g. for an entire gene 4
5 Motifs and models Motifs typically represent regions of structural significance with specific biological function. Are generalisations from known examples. The models can be highly specific. Multiple models can be used to give higher sensitivity & specificity in their detection. Can sometimes be generated automatically from examples or multiple alignments. 5
6 The use of models for motifs Can be descriptive, predictive, or anything in between... almost business as usual. However, stochastic mechanisms should never be taken literally, but nevertheless they can be amazingly useful. Care is always needed: a model or method can break down at any time without notice. Biological confirmation of predictions is almost always necessary. 6
7 Transcription initiation in E. coli In E. coli transcription is initiated at the promoter, and the sequence of the promoter is recognised by the Sigma factor of RNA polymerase. 7
8 Determinism 1: consensus sequences

  σ factor   Promoter consensus sequence
  σ70        TTGACA  TATAAT
  σ28        CTAAA   CCGATAT

Similarly for σ32, σ38 and σ54. Consensus sequences have the obvious limitation: there is usually some deviation from them. 8
9 The human transcription factor Sp1 has 3 Cys-Cys-His-His zinc finger DNA binding domains 9
10 Determinism 2: regular expressions The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has regular expression C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model. 1SP1: KKFACPECPKRFMRSDHLSKHIKTHQNKK 10
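The pattern translates directly into standard regular-expression syntax (as on the next slide, X(2,4) becomes .{2,4}, and the bracketed residue class carries over unchanged). A quick self-contained Python check against the 1SP1 sequence quoted above:

```python
import re

# The Cys-Cys-His-His zinc finger pattern C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H,
# rewritten in regex syntax ("X" = any amino acid = ".").
ZF_PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

seq_1sp1 = "KKFACPECPKRFMRSDHLSKHIKTHQNKK"

m = ZF_PATTERN.search(seq_1sp1)
if m:
    print(f"match at {m.start()}-{m.end()}: {m.group(0)}")
```

The match starts at the first Cys of the domain; the flexible gaps (.{2,4}, .{3,5}) are what a plain consensus string cannot express.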
11 Searching with regular expressions c.{2,4}c...[livmfywc]...h.{3,5}h PatternFind output [ISREC-Server] Date: Wed Aug 22 13:00:41 MET gp AF AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurrences: 2 Position : 514 CYICKASCSSQQEFQDHMSEPQH Position : 606 CTVCNRYFKTPRKFVEHVKSQGH... 11
12 Regular expressions can be limiting The regular expression syntax is still too rigid to represent many highly divergent protein motifs. Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches. We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position. 12
13 Cys-Cys-His-His profile: sequence logo form A sequence logo is a scaled position-specific a.a. distribution. Scaling is by a measure of a position's information content. (Note that we've lost the option of variable spacing.) 13
14 Weight matrix model (WMM) = Stochastic consensus sequence. From counts of A, C, G and T at each position in 242 known σ70 sites we get relative frequencies f_bl; the weight matrix entries are log2(f_bl / p_b). Informativeness of a position: 2 + Σ_b p_bl log2 p_bl. (Table of counts and relative frequencies omitted.) 14
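The arithmetic behind one column of the matrix can be sketched as follows. The counts are hypothetical (the slide's table for the 242 σ70 sites is not reproduced here); the informativeness is the slide's 2 + Σ_b p_bl log2 p_bl:

```python
from math import log2

# Hypothetical counts of A, C, G, T at one motif position
# (the actual counts from the 242 sigma-70 sites are not recoverable here).
counts = {"A": 9, "C": 3, "G": 5, "T": 225}

total = sum(counts.values())
freqs = {b: n / total for b, n in counts.items()}  # relative frequencies f_bl

# Informativeness (information content) of the position, in bits:
#   2 + sum_b p_bl * log2(p_bl)
info = 2 + sum(p * log2(p) for p in freqs.values() if p > 0)
print(f"information content: {info:.3f} bits")
```

A position where one base dominates (here T) carries close to the maximum 2 bits; a uniform position carries 0.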
15 Interpretation of weight matrix entries. Candidate sequence CTATAATC..., aligned at a position. Hypotheses: S = site (and independence), R = random (equiprobable, independence). Then

log2 [pr(CTATAA | S) / pr(CTATAA | R)]
  = log2 [(.09 × .03 × .26 × .13 × .51 × .01) / (.25 × .25 × .25 × .25 × .25 × .25)]
  = (2 + log2 .09) + ... + (2 + log2 .01)

Generally, score s_bl = log2(f_bl / p_b), where l = position, b = base, p_b = background frequency. 15
16 Use of the matrix to find sites. Move the matrix along the sequence and score each window. Peaks should occur at the true sites. Of course, in general any threshold will have some false positive and false negative rate. (Figure: window scores along a sequence containing CTATAATC; matrix values omitted.)
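A minimal sketch of this window scan, assuming a hypothetical 6-position frequency matrix (chosen so the -10 consensus TATAAT stands out) and the equiprobable background:

```python
from math import log2

BG = 0.25  # equiprobable background frequency for each base

# Hypothetical position-specific frequencies for a 6 bp motif resembling the
# -10 box TATAAT; the real 242-site matrix from the slides is not reproduced.
freqs = [
    {"A": .05, "C": .05, "G": .05, "T": .85},
    {"A": .85, "C": .05, "G": .05, "T": .05},
    {"A": .05, "C": .05, "G": .05, "T": .85},
    {"A": .85, "C": .05, "G": .05, "T": .05},
    {"A": .55, "C": .15, "G": .15, "T": .15},
    {"A": .05, "C": .05, "G": .05, "T": .85},
]

def window_score(window):
    """Log-odds score of one window: sum over positions of log2(f_bl / p_b)."""
    return sum(log2(freqs[l][b] / BG) for l, b in enumerate(window))

def scan(seq):
    """Score every window of motif length along seq."""
    L = len(freqs)
    return [(i, window_score(seq[i:i + L])) for i in range(len(seq) - L + 1)]

seq = "GGCGCTATAATGCCGC"  # TATAAT planted at position 5
best_pos, best_score = max(scan(seq), key=lambda t: t[1])
```

The peak lands on the planted site; in practice one keeps all windows above a threshold c, trading false positives against false negatives.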
17 Modelling motifs: the next steps Missing from the weight matrix models of motifs are good ways of dealing with: Length distributions for insertions/deletions Local and non-local association of amino acids Hidden Markov Models (HMM) help with the first. Dealing with the second remains a hard unsolved problem, but we'll describe a start. 17
18 Hidden Markov Models. Processes {(S_t, O_t), t = 1, ...}, where S_t is the hidden state and O_t the observation at time t, such that

pr(S_t | O_{t-1}, S_{t-1}, O_{t-2}, S_{t-2}, ...) = pr(S_t | S_{t-1})
pr(O_t | S_t, O_{t-1}, S_{t-1}, O_{t-2}, S_{t-2}, ...) = pr(O_t | S_t, S_{t-1})

The basics of HMM were laid bare in a series of beautiful papers by L. E. Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day. 18
19 The algorithms. As the name suggests, with an HMM the series O = (O_1, O_2, O_3, ..., O_T) is observed, while the states S = (S_1, S_2, S_3, ..., S_T) are not. There are elegant algorithms for calculating pr(O | θ), argmax_θ pr(O | θ) in certain special cases, and argmax_S pr(S | O, θ). Here θ are the parameters of the model, e.g. transition and observation probabilities. 19
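As a sketch of the first of these computations, the forward algorithm below evaluates pr(O | θ) for a toy two-state HMM. Note it uses the more common emission assumption pr(O_t | S_t) rather than the slides' more general pr(O_t | S_t, S_{t-1}), and all parameter values are made up; a brute-force sum over all state paths confirms the recursion:

```python
from itertools import product

# Toy 2-state HMM over binary observations (all numbers hypothetical).
states = (0, 1)
init = [0.6, 0.4]                    # pr(S_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]     # pr(S_t = s | S_{t-1} = r)
emit = [[0.5, 0.5], [0.9, 0.1]]      # pr(O_t = o | S_t = s)

def forward(obs):
    """pr(O | theta) via the forward recursion: O(T * |S|^2) work."""
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

def brute_force(obs):
    """pr(O | theta) by summing over every state path: O(|S|^T) work."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

obs = (0, 1, 1, 0, 1)
lik = forward(obs)
```

The same dynamic-programming idea, with max replacing sum, gives the Viterbi computation of argmax_S pr(S | O, θ).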
20 Profile HMM = stochastic regular expressions M = Match state, I = Insert state, D = Delete state. To operate, go from left to right. I and M states output 20 amino acids; B, D and E states are silent.
21 How profile HMM are used. Instances of the motif are identified by calculating log{pr(sequence | M) / pr(sequence | B)}, where M and B are the motif and background HMM. Alignments of instances of the motif to the HMM are found by calculating argmax_states pr(states | instance, M). Estimation of HMM parameters is by calculating argmax_parameters pr(sequences | M, parameters). In all cases, we use the efficient HMM algorithms. 21
22 Pfam domain-hmm Pfam is a library of models of recurrent protein domains. They are constructed semi-automatically using profile hidden Markov models. Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated. See the Pfam web site. 22
23 Beyond independence Weight array matrices, Zhang & Marr (1993) consider dependencies between adjacent positions, i.e. (nonstationary) first-order Markov models. The number of parameters increases exponentially if we restrict to full higher-order Markovian models. Variable length Markov models, Rissanen (1986), Bühlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees. [The interpolated Markov models of Salzberg et al (1998) address the same problem.] 23
24 Our aim and some notation. L: length of the sequence motif. X_i: discrete random variable at position i, taking values from a finite set χ. Given a number of instances of a sequence motif x = (x_1, ..., x_L) of length L, we want a model for the probability P(x) of x. We denote by x_i^j (i < j) the sequence (x_j, x_{j-1}, ..., x_i) in reverse time order. 24
25 Variable Length Markov Models. Factorize P(x) in the usual telescopic way:

P(X_1 = x_1) Π_{l=2}^{L} P(X_l = x_l | X_1^{l-1} = x_1^{l-1}),

then simplify this using context functions c_l, l = 2, ..., L, to

P(X_1 = x_1) Π_{l=2}^{L} P(X_l = x_l | c_l(X_1^{l-1}) = c_l(x_1^{l-1})),

where c_l : x_1^{l-1} → x_{l-m}^{l-1} is suitably defined on (l-1)-tuples. 25
26 VLMM, cont. Here c_l : χ^{l-1} → ∪_{i=0}^{l-1} χ^i, and m = m_l is given by

m_l(x_1^{l-1}) = min {r : P(X_l = x | X_1^{l-1} = x_1^{l-1}) = P(X_l = x | X_{l-r}^{l-1} = x_{l-r}^{l-1}) for all x ∈ χ}.

The function c_l defines the sequence-specific context, and m_l defines the sequence-specific memory or order of the Markov property for position l. 26
27 VLMM: an illustrative example. A full set of 16 contexts of order 2, versus a pruned set of 12 contexts in which P(X_3 | X_2 = C, X_1 = G) = P(X_3 | X_2 = C), etc. (Figure: the two context trees.)
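The pruning step in this example can be sketched as a simple check on estimated conditional distributions: collapse the longer context when it adds essentially nothing. The counts and tolerance below are invented for illustration:

```python
# Sketch of the pruning test above: the context (X2=C, X1=G) can be collapsed
# to (X2=C) when P(X3 | X2=C, X1=G) is (approximately) P(X3 | X2=C).
# All counts are hypothetical.

def conditional(counts):
    """Normalize a dict of counts into a conditional distribution."""
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

# Counts of X3 following each context.
counts_CG = {"A": 30, "C": 10, "G": 40, "T": 20}   # context X2=C, X1=G
counts_C  = {"A": 90, "C": 30, "G": 120, "T": 60}  # context X2=C (any X1)

def prunable(long_ctx, short_ctx, tol=0.02):
    """True if the longer context adds (almost) nothing, so it can be pruned."""
    p_long, p_short = conditional(long_ctx), conditional(short_ctx)
    return all(abs(p_long[b] - p_short[b]) <= tol for b in p_long)

result = prunable(counts_CG, counts_C)
```

In a real VLMM fit the comparison is done with a proper model-selection criterion rather than a fixed tolerance, but the structure of the decision is the same.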
28 VLMM cont. A VLMM for a biomolecular motif of length L is specified by a distribution for X_1, and, for l = 2, ..., L, a constrained distribution for X_l given X_{l-1}, ..., X_1. That is, we need L-1 context functions, or trees. But, there is a difficulty here. 28
29 Sequence dependencies (interactions) are not always local 3-dimensional folding; DNA, RNA & protein interactions The methods outlined so far all fail to incorporate long-range ( 4 bp or a.a.) interactions. New model types are needed. 29
30 Modeling long-range dependency The principal work in this area is Burge & Karlin's (1997) maximal dependence decomposition (MDD). More recently, Cai et al (2000) and Barash et al (2003) used Bayes networks (BN). Ellrott et al (2002) optimized the sequence order in which a stationary Markov chain models the motif. We have adapted this last idea, to give permuted variable length Markov models (PVLMM). Potamianos & Jelinek (1998) have related work on decision trees (PVLMM(D)). 30
31 Maximal Dependence Decomposition MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used. The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc.), and the stopping rule. However, the result is always a single tree. 31
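The dependence measure itself is just a Pearson chi-square on a contingency table between two positions. A minimal sketch, with toy counts and a binary consensus/non-consensus split in the spirit of Burge & Karlin:

```python
# Pearson chi-square between two motif positions, each dichotomized into
# consensus vs non-consensus base. The counts are invented for illustration.

def chi_square(table):
    """Pearson chi-square statistic for a 2D contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Perfectly dependent positions: the statistic equals n.
dependent = [[10, 0], [0, 10]]
# Perfectly independent positions: the statistic is 0.
independent = [[5, 5], [5, 5]]

chi_dep = chi_square(dependent)    # 20.0
chi_ind = chi_square(independent)  # 0.0
```

MDD splits on the position whose summed chi-square against all other positions is largest, then recurses within each branch.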
32 Issues in modeling short motifs In any study of this kind, essential items are: the model class (e.g. VLMM) the way we search through the model class (e.g. by forward selection) the way we compare models when searching (e.g. by χ²), and finally, the way we assess the final model in relation to our aims (e.g. by cross-validation). We always need interesting, high-quality datasets. 32
33 Model classes and model search For illustrative purposes, we will compare WMM, WAM, MDD and PVLMM(decision) for TFBS and splice donor recognition. Here we search using a simple procedure: recursively choosing the best extension of a current model, or forward selection. Our slower alternative moves through the models using reversible jump Markov chain Monte Carlo (RJMCMC). 33
34 Model comparison We fit the models using maximum likelihood, and compare fitted models using both AIC and BIC, standard penalties for model complexity. Better than either of these two is approximate normalized maximum likelihood (NML), Barron et al (1998). We use mixture models for the data with Jeffreys (Dirichlet) priors. 34
35 A simple illustration: transcription factor binding sites These are of great interest, their signals are very weak, and we typically have only a few instances. We have studied 43 TFBS with effective length 9 and 20 instances. In 17/43 cases we are able to improve upon WMM, the current standard; in 26/43, we cannot. 35
36 20 instances of P$DOF3_01 GTCTAAAGCGT aattaaagtaa GACGAAAGCAA aattaaagtgc GTCTAAAGCga GCGAAAAGCGA GCGTAAAGCAG TAGAAAAGGCG aattaaagtac CACAAAAGCCC GCCCAAAgatc tgacaaagcgt GCGGAAAgatc aattaaagcaa CAAAAAAGGCG taaaaaaggct CAGCAAAGACg GGAAAAAGCAA AGCAAAAGTGC GCAGAAAGTCA 36
37 Modelling P$DOF3_01

  Model      Sn = Sp
  WMM        .15
  MDD        .60
  PVLMM(D)   .55

37
38 Now we touch on finding protein-coding genes 38
39 12 examples of 5′ splice (donor) sites exon TCGGTGAGT intron TGGGTGTGT CCGGTCCGT ATG GTAAGA TCT GTAAGT CAGGTAGGA CAGGTAGGG AAGGTAAGG AGGGTATGG TGGGTAAGG GAGGTTAGT CATGTGAGT 39
40 Sequence logo for human splice donor sites. (Figure: sequence logo showing the per-position distribution of A, C, G, T.)
41 Splice site dataset Human splice donor sequences from SpliceDB, Burset et al (2001):
- 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and 1
- 47,495 false donor sites from the set of all sequences which lie within 40 bp on both sides of the characteristic donor dinucleotide GT. 41
42 Part of a context (decision) tree for position -2 of a splice donor PVLMM Node #s:counts; Edge #s:split variables. Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A) 42
43 Parts of MDD trees for splice donors In each case, splits are into the most frequent nt vs the others. 43
44 Model assessment: Stand-alone splice site recognition M: motif model B: background model Given a sequence x = (x_1, ..., x_L), we predict x to be a motif (here splice donor) if log {P(x | M) / P(x | B)} > c, for a suitably chosen threshold value c. 44
45 Model assessment: terms TP: true positives, TN: true negatives, FP: false positives, FN: false negatives. Sensitivity (sn) and specificity (sp) are given by sn = TP / [TP + FN], sp = TP / [TP + FP]. A 5-fold cross-validation is used in assessing performance. 45
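The two measures can be sketched directly from the counts; the numbers below are illustrative (the FP count is invented). Note that the "specificity" used here, TP/(TP+FP), is what is elsewhere called precision or positive predictive value:

```python
# Sensitivity and specificity as defined on the slide, from prediction counts.
# The counts are illustrative only.

def sn_sp(tp, fn, fp):
    """sn = TP/(TP+FN); sp = TP/(TP+FP), the slide's 'specificity'
    (often called precision / positive predictive value)."""
    return tp / (tp + fn), tp / (tp + fp)

sn, sp = sn_sp(tp=362, fn=98, fp=40)
print(f"sn = {sn:.3f}, sp = {sp:.3f}")
```

Sweeping the threshold c of the log-likelihood-ratio rule traces out the sn-vs-sp curves compared on the next slide.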
46 Optimal permutation: Sp vs Sn Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM. 46
47 Model assessment: Integrated recognition For this assessment, we integrate the splice donor models into SLAM, Pachter et al (2002), a eukaryotic cross-species gene finder. The training data consists of 3,735 aligned human and mouse gene sequences. The resulting SLAM model is then tested on the Rosetta set of 117 single human gene sequences. 47
48 Results at the nucleotide level (VLMM(D), MDD, PVLMM(D); the sensitivity and specificity values did not survive transcription).

Results at the exon level:

            VLMM(D)   MDD      PVLMM(D)
  Correct   362/...   .../...  .../464
  Partial   84/460    83/465   78/464
  Wrong     14/460    17/465   15/464
  Missing   25/465    23/465   22/465

(Exon-level sensitivity and specificity fractions also only partially survive.) 48
49 Interpretation of PVLMM model selected We use sequence logos to provide simple interpretations of our selected PVLMM, including the optimal permutation. 49
50 Beginning of the splicing process splice donor splice acceptor 50
51 Long-range dependence in the chosen model. (Figure: U1 snRNA, G U C C A U U C A, aligned to the donor site, with the optimal permutation shown.)
52 The context tree for ... (figure)
53 Some future work Joint modelling of human and mouse sites Joint modelling of multiple motifs in one species Including indels and dependence 53
54 Acknowledgements Xiaoyue Zhao, UCB Mauro Delorenzi (ISREC) Sourav Chatterji, UCB The SLAM team: Simon Cawley, Affymetrix Lior Pachter, UCB Marina Alexandersson, FCC 54
56 References
- Biological Sequence Analysis, R Durbin, S Eddy, A Krogh and G Mitchison, Cambridge University Press, 1998.
- Bioinformatics: The machine learning approach, P Baldi and S Brunak, The MIT Press, 1998.
- Post-Genome Informatics, M Kanehisa, Oxford University Press.
57 Bayes networks (BN): an example Here each node corresponds to a sequence position, and the tree defines conditional independence constraints on the distribution. 57
58 To find genes, we need to model splice sites 58
59 Weight matrix models, Staden (1984). (Figure: a weight matrix for donor sites, rows A, C, G, T; values omitted.) Essentially a mutual independence model. An improvement over the consensus CAGGTAAGT. 59
60 Transcription factor binding site (TFBS) recognition We extracted all known TFBS from the TRANSFAC database with a) length 9, and b) 20 known sites. In all, this gave 1,419 sites corresponding to 43 TF. Next we randomly inserted each site into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes. Finally, we used the PVLMM, MDD and WMM to scan these sequences within a 10-fold cross-validation framework, to select a number of top-scoring sequences as putative binding sites. We always made this number equal to the true number in the sequences, and so sn = sp. 60
61 Human Transcription Factor Binding Sites 61
62 TFBS: three results

  TF            wmm   pvlmm   mdd
  P$DOF3.01     ...   ...     ...
  V$CIZ.01      ...   ...     ...
  P$EMBP1.Q2    ...   ...     ...

Entry: sensitivity/specificity (the numeric entries did not survive transcription). Of the 43 TF, our dependence methods led to no improvement in 27 cases. 62
63 TFBS: more results 63
I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationApplications of Hidden Markov Models
18.417 Introduction to Computational Molecular Biology Lecture 18: November 9, 2004 Scribe: Chris Peikert Lecturer: Ross Lippert Editor: Chris Peikert Applications of Hidden Markov Models Review of Notation
More informationSOME CHALLENGES IN COMPUTATIONAL BIOLOGY
SOME CHALLENGES IN COMPUTATIONAL BIOLOGY M. Vidyasagar Advanced Technology Centre Tata Consultancy Services 6th Floor, Khan Lateefkhan Building Hyderabad 500 001, INDIA sagar@atc.tcs.co.in Keywords: Computational
More informationBME 5742 Biosystems Modeling and Control
BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationGiri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/8/07 CAP5510 1 Pattern Discovery 2/8/07 CAP5510 2 Patterns Nature
More informationMotivating the need for optimal sequence alignments...
1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use
More informationQB LECTURE #4: Motif Finding
QB LECTURE #4: Motif Finding Adam Siepel Nov. 20, 2015 2 Plan for Today Probability models for binding sites Scoring and detecting binding sites De novo motif finding 3 Transcription Initiation Chromatin
More informationHMM : Viterbi algorithm - a toy example
MM : Viterbi algorithm - a toy example 0.6 et's consider the following simple MM. This model is composed of 2 states, (high GC content) and (low GC content). We can for example consider that state characterizes
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationHYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH
HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi
More informationINTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA
INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology
More informationComputational Biology: Basics & Interesting Problems
Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information
More informationProtein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its
More informationCISC 636 Computational Biology & Bioinformatics (Fall 2016)
CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein
More informationSequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos
Sequence Analysis BBSI 2006: Lecture #(χ+1) Takis Benos (2006) Molecular Genetics 101 What is a gene? We cannot define it (but we know it when we see it ) A loose definition: Gene is a DNA/RNA information
More informationEvolutionary Models. Evolutionary Models
Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment
More informationCSEP 590A Summer Lecture 4 MLE, EM, RE, Expression
CSEP 590A Summer 2006 Lecture 4 MLE, EM, RE, Expression 1 FYI, re HW #2: Hemoglobin History Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm
More informationCSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators
CSEP 59A Summer 26 Lecture 4 MLE, EM, RE, Expression FYI, re HW #2: Hemoglobin History 1 Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm
More informationBioinformatics 2 - Lecture 4
Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what
More informationCSCE555 Bioinformatics. Protein Function Annotation
CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The
More informationMitochondrial Genome Annotation
Protein Genes 1,2 1 Institute of Bioinformatics University of Leipzig 2 Department of Bioinformatics Lebanese University TBI Bled 2015 Outline Introduction Mitochondrial DNA Problem Tools Training Annotation
More informationHidden Markov models in population genetics and evolutionary biology
Hidden Markov models in population genetics and evolutionary biology Gerton Lunter Wellcome Trust Centre for Human Genetics Oxford, UK April 29, 2013 Topics for today Markov chains Hidden Markov models
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationGenomics and bioinformatics summary. Finding genes -- computer searches
Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence
More informationHidden Markov Models. Three classic HMM problems
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationChapter 4: Hidden Markov Models
Chapter 4: Hidden Markov Models 4.1 Introduction to HMM Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Overview Markov models of sequence structures Introduction to Hidden Markov
More informationPrediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines
Article Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Yun-Fei Wang, Huan Chen, and Yan-Hong Zhou* Hubei Bioinformatics and Molecular Imaging Key Laboratory,
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More informationLecture 7: Simple genetic circuits I
Lecture 7: Simple genetic circuits I Paul C Bressloff (Fall 2018) 7.1 Transcription and translation In Fig. 20 we show the two main stages in the expression of a single gene according to the central dogma.
More informationLecture 18 June 2 nd, Gene Expression Regulation Mutations
Lecture 18 June 2 nd, 2016 Gene Expression Regulation Mutations From Gene to Protein Central Dogma Replication DNA RNA PROTEIN Transcription Translation RNA Viruses: genome is RNA Reverse Transcriptase
More informationLecture 3: Markov chains.
1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.
More informationROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY
BIOINFORMATICS Lecture 11-12 Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. IGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY These lecture notes are prepared
More informationAnomaly Detection for the CERN Large Hadron Collider injection magnets
Anomaly Detection for the CERN Large Hadron Collider injection magnets Armin Halilovic KU Leuven - Department of Computer Science In cooperation with CERN 2018-07-27 0 Outline 1 Context 2 Data 3 Preprocessing
More informationLecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010
Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition
More information