Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson), February 5, 2015

11 Motif Finding

Sources for this section:

- Rouchka, 1997, A Brief Overview of Gibbs Sampling.
- J. Buhler, M. Tompa: Finding motifs using random projections, RECOMB 2001, 69-75.
- P.A. Pevzner and S.-H. Sze, Combinatorial approaches to finding subtle signals in DNA sequences, ISMB 2000, 269-278.

The goal of motif finding is the detection of new, unknown signals in a set of sequences. For example, assume we have a collection of genes that appear to be co-regulated and have the same pattern of expression. We would like to search for a transcription factor binding site that occurs in the upstream regions of all the genes.

[Figure: upstream regions of the genes, labeled "search for common motif here"]

The aim is to compute something like this, where we highlight a motif that occurs in all given DNA sequences:

[Figure: DNA sequences with the shared motif highlighted]

11.1 The Gibbs sampling method

We will first describe an algorithm based on the Gibbs-sampler. Basic assumptions:

- Input is a set of DNA sequences $s_1, \dots, s_m$.
- Each sequence contains a copy or instance of the motif.
- The motif has a fixed length $l$.
- The motif is generated by a position weight matrix.

11.1.1 Sequence generating models

We describe two models for randomly generating DNA sequences:

1. The background model $B$ is given by a vector that specifies a probability $B(c)$ for each of the four character states $c$.
2. The motif model $M$ specifies for each position $i = 1, 2, \dots, l$ a probability $M(i, c)$ for each of the four states $c$ for a word of length $l$:

          A    C    G    T
    1
    2
    3
    ...
    l

We will refer to this as a position weight matrix, although usually it is required that a position weight matrix contains the logarithms of the probabilities rather than the probabilities themselves.

11.1.2 Simple computations

If we know both $M$ and $B$, then we can score any word $w$ of length $l$. In more detail, with

$$P(w) = \prod_{i=1}^{l} B(w[i]) \quad\text{and}\quad Q(w) = \prod_{i=1}^{l} M(i, w[i])$$

we can define the log-odds ratio as

$$R(w) = \log \frac{Q(w)}{P(w)}.$$

If $R(w) > 0$, it is more likely that the word $w$ was generated by the motif model. If $R(w) < 0$, then it is more likely that $w$ was generated by the random model.

If both $M$ and $B$ are given, then we can make a maximum likelihood prediction of the locations of the motif in each of the input sequences:

[Figure: sequences $s_1, s_2, s_3, \dots, s_m$ with motif locations $a_1, a_2, a_3, \dots, a_m$]

In each sequence $s_i$, choose a location $a_i$ such that the word $w$ starting at $a_i$ maximizes $R(w)$.

Vice versa, if we are given the locations $a_1, \dots, a_m$ of the instances of the motif in $s_1, \dots, s_m$, then we can set up $M$ and $B$ for all states $c$ and positions $i$ as follows:

$$B(c) = \frac{\#\text{ occurrences of state } c \text{ outside of motif}}{\#\text{ all bases outside of motif}}$$

and

$$M(i, c) = \frac{\#\text{ occurrences of } c \text{ at the } i\text{-th position of the motif}}{m}.$$

For example, consider the following sequences and motif instances:

A C G T A A G C G T T A A C T T T G

We have $B = \dots$ and $M = \dots$ [the motif highlighting and the resulting values are not recoverable].

Note: to avoid zeros in either distribution, pseudocounts are added to all counts.

The maximum likelihood score of a choice of motif locations $a_1, \dots, a_m$ is defined as follows:

$$L(a_1, \dots, a_m) = L(a_1, \dots, a_m, M, B) = P(s_1, \dots, s_m \mid a_1, \dots, a_m, M, B)$$

$$= \prod_{i=1}^{m} \left( \prod_{j=1}^{a_i - 1} B(s_i[j]) \right) \left( \prod_{j=a_i}^{a_i + l - 1} M(j - a_i + 1, s_i[j]) \right) \left( \prod_{j=a_i + l}^{|s_i|} B(s_i[j]) \right),$$

where $|s_i|$ is the length of $s_i$. In this formula, we use either the background model or the motif model to assign a probability to a given position in a given sequence, depending on whether the position is outside of, or inside of, the current placement of the motif, respectively. Inside the motif, sequence position $j$ corresponds to motif position $j - a_i + 1$.

11.1.3 Motif-finding dilemma

Motif-finding dilemma:

- If we know $M$ and $B$, then we can determine the most likely motif locations $a_1, \dots, a_m$ in $s_1, \dots, s_m$.

- If we know $a_1, \dots, a_m$, then we can determine $M$ and $B$.

In motif finding we know neither and therefore want to infer both.

Idea: Construct $M$ and $B$ from the $a_i$'s of $m - 1$ sequences and then use $M$ and $B$ to determine a value for $a_i$ for the remaining sequence $s_i$. Repeat.

The Gibbs-sampler is based on this idea. This method can be interpreted as an MCMC approach in which proposed new solutions are always accepted.

11.1.4 Gibbs sampling algorithm

Algorithm 11.1.1 (Gibbs-sampler-based motif finding)
Input: Sequences $s_1, \dots, s_m$
Output: Motif $M$, background $B$, locations $a_1, \dots, a_m$
Init.: Choose $a_1, \dots, a_m$ either randomly or using an algorithm, as described later.
repeat
    for $h = 1, \dots, m$ do
        Compute $M$ and $B$ from $a_1, \dots, a_{h-1}, a_{h+1}, \dots, a_m$
            (i.e., from data with the $h$-th sequence and motif removed)
        Using $M$ and $B$, compute a new location $a_h$ in $s_h$    (*)
        Compute $M$ and $B$ from $a_1, \dots, a_m$
            (i.e., from data with the new choice of the $h$-th motif location inserted)
        Compute score $L(a_1, \dots, a_m)$
    end
until score $L(a_1, \dots, a_m)$ has converged
Return $M$, $B$ and $a_1, \dots, a_m$

In line (*), choose the new value $a$ for $a_h$ probabilistically according to the distribution of normalized log-odds scores:

$$\frac{R(a)}{\sum_{a'=1}^{|s_h| - l + 1} R(a')}.$$

Here, $R(a)$ is the score obtained for $M$ and $B$ applied to sequence $s_h$ and motif location $a$.

11.1.5 Example

We will now discuss an example taken from Rouchka (1997)¹. The motif length is $l = 6$. Here we show the input sequences and the initial choice of locations (in red):

[Figure: input sequences with the initial choice of motif locations]

¹ Rouchka, 1997, A Brief Overview of Gibbs Sampling

The number of A's in all of the sequences combined is 327, the number of C's is 317, the number of G's is 272, and the number of T's is 304. To avoid zero entries in the motif matrix $M$, we will use 10% of each of the background counts as pseudocounts that are always added to the counts obtained for the entries of $M$ when calculating the frequencies. The pseudocounts to be added to values in $M$ are thus 32.7, 31.7, 27.2 and 30.4 for A, C, G and T, respectively. We will not look at the details of this, but this explains why the tables containing observed counts don't translate directly into the tables containing frequencies.

In the first iteration of the main loop we remove the first sequence $s_1$ = TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT. Here are $M$ and $B$ with the first sequence removed:

[Tables: $M$ and $B$ computed with the first sequence removed]

We then score all $36 = 41 - 6 + 1$ possible words of length $l$ in $s_1$ = TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT.
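The scoring of the candidate locations in $s_1$ can be sketched in Python as follows. The background counts and the 10% pseudocounts are taken from the text. The motif instances in `INSTANCES` are hypothetical placeholders, since the original tables are not reproduced here, and all function names are illustrative. Two further assumptions of the sketch: the pseudocount-corrected frequency is computed as (count + pseudocount) / (number of instances + total pseudocount), and the sampling weights are the normalized odds ratios $Q(w)/P(w) = e^{R(w)}$ rather than the log-odds themselves, so that every weight is non-negative.

```python
import math

S1 = "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT"  # first input sequence
MOTIF_LEN = 6

# Base counts quoted in the text; 10% of each count serves as a pseudocount.
COUNTS = {"A": 327, "C": 317, "G": 272, "T": 304}
PSEUDO = {c: 0.1 * n for c, n in COUNTS.items()}
B = {c: n / sum(COUNTS.values()) for c, n in COUNTS.items()}

# Hypothetical motif instances of the remaining sequences (placeholders,
# not the instances used in Rouchka's example).
INSTANCES = ["TTTATC", "TTCATC", "ATTATC", "TTTACC", "TTTATG"]


def motif_matrix(instances, pseudo):
    """Estimate M(i, c) from motif instances with pseudocounts added."""
    denom = len(instances) + sum(pseudo.values())
    return [
        {c: (sum(w[i] == c for w in instances) + pseudo[c]) / denom
         for c in pseudo}
        for i in range(len(instances[0]))
    ]


def log_odds(word, M, B):
    """R(w) = log(Q(w) / P(w)) for a word as long as the motif."""
    return sum(math.log(M[i][c] / B[c]) for i, c in enumerate(word))


def location_weights(seq, M, B):
    """Normalized odds ratios exp(R) for every start position in seq."""
    l = len(M)
    odds = [math.exp(log_odds(seq[a:a + l], M, B))
            for a in range(len(seq) - l + 1)]
    total = sum(odds)
    return [x / total for x in odds]


M = motif_matrix(INSTANCES, PSEUDO)
weights = location_weights(S1, M, B)
print(len(weights))  # one weight per candidate location: 41 - 6 + 1 = 36
```

Because the total pseudocount (about 122) is much larger than the number of placeholder instances, the rows of `M` stay close to the background frequencies, which illustrates why observed counts and reported frequencies can look so different.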

[Table: scores of the 36 candidate words in $s_1$, including a column headed "normalized $A_x$"]

The column headed "normalized $A_x$" contains the normalized log-odds scores for each choice of motif location in $s_1$. The algorithm chooses one of the locations according to the given distribution of log-odds scores.

We then repeat the main loop and remove the second sequence, etc. This is repeated until the score converges or until a set maximum number of iterations has been performed. Here is the final choice of motif:

[Figure: final choice of motif locations]
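The complete procedure of Algorithm 11.1.1 can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the implementation used in the example: locations are 0-based, a pseudocount of 1 is added to every count, new locations are drawn with weights proportional to the odds ratios $e^{R(a)}$, and the loop runs for a fixed number of sweeps while remembering the best-scoring configuration instead of testing for convergence. The function names and the toy sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def estimate_models(seqs, locs, l, skip=None, pseudo=1.0):
    """Estimate motif model M and background B from the motif locations.

    The sequence with index `skip` (if given) is left out entirely.
    """
    motif = [{c: pseudo for c in ALPHABET} for _ in range(l)]
    back = {c: pseudo for c in ALPHABET}
    for h, (s, a) in enumerate(zip(seqs, locs)):
        if h == skip:
            continue
        for j, c in enumerate(s):
            if a <= j < a + l:
                motif[j - a][c] += 1  # position inside the motif window
            else:
                back[c] += 1          # position outside the motif window
    M = [{c: row[c] / sum(row.values()) for c in ALPHABET} for row in motif]
    total = sum(back.values())
    B = {c: back[c] / total for c in ALPHABET}
    return M, B


def log_odds(word, M, B):
    """R(w) = log(Q(w) / P(w))."""
    return sum(math.log(M[i][c] / B[c]) for i, c in enumerate(word))


def log_likelihood(seqs, locs, M, B):
    """log L(a_1, ..., a_m): background outside, motif model inside."""
    l = len(M)
    return sum(sum(math.log(B[c]) for c in s) + log_odds(s[a:a + l], M, B)
               for s, a in zip(seqs, locs))


def gibbs_motif_search(seqs, l, sweeps=200, seed=0):
    rng = random.Random(seed)
    locs = [rng.randrange(len(s) - l + 1) for s in seqs]
    best_locs, best_score = list(locs), -math.inf
    for _ in range(sweeps):
        for h, s in enumerate(seqs):
            # Leave sequence h out, then resample its motif location.
            M, B = estimate_models(seqs, locs, l, skip=h)
            odds = [math.exp(log_odds(s[a:a + l], M, B))
                    for a in range(len(s) - l + 1)]
            locs[h] = rng.choices(range(len(odds)), weights=odds)[0]
        M, B = estimate_models(seqs, locs, l)
        score = log_likelihood(seqs, locs, M, B)
        if score > best_score:
            best_locs, best_score = list(locs), score
    M, B = estimate_models(seqs, best_locs, l)
    return M, B, best_locs, best_score


# Hypothetical toy input: four short sequences and motif length 7.
toy = ["TTGACCTACGTTGCA", "CCATACGTTGAGGTC",
       "GGTACGTTGTTCAAC", "ATCCGATACGTTGGA"]
M, B, locs, score = gibbs_motif_search(toy, 7, sweeps=50, seed=1)
print(locs, round(score, 2))
```

The log-likelihood uses the identity $\log L = \sum_i \left( \sum_j \log B(s_i[j]) + R(w_i) \right)$, where $w_i$ is the word at location $a_i$, which follows directly from the product formula for $L$. Since the sampler is randomized, different seeds may return different locations.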

11.2 The Projection Method

We will now describe the projection method. Again, the computational problem is to determine a motif by analyzing a set of sequences that contain instances of the motif. Here is another way to formalize the motif-finding problem (Pevzner and Sze):

Planted $(l, d)$-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence $M$ (the motif) of length $l$. The problem is to determine $M$, given $m$ sequences, each of length $n$ and each containing a planted variant of $M$. More precisely, each such planted variant is a substring of length $l$ that differs from $M$ in at most $d$ positions.

As an instance of the Planted $(l, d)$-Motif Problem, assume the motif is ACAGGATCA. The following 4 sequences each contain a planted version of this motif:

AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACAGGATCACT
GGATATACAGGATCACGGTGGGAAAACCCTGAC

Notice that some variants fully agree on a subset of the positions of the full motif:

ACAGGcTCc
AtAGcATCA
ACAGGATCA
ACAGGATCA

The key idea is to repeatedly choose $k$ random positions from the $l$ positions of the motif and then to hash words from the sequences using only those positions, in effect projecting to those $k$ positions:

ACAGGATCA --project--> AAGTC

If we happen to choose a set of $k$ positions that are well conserved by the motif, then we will obtain a hash bucket that contains a higher-than-expected number of hits.

In other words, to address the Planted $(l, d)$-Motif Problem, the key idea is to choose $k$ of the $l$ positions at random, then to use the $k$ selected positions of each $l$-mer $x$ in a hash function $h(x)$. When a larger-than-expected number of $l$-mers hash to the same bucket, then it is likely that the bucket is enriched for instances of the planted motif $M$:

[Figure: instances 1 to 4 of the planted motif in sequences $S_1, \dots, S_4$; some of the instances hash to the same bucket]

(Here, for each instance $i$ of the planted motif $M$, x's mark the $d = 3$ substitutions and o's mark the $k = 2$ positions used in hashing.)

Like many probabilistic algorithms, the projection algorithm performs a number of independent runs of a basic iteration. In each such trial, it chooses a random projection $h$ and hashes each $l$-mer $x$ in the input sequences to its bucket $h(x)$. Any hash bucket with a high number of entries is explored as a source of the planted motif, in a refinement step, as described below.
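One trial of the hashing step can be sketched in Python as follows, using the four example sequences with the planted motif ACAGGATCA. The function names and the `threshold` parameter are illustrative assumptions. For a reproducible demonstration the sketch fixes the projection to motif positions 1, 3, 4, 7 and 8 (0-based: 0, 2, 3, 6, 7), which maps ACAGGATCA to AAGTC; in the actual method the positions are drawn at random, as `random_projection` does.

```python
import random
from collections import defaultdict

SEQS = [
    "AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC",
    "ATGATAGCATCAACCTAACCCTAGATATGGGAT",
    "TTTTGGGATATATCGCCCCTACACAGGATCACT",
    "GGATATACAGGATCACGGTGGGAAAACCCTGAC",
]


def random_projection(l, k, rng=random):
    """Choose k of the l motif positions at random (0-based, sorted)."""
    return sorted(rng.sample(range(l), k))


def hash_lmers(seqs, l, positions):
    """Map each projected l-mer to a list of (sequence index, start) pairs."""
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        for a in range(len(s) - l + 1):
            key = "".join(s[a + p] for p in positions)
            buckets[key].append((i, a))
    return buckets


def enriched_buckets(buckets, threshold):
    """Keep only the buckets with at least `threshold` entries."""
    return {key: hits for key, hits in buckets.items()
            if len(hits) >= threshold}


positions = [0, 2, 3, 6, 7]  # fixed here; normally random_projection(9, 5)
buckets = hash_lmers(SEQS, 9, positions)
print(buckets["AAGTC"])
print(sorted(enriched_buckets(buckets, 4)))
```

With this projection, the bucket AAGTC contains the planted instance from each of the four sequences, because the substituted positions of the variants (2 and 5 in one, 6 and 9 in the other) are not among the hashed positions. Unrelated background words may land in the same bucket by chance.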

If $k < l - d$, then there is a good chance that some of the planted instances of $M$ will be hashed to the same bucket (the planted bucket), namely all planted instances for which the $k$ hash positions and the $d$ substituted positions are disjoint. So, there is a good chance that the planted bucket will be enriched for the planted motif, and will contain more entries than an average bucket.

11.2.1 Example

Assume we are given the sequences

         1234567
    s_1  cagtaat
    s_2  ggaactt
    s_3  aagcaca

Let $M$ = aaa be the (unknown) $(3, 1)$-motif. Hashing using the first two positions of every word of length 3 we get the following hash table:

    h(x)  pos.
    aa    (1,5), (2,3), (3,1)
    ac    (2,4), (3,5)
    ag    (1,2), (3,2)
    at    (1,6)
    ca    (1,1), (3,4), (3,6)
    cc
    cg
    ct    (2,5)
    ga    (2,2)
    gc    (3,3)
    gg    (2,1)
    gt    (1,3)
    ta    (1,4)
    tc
    tg
    tt    (2,6)

The motif $M$ is planted at positions (1,5), (2,3), (3,1) and (3,5), and in this example three of the four instances hash to the planted bucket $h(M)$ = aa.

11.2.2 Choosing the parameters

Obviously, the algorithm does not know which bucket is the actual planted bucket. So it considers every bucket that contains at least $s$ elements, where $s$ is a suitable threshold.

The algorithm has three main parameters: the projection size $k$, the bucket (inspection) threshold $s$, and the number of independent runs $t$. In the following, we will discuss how to choose each of these parameters.

Projection size: Ideally, the algorithm should hash a significant number of instances of the motif into the planted bucket, while avoiding contamination of the planted bucket by random background $l$-mers. To minimize the contamination of the planted bucket, we must choose $k$ large enough. How large must we choose $k$ so that the average bucket will contain less than one random $l$-mer? Since we are hashing $m(n - l + 1)$ $l$-mers into $4^k$ buckets, if we choose $k$ such that $4^k > m(n - l + 1)$, then the average bucket will contain less than one random $l$-mer.

For example, in a Planted $(15, 4)$-Motif Problem with $m = 20$ and $n = 600$, we must choose $k$ to satisfy

$$k < l - d = 15 - 4 = 11$$

and

$$k > \frac{\log(m(n - l + 1))}{\log(4)} = \frac{\log(20 \cdot (600 - 15 + 1))}{\log(4)} \approx 6.76.$$
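The two bounds on $k$ can be checked numerically. The following Python sketch (function names are illustrative) computes the lower bound and lists all integer projection sizes that satisfy both $k < l - d$ and $4^k > m(n - l + 1)$, using integer arithmetic for the second condition to avoid rounding issues.

```python
import math


def lower_bound_k(m, n, l):
    """Real-valued bound: k must exceed log(m (n - l + 1)) / log(4)."""
    return math.log(m * (n - l + 1)) / math.log(4)


def admissible_k(l, d, m, n):
    """All integers k with k < l - d and 4**k > m (n - l + 1)."""
    total_lmers = m * (n - l + 1)
    return [k for k in range(1, l - d) if 4 ** k > total_lmers]


print(round(lower_bound_k(20, 600, 15), 2))  # 6.76
print(admissible_k(15, 4, 20, 600))          # [7, 8, 9, 10]
```

An empty list signals that no projection size satisfies both conditions for the given input.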

If the total number of sequences is very large, then it may be that one cannot choose $k$ to satisfy both $k < l - d$ and $4^k > m(n - l + 1)$. In this case, set $k = l - d - 1$, as large as possible.

Bucket threshold: A bucket threshold of $s = 3$ or $4$ is practical, as we should not expect too many instances to hash to the same bucket in a reasonable number of runs.

Number of independent runs: We want to choose $t$ so that the probability is at least $q = 0.95$ that the planted bucket contains $s$ or more planted motif instances in at least one of the $t$ runs. Let $\hat{p}(l, d, k)$ be the probability that a given planted motif instance hashes to the planted bucket, that is:

$$\hat{p}(l, d, k) = \frac{\binom{l - d}{k}}{\binom{l}{k}}.$$

Then the probability that fewer than $s$ planted instances hash to the planted bucket in a given trial is $B_{m, \hat{p}(l, d, k)}(s)$. Here, $B_{m, p}(s)$ is the probability that there are fewer than $s$ successes in $m$ independent Bernoulli trials, each having probability $p$ of success. If the algorithm is run for $t$ runs, the probability that $s$ or more planted instances hash to the planted bucket in at least one of them is:

$$1 - \left( B_{m, \hat{p}(l, d, k)}(s) \right)^t.$$

For this to be $\geq q$, choose

$$t = \left\lceil \frac{\log(1 - q)}{\log\left(B_{m, \hat{p}(l, d, k)}(s)\right)} \right\rceil. \qquad (11.1)$$

Using this criterion for $t$, the choices for $k$ and $s$ above require at most thousands of runs, and usually many fewer, to produce a bucket containing sufficiently many instances of the planted motif.

11.2.3 Motif refinement

The main loop of the projection algorithm finds a set of buckets of size $\geq s$. In the subsequent refinement step each such bucket is explored in an attempt to recover the planted motif. The idea is that, if the current bucket is the planted bucket, then we have already found $k$ of the planted motif residues. These, together with the remaining $l - k$ residues, should provide a strong signal that makes it easy to obtain the motif in only a few iterations of refinement.

We will process each bucket of size $\geq s$ to obtain a candidate motif. Each of these candidates will be refined and the best refinement will be returned as the final solution. Candidate motifs are refined either using the Gibbs-sampler or a related approach called the EM algorithm.

In essence, the projection method is used to compute a good starting point and thus greatly speed up the convergence of a method such as the Gibbs-sampler algorithm.
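Returning to the choice of the number of runs in equation (11.1), the computation can be sketched in Python as follows. The function names are illustrative, and the parameter values in the usage line (a Planted $(15, 4)$ instance with $m = 20$, $k = 7$, $s = 4$) are chosen here only as an example. The sketch assumes $k \leq l - d$, so that $\hat{p} > 0$, and rounds $t$ up to the next integer.

```python
import math


def p_hat(l, d, k):
    """Probability that the k hash positions avoid all d substitutions."""
    return math.comb(l - d, k) / math.comb(l, k)


def prob_fewer_than(s, m, p):
    """B_{m,p}(s): probability of fewer than s successes in m Bernoulli(p) trials."""
    return sum(math.comb(m, i) * p ** i * (1 - p) ** (m - i)
               for i in range(s))


def number_of_runs(l, d, k, m, s, q=0.95):
    """Smallest t with 1 - B_{m,p_hat}(s)**t >= q, as in equation (11.1)."""
    b = prob_fewer_than(s, m, p_hat(l, d, k))
    return math.ceil(math.log(1 - q) / math.log(b))


t = number_of_runs(l=15, d=4, k=7, m=20, s=4)
print(t)
```

Lowering the threshold $s$ or the projection size $k$ increases the chance that enough planted instances share a bucket in one trial and therefore reduces $t$, at the price of more contaminated buckets to refine.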