Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar


Outline
  Gene Regulation
  Regulatory Motifs
  The Motif Finding Problem
  Brute Force Motif Finding
  Consensus and Pattern Branching: Greedy Motif Search
  PMS: Exhaustive Motif Search
  Comparison of Motif Finding Algorithms
  Presentation of Our Work on Motif Finding

Central Dogma: DNA -> RNA -> Protein
  DNA      CCTGAGCCAACTATTGATGAA
           (transcription)
  RNA      CCUGAGCCAACUAUUGAUGAA
           (translation)
  Protein  PEPTIDE

Splice Sites
Almost all donor splice sites exhibit GU, and almost all acceptor splice sites exhibit AG. Not all GUs and AGs are used as splice sites, so predictors exploit the fact that exons have higher GC content or that certain motifs are located nearby.

Transcription Start Sites (TSS)

Combinatorial Gene Regulation
A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed. How can one gene have such drastic effects?

Regulatory Proteins
Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF). The 20 unexpressed genes rely on gene X's TF to induce transcription. A single TF may regulate multiple genes.

Regulatory Regions
Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site. Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific to a given transcription factor. TFs influence gene expression by binding to a specific location (the TFBS) in the respective gene's regulatory region.

Transcription Factor Binding Sites A TFBS can be located anywhere within the Regulatory Region. TFBS may vary slightly across different regulatory regions since non-essential bases could mutate

MOTIF
A motif is a nucleotide or amino acid sequence pattern that has biological significance.
  It occurs repeatedly, either in the same molecule or in many molecules.
  It is found in upstream intergenic regions.
  Functionally important regions of a genome (gene) can be recognized by searching for such patterns.
  Motifs are used for locating binding sites and regulatory signals that control gene expression.
  They also help identify potential drug target sites.

Motifs and Transcriptional Start Sites
(Figure not reproduced: five genes, each preceded by a slightly different instance of the same motif: ATCCCG, TTCCGG, ATCCCG, ATGCCG, ATGCCC.)

Identifying Motifs
Genes are turned on or off by regulatory proteins. These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase. A regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS). So finding the same motif in the regulatory regions of multiple genes suggests a regulatory relationship among those genes.

Identifying Motifs: Complications
  We do not know the motif sequence.
  We do not know where it is located relative to the gene's start.
  Motifs can differ slightly from one gene to the next.
  How do we discern a real motif from random motifs?

The Motif Finding Problem
Given a random sample of DNA sequences:
  cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
  agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
  aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
  agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
  ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
find the pattern that is implanted in each of the individual sequences, namely, the motif.

The Motif Finding Problem (cont'd)
Additional information:
  The hidden sequence is of length 8.
  The pattern is not exactly the same in each sequence because random point mutations may occur in the sequences.

The Motif Finding Problem (cont'd)
The patterns revealed with no mutations (implanted pattern shown in upper case):
  cctgatagacgctatctggctatccACGTACGTaggtcctctgtgcgaatctatgcgtttccaaccat
  agtactggtgtacatttgatACGTACGTacaccggcaacctgaaacaaacgctcagaaccagaagtgc
  aaACGTACGTgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
  agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttACGTACGTataca
  ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaACGTACGTc
Consensus string: acgtacgt

The Motif Finding Problem (cont'd)
The patterns with 2 point mutations each (implanted instances shown in upper case):
  cctgatagacgctatctggctatccAGGTACTTaggtcctctgtgcgaatctatgcgtttccaaccat
  agtactggtgtacatttgatCCATACGTacaccggcaacctgaaacaaacgctcagaaccagaagtgc
  aaACGTTAGTgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
  agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttACGTCCATataca
  ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCCGTACGGc

Defining Motifs
To define a motif, let's say we know where the motif starts in each sequence. The motif start positions in their sequences can be represented as s = (s_1, s_2, s_3, ..., s_t).

Motifs: Profiles and Consensus
  Alignment   a G g t a c T t
              C c A t a c g t
              a c g t T A g t
              a c g t C c A t
              C c g t a c g G
  Profile  A  3 0 1 0 3 1 1 0
           C  2 4 0 0 1 4 0 0
           G  0 1 4 0 0 0 3 1
           T  0 0 0 5 1 0 1 4
  Consensus   A C G T A C G T
Line up the patterns by their start indexes s = (s_1, s_2, ..., s_t). Construct the profile matrix with the frequencies of each nucleotide in each column. The consensus nucleotide in each position is the one with the highest count in that column.

Consensus Think of consensus as an ancestor motif, from which mutated motifs emerged The distance between a real motif and the consensus sequence is generally less than that for two real motifs


Evaluating Motifs We have a guess about the consensus sequence, but how good is this consensus? Need to introduce a scoring function to compare different guesses and choose the best one.

Defining Some Terms
  t - number of sample DNA sequences
  n - length of each DNA sequence
  DNA - sample of DNA sequences (t x n array)
  l - length of the motif (l-mer)
  s_i - starting position of an l-mer in sequence i
  s = (s_1, s_2, ..., s_t) - array of motif starting positions

Parameters
  l = 8, t = 5, n = 69
  DNA:
    cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat
    agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
    aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
    agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca
    ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc
  s: s_1 = 26, s_2 = 21, s_3 = 3, s_4 = 56, s_5 = 60

Scoring Motifs
Given s = (s_1, ..., s_t) and DNA:
  $Score(s, DNA) = \sum_{i=1}^{l} \max_{k \in \{A, T, C, G\}} count(k, i)$
  Alignment   a G g t a c T t
              C c A t a c g t
              a c g t T A g t
              a c g t C c A t
              C c g t a c g G
  Profile  A  3 0 1 0 3 1 1 0
           C  2 4 0 0 1 4 0 0
           G  0 1 4 0 0 0 3 1
           T  0 0 0 5 1 0 1 4
  Consensus   a c g t a c g t
  Score       3+4+4+5+3+4+3+4 = 30
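The scoring rule can be checked with a few lines of Python. This is a minimal sketch, not code from the talk: the function names (profile, consensus, score, lmers_at) are illustrative, and start positions are taken to be 1-based as on the slides.

```python
from collections import Counter

def profile(lmers):
    """Count each nucleotide in every column of the aligned l-mers."""
    return [Counter(base.upper() for base in column) for column in zip(*lmers)]

def consensus(lmers):
    """Most frequent nucleotide in each column."""
    return "".join(col.most_common(1)[0][0] for col in profile(lmers))

def score(lmers):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(col.values()) for col in profile(lmers))

def lmers_at(dna, starts, l):
    """Extract one l-mer per sequence; starts are 1-based (an assumption)."""
    return [seq[s - 1:s - 1 + l] for seq, s in zip(dna, starts)]

# The five aligned l-mers from the slide.
aligned = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
print(consensus(aligned), score(aligned))
```

For the slide's alignment this reproduces the consensus ACGTACGT and the score 30.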

The Motif Finding Problem
If the starting positions s = (s_1, s_2, ..., s_t) are given, finding the consensus is easy even with mutations in the sequences, because we can simply construct the profile to find the motif (consensus). But the starting positions s are usually not given. How can we find the best profile matrix?

The Motif Finding Problem: Formulation
  Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
  Input: A t x n matrix of DNA, and l, the length of the pattern to find.
  Output: An array of t starting positions s = (s_1, s_2, ..., s_t) maximizing Score(s, DNA).

Motif finding
The motif finding problem manifests itself in two forms:
  Known motif and unknown position: given a motif of length l with d mismatches, find its position in a genomic/protein sequence.
  Unknown motif and unknown position: given a set of sequences each of length m, find a motif of length l with d mismatches.
(l,d)-motif problem
Introduced by Pevzner and Sze (2000). The problem is to determine the motif M of length l, given n nucleotide sequences each of length m, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with up to d point mutations (d << l).

Expected number of motifs
Assume that the sequences are i.i.d. (each base drawn independently and uniformly from {A, C, G, T}). The probability that a given l-mer C occurs with up to d mutations at a given position of a random sequence is
  $p = \sum_{i=0}^{d} \binom{l}{i} (3/4)^{i} (1/4)^{l-i}$
Then the expected number of length-l motifs that occur with up to d mutations at least once in each of the t random length-n sequences is approximately
  $E(l, d) = 4^{l} \left(1 - (1 - p)^{n-l+1}\right)^{t}$
Given t sequences of length n each and motif length l, the brute-force approach has complexity $l (n - l + 1)^{t} = O(l n^{t})$. For t = 10, n = 1000, l = 10 this is about $10^{30}$ computations. On the fastest computer of the time (Jaguar, 1759 teraflops) that takes $10^{18} / (3600 \times 24 \times 365 \times 1759) \approx 1.8 \times 10^{7}$ years, i.e. on the order of 20 million years.
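The numbers on this slide can be reproduced with a short calculation. The sketch below assumes the standard estimate for random sequences with independent, uniformly distributed bases: a fixed l-mer matches a given position with up to d mutations with probability p = sum over i <= d of C(l, i) (3/4)^i (1/4)^(l-i), and the expected number of spurious motifs is 4^l (1 - (1 - p)^(n-l+1))^t. The function names are illustrative.

```python
from math import comb

def p_match(l, d):
    """Probability that a fixed l-mer matches one position with at most d mutations."""
    return sum(comb(l, i) * 0.75**i * 0.25**(l - i) for i in range(d + 1))

def expected_spurious_motifs(l, d, n, t):
    """Expected number of l-mers occurring (up to d mutations) in all t random sequences."""
    p = p_match(l, d)
    return 4**l * (1 - (1 - p)**(n - l + 1))**t

def brute_force_ops(l, n, t):
    """Operation count l * (n - l + 1)^t of the brute-force search."""
    return l * (n - l + 1)**t

ops = brute_force_ops(l=10, n=1000, t=10)
seconds = ops / 1759e12            # assuming one operation per flop at 1759 teraflops
print(f"{ops:.2e} operations, {seconds / (3600 * 24 * 365):.2e} years")
```

The exact count 10 x 991^10 is somewhat larger than the rounded 10^30 used on the slide, so the printed estimate is correspondingly larger; the conclusion (exhaustive search over start positions is infeasible) is the same.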

The Motif Finding Problem: Brute Force Solution
Compute the scores for each possible combination of starting positions s. The best score will determine the best profile and the consensus pattern in DNA. The goal is to maximize Score(s, DNA) by varying the starting positions s_i, where s_i is in [1, ..., n-l+1] and i is in [1, ..., t].

BruteForceMotifSearch
  BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s_1, s_2, ..., s_t) from (1, 1, ..., 1) to (n-l+1, ..., n-l+1)
      if Score(s, DNA) > bestScore
        bestScore ← Score(s, DNA)
        bestMotif ← (s_1, s_2, ..., s_t)
    return bestMotif
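A runnable version of the pseudocode might look as follows. It is a sketch for tiny inputs only: it uses 0-based start positions (unlike the 1-based positions on the slides), assumes upper-case sequences of equal length, and the toy data is invented for illustration.

```python
from itertools import product

def score(dna, starts, l):
    """Consensus score of the l-mers beginning at the given 0-based starts."""
    total = 0
    for i in range(l):
        column = [seq[s + i] for seq, s in zip(dna, starts)]
        total += max(column.count(base) for base in "ACGT")
    return total

def brute_force_motif_search(dna, l):
    """Try every combination of start positions: (n - l + 1)^t score evaluations."""
    n = len(dna[0])
    best_score, best_starts = 0, None
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(dna, starts, l)
        if current > best_score:
            best_score, best_starts = current, starts
    return best_starts, best_score

# Hypothetical toy input: ACGT is present in all three sequences.
toy = ["TTACGT", "ACGTTT", "GACGTG"]
print(brute_force_motif_search(toy, 4))
```

With t = 3 and n - l + 1 = 3 there are only 27 combinations here; the count grows exponentially in t, which is why the approach does not scale.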

Running Time of BruteForceMotifSearch
Varying (n - l + 1) positions in each of t sequences, we are looking at $(n - l + 1)^{t}$ sets of starting positions. For each set of starting positions, the scoring function makes l operations, so the complexity is $l (n - l + 1)^{t} = O(l n^{t})$. That means that for t = 8, n = 1000, l = 10 we must perform approximately $10^{25}$ computations, which would take billions of years.

CONSENSUS: Greedy Motif Search
CONSENSUS first finds the two closest l-mers in sequences 1 and 2 and forms a 2 x l alignment matrix with Score(s, 2, DNA). At each of the following t-2 iterations, it finds the best l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences. In other words, it finds an l-mer in sequence i maximizing Score(s, i, DNA) under the assumption that the first (i-1) l-mers have already been chosen. CONSENSUS sacrifices the optimal solution for speed: in fact, the bulk of the time is spent locating the first two l-mers.
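The greedy strategy described here can be sketched as below. This illustrates the idea only; it is not the CONSENSUS program itself. It assumes upper-case sequences and reuses the same invented toy data as a check.

```python
from itertools import product

def score(lmers):
    """Consensus score of a partial alignment (one l-mer per chosen sequence)."""
    return sum(max(col.count(b) for b in "ACGT") for col in zip(*lmers))

def greedy_motif_search(dna, l):
    def lmers(seq):
        return [seq[i:i + l] for i in range(len(seq) - l + 1)]

    # Step 1: exhaustively pick the best pair of l-mers from sequences 1 and 2.
    chosen = list(max(product(lmers(dna[0]), lmers(dna[1])), key=score))
    # Step 2: for each further sequence, add the l-mer that maximizes the
    # score of the alignment built so far; earlier choices are never revisited.
    for seq in dna[2:]:
        chosen.append(max(lmers(seq), key=lambda m: score(chosen + [m])))
    return chosen

toy = ["TTACGT", "ACGTTT", "GACGTG"]
print(greedy_motif_search(toy, 4))
```

Step 1 costs about (n - l + 1)^2 score evaluations and each later step only n - l + 1, which matches the remark that most of the time goes into locating the first pair.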

Some Motif Finding Programs
  CONSENSUS          Hertz, Stormo (1989)
  GibbsDNA           Lawrence et al. (1993)
  MEME               Bailey, Elkan (1995)
  RandomProjections  Buhler, Tompa (2002)
  MULTIPROFILER      Keich, Pevzner (2002)
  MITRA              Eskin, Pevzner (2002)
  Pattern Branching  Price, Pevzner (2003)

Planted Motif Challenge
  Input: n sequences of length m each.
  Output: Motif M, of length l.
  Variants of interest have a Hamming distance of d from M.

How to proceed? Exhaustive search? Run time is high

Best Neighbor
Branch from the seed strings. Find the best neighbor, i.e. the one with the highest score. Don't consider branches where the upper bound is not as good as the best score so far.

Scoring
PatternBranching uses the total distance score. For each sequence S_i in the sample S = {S_1, ..., S_n}, let
  $d(A, S_i) = \min \{ d(A, P) \mid P \in S_i \}$
where P ranges over the l-mers of S_i. Then the total distance of A from the sample is
  $d(A, S) = \sum_{S_i \in S} d(A, S_i)$
For a pattern A, let D = Neighbor(A) be the set of patterns which differ from A in exactly 1 position. We define BestNeighbor(A) as the pattern B in D = Neighbor(A) with the lowest total distance d(B, S).
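The definitions of total distance and BestNeighbor translate directly into code. The sketch below covers only these two definitions, not the full PatternBranching search; the function names and the toy sample are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_to_sequence(pattern, seq):
    """d(A, S_i): smallest Hamming distance from the pattern to any l-mer of seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_distance(pattern, sample):
    """d(A, S): sum of d(A, S_i) over all sequences in the sample."""
    return sum(distance_to_sequence(pattern, seq) for seq in sample)

def neighbors(pattern, alphabet="ACGT"):
    """Neighbor(A): all patterns that differ from A in exactly one position."""
    for i, old in enumerate(pattern):
        for new in alphabet:
            if new != old:
                yield pattern[:i] + new + pattern[i + 1:]

def best_neighbor(pattern, sample):
    """BestNeighbor(A): the neighbor with the lowest total distance."""
    return min(neighbors(pattern), key=lambda b: total_distance(b, sample))

sample = ["TTACGT", "ACGTTT", "GACGTG"]   # hypothetical sample containing ACGT
print(total_distance("ACGA", sample), best_neighbor("ACGA", sample))
```

Starting from the seed ACGA, one branching step moves to ACGT, whose total distance to the sample is 0.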

PatternBranching Algorithm

PatternBranching Performance
PatternBranching is faster than other pattern-based algorithms.
Motif Challenge Problem:
  sample of n = 20 sequences
  each N = 600 nucleotides long
  implanted pattern of length l = 15
  k = 4 mutations

PMS (Planted Motif Search)
Generate all possible l-mers from the input sequence S_i. Let C_i be the collection of these l-mers.
Example: for AAGTCAGGAGT, C_i (3-mers) = AAG AGT GTC TCA CAG AGG GGA GAG AGT

All patterns at Hamming distance d = 1
(first row: the 3-mers of the sequence; each column lists the neighbors of the 3-mer at its head)
  AAG AGT GTC TCA CAG AGG GGA GAG AGT
  CAG CGT ATC ACA AAG CGG AGA AAG CGT
  GAG GGT CTC CCA GAG TGG CGA CAG GGT
  TAG TGT TTC GCA TAG GGG TGA TAG TGT
  ACG ACT GAC TAA CCG ACG GAA GCG ACT
  AGG ATT GCC TGA CGG ATG GCA GGG ATT
  ATG AAT GGC TTA CTG AAG GTA GTG AAT
  AAC AGA GTA TCC CAA AGA GGC GAA AGA
  AAA AGC GTG TCG CAC AGT GGG GAC AGC
  AAT AGG GTT TCT CAT AGC GGT GAT AGG

Sort the lists
  AAG AGT GTC TCA CAG AGG GGA GAG AGT
  AAA AAT ATC ACA AAG AAG AGA AAG AAT
  AAC ACT CTC CCA CAA ACG CGA CAG ACT
  AAT AGA GAC GCA CAC AGA GAA GAA AGA
  ACG AGC GCC TAA CAT AGC GCA GAC AGC
  AGG AGG GGC TCC CCG AGT GGC GAT AGG
  ATG ATT GTA TCG CGG ATG GGG GCG ATT
  CAG CGT GTG TCT CTG CGG GGT GGG CGT
  GAG GGT GTT TGA GAG GGG GTA GTG GGT
  TAG TGT TTC TTA TAG TGG TGA TAG TGT

Eliminate duplicates
Patterns that occur more than once across the sorted lists of one sequence are kept only once. For example, AAG appears in the lists for CAG, AGG and GAG, and the two occurrences of AGT produce identical lists.

Find motif common to all lists
Follow this procedure for all sequences. Find the motif common to all L_i (once duplicates have been eliminated). This is the planted motif.
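The whole PMS procedure (generate variants, sort, eliminate duplicates, intersect) fits in a short sketch. It follows the slides in using variants at Hamming distance exactly d; Python sets stand in for the sort-and-deduplicate step, and the function names and the three-sequence check are illustrative.

```python
from itertools import combinations, product

def d_neighbors(lmer, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from lmer."""
    out = set()
    for positions in combinations(range(len(lmer)), d):
        choices = [[b for b in alphabet if b != lmer[p]] for p in positions]
        for subs in product(*choices):
            chars = list(lmer)
            for p, b in zip(positions, subs):
                chars[p] = b
            out.add("".join(chars))
    return out

def candidate_list(seq, l, d):
    """Sorted, duplicate-free list L_i of variants of every l-mer of one sequence."""
    found = set()
    for i in range(len(seq) - l + 1):
        found |= d_neighbors(seq[i:i + l], d)
    return sorted(found)

def planted_motif_search(seqs, l, d):
    """Patterns that appear in the candidate list of every sequence."""
    common = set(candidate_list(seqs[0], l, d))
    for seq in seqs[1:]:
        common &= set(candidate_list(seq, l, d))
    return sorted(common)

print(sorted(d_neighbors("AAG", 1)))
print(planted_motif_search(["AAG", "ACG", "ATG"], l=3, d=1))
```

The first line printed is the AAG column of the table above; the second shows that AGG is the only 3-mer at distance 1 from all three one-word sequences.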

PMS Running Time
It takes time to:
  generate the variants,
  sort the lists,
  find and eliminate duplicates.
In the running-time bound of this algorithm (formula not recoverable from the slide), w is the word length of the computer.


Other Methods: existing approaches and their limitations
The motif finding problem has been studied extensively for many years in the literature using different approaches.
Statistical approaches [Workman and Stormo (2000); Thijs et al. (2001); Wang and Stormo (2003); Xing et al. (2004); Regnier and Denise (2004); Sinha et al. (2004); Siddharthan et al. (2005); Shida (2006); Leung and Chin (2006); Carmack et al. (2007)]. A limitation of existing statistical approaches is that they ignore evolutionary relationships, which can indicate where a motif is most likely to be found in very similar genomes.

Other Methods (continued)
Heuristic approaches [Liu et al. (2002); Buhler and Tompa (2002); Blanchette and Tompa (2002); Liu et al. (2004); Hu et al. (2006); Mendes et al. (2006); Wei and Jensen (2006); Peng et al. (2006)] have polynomial complexity but do not guarantee optimality.
Exhaustive (combinatorial) approaches [Marsan and Sagot (2000); Pevzner and Sze (2000); Sinha and Tompa (2000); Apostolico et al. (2000); Eskin and Pevzner (2001); Liu et al. (2002); Sinha (2003); Wang et al. (2005); Hon and Jain (2006)] guarantee optimality at the cost of combinatorial complexity.
Despite many studies, this problem is far from being satisfactorily solved.

OUR METHODS (A. Mohapatra, P. M. Misra, S. Padhy)
  Motif finding using Information-Entropy with Kullback-Leibler-Divergence. International Journal of Computer Science and Network Security, Vol. 9, No. 1, pp. 147-154, 2009 (ISSN: 1738-7906).
  Motif Search in DNA Sequences Using Generalized Suffix Tree. IEEE Computer Society (ISBN: 0-7695-3068-0, DOI: 10.1109/ICIT.2007.18).
  Motif Prediction using Weighted Degree String Kernels with Shift and Mismatch. ACM (ICAC3-09), pages 56-61 (ISBN: 978-1-60558-351-8, DOI: http://doi.acm.org/10.1145/1523103.1523116).
  Prediction of Motifs with Unknown Length and Allowable Mismatches using Kernel Based Approach.

Motif finding using Information-Entropy with Kullback-Leibler-Divergence International Journal of Computer Science and Network Security, Vol. 9 No. 1 pp. 147-154, 2009(ISSN: 1738-7906)

Related Motif finding Approaches
Information entropy, first introduced by Claude E. Shannon in his 1948 paper A Mathematical Theory of Communication, describes the information content of a signal or event.
Models based on the Positional Weight Matrix (PWM) [Bucher, 1990] [CONSENSUS, Gibbs sampling, EM etc.]
  A PWM can be created by counting the number of times a base is observed at each position.
  Limitations: existing PWM-based algorithms consider only the most frequent element in each column of the alignment matrix and completely ignore the rest, which are comparatively weak (lower probability of occurrence).
Models based on Information Entropy [Schneider, 1990] [LOGOS, 2004]
  From the information entropy point of view, all sequences (strong or weak) can contain different amounts of information about the biological signal in various forms, so such a model is a potential candidate for replacing the PWM model.

Contd.
Let Σ be an alphabet of size N; for DNA, Σ = {A, C, G, T}. If a symbol is received from an alphabet of size N and letters are sent with equal probability, the information entropy contained in a symbol is
  $I = \log_2 N$
In general, for symbol probabilities $p_1, \ldots, p_n$:
  $H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i$
There are close parallels between the mathematical expressions for the thermodynamic entropy S and the Shannon information-theoretic entropy H: the measure of information (Shannon entropy) is mathematically similar to that of disorder (Boltzmann entropy).
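As a small illustration of the entropy formula, the sketch below computes the Shannon entropy of a string from its observed symbol frequencies. Estimating the probabilities from the base composition of the string itself is an assumption made for this example.

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """H = -sum p_i log2 p_i, with p_i the observed frequency of each symbol."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(shannon_entropy("ACGT"))      # uniform over 4 symbols: log2(4) bits
print(shannon_entropy("AACC"))      # two equally likely symbols
print(shannon_entropy("AAAAAAAT"))  # highly skewed composition, low entropy
```

A uniform composition over the DNA alphabet gives the maximum of log2 4 = 2 bits per symbol; the more skewed the composition, the lower the value.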

Motifs: Profiles and Consensus
  Alignment   a G g t a c T t
              C c A t a c g t
              a c g t T A g t
              a c g t C c A t
              C c g t a c g G
  Profile  A  3 0 1 0 3 1 1 0
           C  2 4 0 0 1 4 0 0
           G  0 1 4 0 0 0 3 1
           T  0 0 0 5 1 0 1 4
  Consensus   A C G T A C G T
Line up the patterns by their start indexes s = (s_1, s_2, ..., s_t). Construct the profile matrix with the frequencies of each nucleotide in each column. The consensus nucleotide in each position is the one with the highest count in that column.

PWM
(Figure not reproduced: a set of aligned sequences and the PWM built from them.)
Sequences:
  TTGTGGC TTGTGAT GCCTGAC TTTTGAT ATTTATT TTGTGAT
  AAGTGTC ATTTGCA ACGTGAT ATGTGAG TTGTGAT GTGTGAA
  CTGTGAG ATGCAAA TTGTGAG CTGTAAC CTGTGAC ATGAGAC
  GTGTTAA CTGTGAA TTGTGAG ATTTGAA TTGTGAC

Models based on Information Entropy
Given a set of genomic sequences, the model can be used to rank each member sequence according to its information content. This helps in evaluating each input sequence and takes care of the weak signals that were completely ignored by the earlier PWM model.
Limitations: the model based on information entropy measures the quantity of information. It says nothing about the quality of the information. Given two substrings in a sequence with an equal quantity of information, only one may contain biologically relevant information (e.g. a motif).

Proposed Methodology: Shannon Entropy along with Kullback-Leibler Divergence
Motivations
  Biological observation: a biological motif is a more stable structure and is expected to have lower entropy.
  Physical observation: chaotic systems have a higher entropy value and stable systems have a lower entropy value.
  Motifs are short conserved regions in DNA, RNA, or protein sequences and are expected to have lower entropies.

Proposed Methodology: Shannon Entropy along with Kullback-Leibler Divergence
Motivation from the Kullback-Leibler divergence
In probability theory, the Kullback-Leibler divergence is a non-commutative (asymmetric) measure of the difference between two probability distributions p and q. It is non-negative and is 0 if and only if the two distributions are identical (p = q). The smaller the divergence, the more similar the two distributions are. This suggests that a measure based on the KL divergence can be used to model the important aspects of interesting signals or significant patterns (e.g. motifs) in DNA sequences.
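The properties listed here (non-negative, zero for identical distributions, not symmetric) can be seen in a few lines. The sketch uses the usual definition D(p || q) = sum of p_i log2(p_i / q_i); the slides do not spell the formula out, so this definition and the example distributions are assumptions.

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) in bits; requires q_i > 0 wherever p_i > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))   # identical distributions
print(kl_divergence(p, q))   # positive
print(kl_divergence(q, p))   # positive, but a different value: not symmetric
```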

Algorithm
  Step 1. Given a set of sequences and the length of an unknown signal, divide each sequence into substrings of length equal to the given length of the signal.
  Step 2. Compute the information entropy of all substrings obtained in Step 1.
  Step 3. Choose those substrings that fall within a selected range of lower information entropies computed in Step 2.
  Step 4. Find a consensus sequence of the set of substrings obtained in Step 3.
  Step 5. Calculate the Kullback-Leibler divergence of each substring of Step 3 with respect to the consensus string obtained in Step 4.
  Step 6. Keep those substrings that fall within a selected range of lower Kullback-Leibler divergences computed in Step 5.
  Step 7. The consensus of the set of strings obtained in Step 6 is the required signal.

Implementation
We tested our method on various synthetic data sets with a randomly implanted motif. As the results on the synthetic data sets were promising, we applied our method to real data sets: transgenic data regions of the E. coli ECRDB70 database. The method was implemented in MATLAB 7.1 in a Windows environment.

Performance Evaluation
The performance of our model on the real data set was computed using the following standard parameters:
  Sensitivity: the percentage of true motifs correctly predicted as true motifs. It measures how good the model is at hitting the real motifs.
  Specificity: the percentage of non-motifs correctly predicted as non-motifs. It measures how good the model is at rejecting the non-motifs.
  Accuracy: the percentage of correct predictions. It measures how accurate the model is in predicting the motifs.
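The seven steps can be sketched end to end as below. Several details are not given on the slides and are assumptions here: substrings are taken with a sliding window, entropy and KL divergence are computed from the base composition of each substring (with a pseudocount so the reference distribution is never zero), and the selected range is the lowest-scoring fraction. The function names and toy sequences are illustrative; the authors' MATLAB implementation may differ in all of these respects.

```python
from collections import Counter
from math import log2

BASES = "ACGT"

def base_distribution(s, pseudocount=0.5):
    """Base frequencies of a string, smoothed with a pseudocount (assumption)."""
    counts = Counter(s)
    total = len(s) + pseudocount * len(BASES)
    return [(counts[b] + pseudocount) / total for b in BASES]

def entropy(s):
    return -sum(p * log2(p) for p in base_distribution(s, 0) if p > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consensus(strings):
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def keep_lowest(items, key, fraction):
    """Keep the given fraction of items with the lowest key value (at least one)."""
    ranked = sorted(items, key=key)
    return ranked[:max(1, int(len(ranked) * fraction))]

def entropy_kl_motif(seqs, l, fraction=0.25):
    windows = [s[i:i + l] for s in seqs for i in range(len(s) - l + 1)]  # Step 1
    low_entropy = keep_lowest(windows, entropy, fraction)                # Steps 2-3
    reference = base_distribution(consensus(low_entropy))                # Step 4
    close = keep_lowest(low_entropy,
                        lambda w: kl(base_distribution(w), reference),
                        fraction)                                        # Steps 5-6
    return consensus(close)                                              # Step 7

seqs = ["ACGTAAAAAACGT", "TTGCAAAAAATGC"]   # hypothetical toy input
print(entropy_kl_motif(seqs, 6))
```

On this toy input the lowest-entropy window is the homopolymer run, which also shows why an entropy filter alone tends to favor low-complexity regions.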
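Assuming the standard confusion-matrix definitions, the three measures can be computed as follows. The counts in the example are made up for illustration and are not results from the talk.

```python
def sensitivity(tp, fn):
    """Fraction of true motifs predicted as motifs."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-motifs predicted as non-motifs."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 3 true positives, 1 false negative, 4 true negatives, 2 false positives.
print(sensitivity(3, 1), specificity(4, 2), accuracy(3, 4, 2, 1))
```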


Result
  Model                        Data                                       Motif            Predicted
  Information Entropy          Synthetic data, randomly implanted motif   TAGCTTCATCGTTGA  AAAAAAGAAAAAGAA
  Kullback-Leibler Divergence  Synthetic data, randomly implanted motif   AGAGAGAGAGAGAGA  AGAGAGAGAGAGAGA
  Kullback-Leibler Divergence  Synthetic data, randomly implanted motif   TAGCTTCATCGTTGA  AATTTTCTTCATTTA
  Kullback-Leibler Divergence  Synthetic data, randomly implanted motif   TAGCTTCATCGTTGA  TAGTTTCATCATTGA
  Kullback-Leibler Divergence  E. coli ECRDB database                     AAATTTCGA        AACTCTCGA

Result
The KL-divergence model generated the consensus with an improved accuracy of 74%, compared to 57% for the model based only on information entropy. The sensitivity and specificity of the KL-divergence model were 75% and 72% respectively, compared to 33% and 59% for the model based on information entropy.

Conclusion
The PWM-based model ignores weak signals, since it completely discards the bases with a lower probability of occurrence. The model based on information entropy overcomes the deficiencies of the PWM model but fails to distinguish between signal and noise when their information contents are the same. The proposed model, based on information entropy along with the Kullback-Leibler divergence, addresses the problem of distinguishing between relevant and irrelevant information.

Conclusion (continued)
In the test runs above, the proposed algorithm predicted the motifs that other methods predict correctly, and it also predicted more effectively some motifs that other models do not predict correctly.

Motif Search in DNA Sequences Using Generalized Suffix Tree
IEEE Computer Society (ISBN: 0-7695-3068-0, DOI: 10.1109/ICIT.2007.18)

Generalized Suffix Tree
When a set of strings needs to be represented by a suffix tree, a generalized suffix tree (GST) is used. A GST for strings S1, S2, S3, ..., SN can be built in O(|S1| + |S2| + |S3| + ... + |SN|) time using Ukkonen's algorithm.

Generalized suffix tree (Example)
Let s1 = abab and s2 = aab, terminated by $ and # respectively. The generalized suffix tree for s1 and s2 stores the suffixes
  { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
(Tree diagram not reproduced; each leaf is labelled with the start position of its suffix.)
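The example can be reproduced with a deliberately naive structure: an uncompressed generalized suffix trie built in quadratic time. This is not Ukkonen's linear-time construction and it does not compress edges as a real suffix tree does; it only illustrates what the tree stores. The dictionary layout and function names are illustrative.

```python
def build_generalized_suffix_trie(strings, terminators="$#"):
    """Insert every suffix of every terminated string; leaves record (string no., start)."""
    root = {}
    for idx, (s, end) in enumerate(zip(strings, terminators), start=1):
        text = s + end
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
            node.setdefault("leaves", []).append((idx, start + 1))
    return root

def leaves_below(node):
    """All (string no., start position) pairs in the subtree of a node."""
    found = list(node.get("leaves", []))
    for key, child in node.items():
        if key != "leaves":
            found.extend(leaves_below(child))
    return found

def occurrences(root, pattern):
    """Where the pattern occurs: walk down the trie, then collect the leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return sorted(leaves_below(node))

trie = build_generalized_suffix_trie(["abab", "aab"])
print(len(leaves_below(trie)), occurrences(trie, "ab"))
```

The trie has nine leaves, one per suffix in the list above, and the pattern ab is found at positions 1 and 3 of s1 and position 2 of s2. The leaf count below a node is the quantity used for ranking in the slides that follow.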

Related Work
Leaf count information has been proven useful in reducing the search space for a given set of sequences. When the nodes of a generalized suffix tree (GST) are ranked by leaf count and there is a partial match in the label of the first edge, the chances of a full match in the subtree are highest. If there is no partial match, the biggest subtree can be eliminated from further processing.
Limitations: this may not always hold in biological applications. The occurrences of the motif in the given genomic sequences may not be exact, so it is important to find them even in the presence of small errors.

Contd.
Appropriate branching decisions reduce the search space. The branching decision is based on very limited information about the likelihood that a particular branch will lead to a solution.
Limitation: this decision is usually taken at a very early stage of the tree, yet it can have a dominating influence on the overall efficiency of the search. Even reasonable choices made at an early stage can lead to extremely poor results.
Dynamic branching has been used as another strategy for further reducing the search space. It can identify un-influential branches on the basis of more reliable information available from later stages.
Limitation: it does not allow mismatches.

Additional Parameters
  Edge label length: the number of characters on the path from the root to a node is an important parameter in measuring the biological information it carries.
  Frequency of visits of an edge: more visits of an edge in the GST point to a more probable conserved region and signify a higher probability of occurrence of that pattern in the motif.
  Partial match with interspersed mismatches: during exact matching, if there is a partial match on an edge, the chances of a full match at a later stage are maximized.

Proposed Scoring Scheme
Let x be the number of leaves in the subtree rooted at v. The edge length of node v is the number of characters from the root to v. Let t be the edge length of node v and y the frequency of visits of that edge. Let z be the number of matches of the edge label with the given motif, with match weight w. Then the total score of node v is x + y*t + z*w.

Algorithm
  1. Read all the input DNA sequences.
  2. Construct a generalized suffix tree (GST) for the input sequences. While constructing the GST, store at each internal node at tree depth i the edge label length, the number of leaves and the frequency of visits of the edge.
  3. For each edge label at tree depth i, compute the partial match score by comparing it with the given motif.
  4. Compute the scoring function for each of the nodes at tree depth i. Rank the nodes in non-increasing order of the scoring function. If the total scores of two subtrees are equal, assign the higher rank to the one with the larger leaf count.

Algorithm (continued)
  5. Search the subtree under the node with the highest score. Update l as l minus the number of characters in the path traversed so far. Update d as d minus the number of mismatches in the path traversed so far. If the number of mismatches along an edge exceeds d, visit the next-ranked node.
  6. Otherwise, repeat steps 3 and 4 until a motif is reported (l = 0).
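The scoring rule and the ranking with its leaf-count tie-break can be written down directly. The NodeStats record and the example values are hypothetical; how x, y, t and z are gathered from the suffix tree is not shown here.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    name: str
    leaves: int        # x: number of leaves in the subtree rooted at the node
    edge_length: int   # t: number of characters from the root to the node
    visits: int        # y: frequency of visits of the edge
    matches: int       # z: matches of the edge label with the given motif

def node_score(node, match_weight):
    """Total score x + y*t + z*w from the proposed scoring scheme."""
    return node.leaves + node.visits * node.edge_length + node.matches * match_weight

def rank_nodes(nodes, match_weight=1.0):
    """Non-increasing score; equal scores are ordered by larger leaf count."""
    return sorted(nodes,
                  key=lambda n: (node_score(n, match_weight), n.leaves),
                  reverse=True)

# Hypothetical statistics for three sibling nodes.
nodes = [NodeStats("u", 3, 2, 1, 1), NodeStats("v", 4, 1, 1, 1), NodeStats("w", 1, 1, 1, 0)]
print([n.name for n in rank_nodes(nodes)])
```

Nodes u and v both score 6 with weight 1, so v is ranked first because its subtree has more leaves.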


Conclusion
We have discussed the motif finding problem as a type of tree-based search problem. We presented an algorithm that preprocesses the generalized suffix tree using parameters such as leaf counts, edge label length with frequency of visits, and partial matches with allowable interspersed mismatches. A composite scoring scheme ranks the nodes and discards unproductive parts of the search space by selecting the highest-ranking subtree.

Conclusion (continued)
  Allows mismatches
  Reduces the search space
  Explained with synthetic and implanted data

Motif Prediction using Weighted Degree String Kernels with Shift and Mismatch. ACM (ICAC3-09), Pages 56-61, ACM ( ISBN: 978-1-60558-351-8 DOI: http://doi.acm.org/10.1145/1523103.1523116)

Problem: find a given motif of length l with up to m mismatches in a given set of DNA sequences. We use a class of string kernel called the weighted degree kernel with shift and extend it to allow mismatches. We use a suffix-tree-based mismatch tree data structure with a scoring scheme to speed up SVM training in the implementation.

Support Vector Machines (SVM)
Support Vector Machines (SVMs) are supervised learning algorithms first introduced by Vapnik in 1995. SVMs use kernel functions as the core of classification and regression analysis. The kernel function corresponds to an inner product in some expanded feature space, so explicit computation in the high-dimensional feature space is avoided. A higher-dimensional feature space makes it possible to obtain a linear separation for data that are not linearly separable in the lower dimension.


Support Vector Machines (SVM)
SVMs and related kernel methods are extremely good at solving prediction problems in computational biology.
  They can easily handle non-vector inputs such as variable-length sequences or graphs.
  They can deal with large, high-dimensional data sets.
  They are flexible in modeling even with small training samples.
  Due to Mercer's conditions on the kernels, the corresponding optimization problems are convex and hence have no local minima.
These facts mark a clear advantage of SVMs over other pattern recognition algorithms, such as neural networks.

Related Work
Weighted Degree Kernel [Ratsch, G., & Sonnenburg, S. (2004)]
The WD kernel of order K compares two sequences x and x' of equal length L by summing all contributions of k-mer matches of length k ∈ {1, 2, ..., K}, weighted by coefficients β_k:
  $K(x, x') = \sum_{k=1}^{K} \beta_k \sum_{i=1}^{L-k+1} I(u_{k,i}(x) = u_{k,i}(x'))$
where u_{k,i}(x) is the string of length k starting at position i of the sequence x, I(.) evaluates to 1 when its argument is true and to zero otherwise, and
  $\beta_k = \frac{2 (K - k + 1)}{K (K + 1)}$
Limitations: recognition of matching blocks using the WD kernel depends strongly on the position of the subsequence and does not tolerate any positional variation. For example, if one of the sequences is shifted by just one position, the WD kernel fails to discover similar blocks and returns a lower similarity score.

Related Work
Weighted Degree Kernel with shift (WDS) [Ratsch, G., Sonnenburg, S., & Scholkopf, B. (2005)]
The WD kernel with shift tolerates small positional variations in the sequences to be compared:
  $K(x, x') = \sum_{k=1}^{K} \beta_k \sum_{i=1}^{L-k+1} \sum_{s=0,\ s+i \le L}^{S} \delta_s \, \mu_{k,i,s,x,x'}$
where
  $\mu_{k,i,s,x,x'} = I(u_{k,i+s}(x) = u_{k,i}(x')) + I(u_{k,i}(x) = u_{k,i+s}(x'))$
  $\beta_k = \frac{2 (K - k + 1)}{K (K + 1)}, \quad \delta_s = \frac{1}{2 (s + 1)}$
Limitations: does not allow mutations (mismatches).
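A direct, unoptimized evaluation of the WD kernel is sketched below, assuming the weights β_k = 2(K - k + 1) / (K(K + 1)) given on the slide. The function name is illustrative; this is not the SHOGUN implementation. The second call illustrates the stated limitation: shifting one sequence by a single position destroys all position-wise matches.

```python
def wd_kernel(x, y, K):
    """Weighted degree kernel of order K for two sequences of equal length."""
    assert len(x) == len(y)
    L = len(x)
    total = 0.0
    for k in range(1, K + 1):
        beta = 2 * (K - k + 1) / (K * (K + 1))
        # Count positions where the k-mers of x and y agree exactly.
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(L - k + 1))
        total += beta * matches
    return total

print(wd_kernel("ACGT", "ACGT", K=2))   # identical sequences
print(wd_kernel("ACGT", "CGTA", K=2))   # same content shifted by one position
```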
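The shifted variant can be sketched in the same style, assuming δ_s = 1 / (2(s + 1)), the same β_k as for the WD kernel, and the convention that a shifted k-mer must fit inside the sequence. With S = 0 it reduces to the plain WD kernel, and with S = 1 the shifted pair from the previous example gets a non-zero similarity.

```python
def wds_kernel(x, y, K, S):
    """Weighted degree kernel with shifts of up to S positions (unoptimized sketch)."""
    assert len(x) == len(y)
    L = len(x)
    total = 0.0
    for k in range(1, K + 1):
        beta = 2 * (K - k + 1) / (K * (K + 1))
        for i in range(L - k + 1):
            for s in range(S + 1):
                if i + s + k > L:          # shifted k-mer would run past the end
                    break
                delta = 1 / (2 * (s + 1))
                # Compare x shifted against y, and y shifted against x.
                mu = ((x[i + s:i + s + k] == y[i:i + k])
                      + (x[i:i + k] == y[i + s:i + s + k]))
                total += beta * delta * mu
    return total

print(wds_kernel("ACGT", "ACGT", K=2, S=0))
print(wds_kernel("ACGT", "CGTA", K=2, S=1))
```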

Proposed Kernel
To allow mutations in the candidate motifs, a weighted degree kernel with shift and mismatches is proposed:
  $K(x, x') = \sum_{k=1}^{K} \sum_{m=0}^{M} \beta_{k,m} \sum_{i=1}^{L-k+1} \sum_{s=0,\ s+i \le L}^{S} \delta_s \, \mu_{k,i,s,x,x'}$
where
  $\mu_{k,i,s,x,x'} = I(u_{k,i+s}(x) \ne_m u_{k,i}(x')) + I(u_{k,i}(x) \ne_m u_{k,i+s}(x'))$
and $u \ne_m u'$ evaluates to true if and only if there are exactly m mismatches between u and u'. Choose β_{k,m} = total weight / number of possible m-mismatching k-mers.
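A sketch of the kernel with shifts and mismatches follows. It assumes that the weight for m mismatches is β_k divided by the number of k-mers at exactly m mismatches, C(k, m)(|Σ| - 1)^m, which is one reading of the rule "total weight / number of possible m-mismatching k-mers"; the β_k and δ_s values are the ones used for the WD and WDS kernels. It is a plain reference evaluation, not the authors' suffix-tree-based implementation.

```python
from math import comb

def mismatches(a, b):
    return sum(p != q for p, q in zip(a, b))

def wdsm_kernel(x, y, K, M, S, alphabet_size=4):
    """WD kernel with shifts up to S and exactly-m-mismatch terms for m = 0..M."""
    assert len(x) == len(y)
    L = len(x)
    total = 0.0
    for k in range(1, K + 1):
        beta_k = 2 * (K - k + 1) / (K * (K + 1))
        for m in range(min(M, k) + 1):
            # Spread the weight over all k-mers with exactly m mismatches (assumption).
            beta_km = beta_k / (comb(k, m) * (alphabet_size - 1) ** m)
            for i in range(L - k + 1):
                for s in range(S + 1):
                    if i + s + k > L:
                        break
                    delta = 1 / (2 * (s + 1))
                    mu = ((mismatches(x[i + s:i + s + k], y[i:i + k]) == m)
                          + (mismatches(x[i:i + k], y[i + s:i + s + k]) == m))
                    total += beta_km * delta * mu
    return total

print(wdsm_kernel("ACGT", "ACGA", K=2, M=0, S=0))
print(wdsm_kernel("ACGT", "ACGA", K=2, M=1, S=0))
```

With M = 0 and S = 0 the value equals the plain WD kernel; allowing one mismatch raises the similarity of two sequences that differ in a single base.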


Algorithm
  Step 1: Input the m training sequences and n test sequences from files in FASTA format.
  Step 2: Create two arrays of strings, one for the training DNA sequences and the other for the test and validation DNA sequences. The sizes of these arrays are m x m and n x n respectively.
  Step 3: Initialize the parameters, such as the length of the motif, the allowable mismatches, the maximum shift, and the length of the sequences, to the desired values. Also initialize the constant C > 0, which sets the relative importance of maximizing the margin and minimizing the amount of slack.
  Step 4: Compute the kernel function for each pair of training sequences using the following steps:
    Step (4a): Compare all pairs of sequences by summing all contributions of the k-mer matches of length 1 to k.
    Step (4b): Compute the indicator function, which evaluates to 1 when the contributions are the same and zero otherwise.
    Step (4c): Reward each matching substring by a weighting coefficient β_k, or with mismatches β_{k,m}:
      $\beta_{k,m} = \frac{\beta_k}{\binom{k}{m} (|\Sigma| - 1)^{m}}$
  Step 5: Set the number of iterations, the initial margin of separation and precision for the SVM computation, and the initial values of the Lagrange multipliers α_i.
  Step 6: Compute the optimum values for all the training sequences by solving the quadratic optimization problem using step 1 of the algorithm for the general SVM.
  Step 7: Compute the bias b using step 2 of the algorithm for the general SVM.
  Step 8: For each of the test and validation sequences, compute the SVM decision function based on the computed parameters using step 3 of the algorithm for the general SVM. The value indicates whether the sequence contains the motif or not.
  Step 9: Compute the performance measures.

Performance Evaluation
Receiver Operating Characteristic curve (ROC curve): The ROC is represented by plotting the true positive rate (TPR) vs. the false positive rate (FPR).
Area under the ROC curve (auroc): The area under the ROC curve is calculated by adding trapezoids. auroc quantifies the quality of the classifier; a value > 0.5 indicates performance better than random guessing.
Area under the precision-recall curve (auprc): each PRC point = (true positives / all true labels, true positives / (true positives + false positives)), i.e. (recall, precision).
Cross correlation coefficient (CC): CC = (true positives * true negatives - false positives * false negatives) / sqrt((true positives + false positives) * (true positives + false negatives) * (true negatives + false positives) * (true negatives + false negatives))
F-measure and accuracy: F-measure = 2 / (1 / precision + 1 / recall); Accuracy = (true positives + true negatives) / all labels
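The measures above can be computed from the four confusion counts and from the ranked classifier scores. The following Python sketch is a minimal illustration; the function names and the 1/0 label encoding are assumptions made for this example.

```python
from math import sqrt


def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, CC, F-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f_measure = 2 / (1 / precision + 1 / recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "cc": cc,
            "f_measure": f_measure, "accuracy": accuracy}


def auroc(scores, labels):
    """Area under the ROC curve by adding trapezoids (labels are 1 or 0)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / neg, tp / pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

On strongly imbalanced data such as the splice site sets, accuracy alone can look high even for a weak classifier, which is why auroc and auprc are reported alongside it.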

Implementation The proposed kernel was first tested on various synthetic data sets. As the results of these tests were promising, the kernel was then applied to real data sets comprising the C. elegans and Drosophila melanogaster acceptor sites. The kernel was developed in C++ and implemented using the SHOGUN toolbox. A suffix tree based data structure with a scoring scheme is used to speed up SVM training and classification. Synthetic data: randomly generated strings with signals of different lengths implanted at random positions. Real data sets: C. elegans acceptor sites, 262421 sequences containing 15507 true splice sites (5.9%); Drosophila acceptor sites, 98367 sequences containing 1583 true splice sites (1.6%).

Result [Figures: Synthetic Data; Synthetic Data (5 Mut)]

Result [Figures: Drosophila; C. elegans]

Conclusion The WDSM kernel allows mismatches to be used along with shifts while computing the kernel from the given DNA sequences. Together, the mismatch and shift values gave a better representation of the motifs as well as of their positions within the sequence. This resulted in higher accuracy in classifying the DNA sequences for predicting the motifs. The method is able to detect motifs even with a small number of training samples. A suffix tree based data structure with a scoring scheme is used to speed up SVM training and classification.

Prediction of Motifs with Unknown Length and Allowable Mismatches Using Kernel Based Approach

Related Works Spectrum Kernel [Leslie, C., Kuang, R., & Eskin, E. (2003)]: especially tuned to small alphabets (like the 2-bit DNA alphabet). Defined as the inner product $K(x, x') = \Phi_k(x) \cdot \Phi_k(x')$, where $\Phi_k$ maps a sequence x consisting of characters in $\Sigma$ to a feature space of size $|\Sigma|^k$. Let $\#u(x)$ be the number of occurrences of u in x, k the length of the pattern u, and $\Sigma$ an alphabet. Then $\Phi_k(x)$ can be defined as $\Phi_k(x) = (\#u(x))_{u \in \Sigma^k}$.
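As an illustration of the definition above, the spectrum kernel can be computed without building the full feature vector, by counting only the k-mers that actually occur. This is a minimal Python sketch; the function names are chosen for this example.

```python
from collections import Counter


def spectrum_features(x, k):
    """Sparse feature map: counts #u(x) for every k-mer u occurring in x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the two k-mer count vectors."""
    fx = spectrum_features(x, k)
    fy = spectrum_features(y, k)
    # Only k-mers present in x can contribute; missing keys count as 0.
    return sum(count * fy[u] for u, count in fx.items())
```

Because only occurring k-mers are stored, the cost depends on the sequence lengths rather than on the size of the full feature space.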

Related Works Weighted Degree kernel without shifts (WDKWS): introduced by Sonnenburg et al., 2006, it uses a weighted sum of spectrum kernels for all degrees at every position of the sequence. The differences between the WDKWS and the spectrum kernel are the position dependence and the consideration of 1…K-mers rather than only K-mers, i.e. the subsequences of the K-mers are also considered during computation.
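The position dependence can be made concrete with a short Python sketch of a weighted degree kernel without shifts. The degree weight `beta_k` used here is an assumption, since the slide does not state it.

```python
def wd_kernel(x, y, K=3):
    """Weighted degree kernel without shifts: for each degree k = 1..K,
    count the positions where the k-mers of x and y are identical."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    L = len(x)
    total = 0.0
    for k in range(1, K + 1):
        beta_k = 2.0 * (K - k + 1) / (K * (K + 1))  # assumed degree weight
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(L - k + 1))
        total += beta_k * matches
    return total
```

For example, "ACGT" and "CGTA" share the 2-mers CG and GT, so a spectrum kernel scores them above zero, whereas `wd_kernel` returns 0 because no k-mer matches at the same position.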

Related Works Weighted Degree kernel with shift and mismatch (WDKSM): as already presented, it aims at improving the representation of the weighted degree kernel by allowing mismatches. It has resulted in higher accuracy in classifying the DNA sequences for predicting the motifs, and it is able to detect motifs even with a small number of training samples.

Limitations Spectrum kernels: cannot extract any positional information from the sequence that goes beyond the k-mer length. They are well suited for describing the content but not for analyzing signals where motifs appear in a certain order. Weighted Degree kernel without shifts: when using long K-mers, the memory demand becomes intractable, and the sparse weight vectors need to be stored and operated on more efficiently. Hence it is difficult to implement this kernel for practical purposes.

Limitations Weighted Degree kernel with shift and mismatch (WDKSM): this works fine for classification and finding motifs when the length of the motif and the number of allowable mismatches are known. Most motif discovery algorithms for DNA sequences require the motif's length as input. Styczynski et al. introduced the Extended (l,d)-motif Problem (EMP), where the motif's length is not an input parameter, but their algorithm takes an unacceptably long time to run.

Motivation for new Kernel Feature space: the feature map $\Phi$ of a string kernel for patterns of length l over an alphabet A has dimension $|A|^l$. The feature space size therefore increases exponentially with the pattern length, which makes longer patterns difficult to handle. Most dimensions of the feature space are sparse, so the feature space can be reduced to the total number of subsequences occurring in the training set. Behavior of spectrum kernels: the output no longer depends on characteristics of the input such as the organism from which the sequence was obtained, the length of the sequence, or the distribution of nucleotides. For example, the kernel values computed from (AAAA, CCCC), (AAAA, TTTT) and (GGGG, CCCC) are the same.
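Both observations can be checked with a few lines of Python. The sketch below is illustrative: the helper names and the two toy training sequences are assumptions made for this example, and the spectrum kernel is the plain k-mer count inner product.

```python
from collections import Counter


def kmer_counts(x, k):
    """Counts of every k-mer occurring in x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[u] for u, n in cx.items())


def observed_kmers(seqs, k):
    """Set of distinct k-mers that occur anywhere in the given sequences."""
    seen = set()
    for x in seqs:
        seen.update(kmer_counts(x, k))
    return seen


k = 2
full_dimension = 4 ** k                      # |A|^l for the DNA alphabet
reduced = observed_kmers(["ACGT", "CGTA"], k)  # hypothetical training set

# Three unrelated pairs of sequences give the same kernel value.
pairs = [("AAAA", "CCCC"), ("AAAA", "TTTT"), ("GGGG", "CCCC")]
values = [spectrum_kernel(x, y, k) for x, y in pairs]
```

Here only 4 of the 16 possible 2-mers occur in the toy training set, and all three pairs score identically, so the kernel value says nothing about which nucleotides the sequences are made of.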

Motivation for new Kernel Local and global requirements of motif prediction: string kernels rely on indicator functions and pairwise string comparisons to compute the kernel value. The values should be more significant for related or similar sequences than for unrelated or dissimilar ones. This is not observed in the existing string kernels, because the sequence comparisons are made starting from single nucleotides up to a certain length, allowing shifts and mismatches. We have proposed a kernel which provides meaningful scores depending only on the motifs the sequences contain. The kernel abstraction can be utilized for designing a generalized kernel for the detection of unknown motifs without depending on the training information of the organisms or other domain specific information.

Motivation for new Kernel Design a new spectrum kernel that caters to the above observations by reducing the search space and generating meaningful values which can be generalized.

Proposed Weighted Spectrum Kernel Let S be a set of n sequences, each of length l over the alphabet $\Sigma$, for which we intend to determine the motifs with unknown length l and known mutations d. Let x and x' be a pair of sequences from S, $u_i$ be any k-mer of length k occurring in x at position i, and $u_i(x, x')$ be the number of occurrences of $u_i$ in x and x'. Let $u_i(S - \{x, x'\})$ be the number of occurrences of $u_i$ in the sequences of the training set $S - \{x, x'\}$, and $B_i$ be the weight attached to each position i. The weight coefficient is $B_i = \frac{2(T - u_i + 1)}{T(T+1)}$, where T is the total number of all possible k-mers of length k occurring in the training set $S - \{x, x'\}$.

Proposed Weighted Spectrum Kernel Then the proposed weighted degree spectrum kernel is defined as:
$$K(x, x') = \sum_{i=1}^{l-k+1} B_i \, u_i(x, x') \, u_i(S - \{x, x'\})$$
While computing the spectrum values, this kernel considers not only the number of times the pattern occurs in the pair of sequences being compared, but also the number of times the k-mer occurs in the other sequences of the training set.

Implementation The proposed kernel was first tested on various synthetic data sets. As the results of these tests were promising, the kernel was then applied to real data sets comprising the C. elegans and Drosophila melanogaster acceptor sites and also the transcription start sites (TSS) of the human genome hg16. The kernel was developed in C++. The SHOGUN toolbox was used to implement the proposed kernel and for the quadratic optimization. Synthetic data: randomly generated strings with signals of different lengths implanted at random positions. Real data sets: C. elegans acceptor sites, 262421 sequences containing 15507 true splice sites (5.9%); Drosophila acceptor sites, 98367 sequences containing 1583 true splice sites (1.6%); hg16 TSS of the human genome, 1418392 sequences containing 14897 true sites (1.05%).

Result Synthetic data set 1
Kernel                     Accuracy   AuROC
Fixed Degree               65.00%     0.833
Spectrum Kernel            38.00%     0.204
Weighted String Kernel     53.00%     0.688
Weighted Spectrum Kernel   70.00%     0.913
Synthetic data set 2
Kernel                     Accuracy   AuROC
Fixed Degree               90.15%     0.604
Spectrum Kernel            89.23%     0.511
Weighted String Kernel     90.90%     0.449
Weighted Spectrum Kernel   90.95%     0.803

Result C. elegans Acceptor data
Kernel                     Accuracy   AuROC
Fixed Degree               93.10%     0.85
Spectrum Kernel            94.05%     0.57
Weighted String Kernel     92.10%     0.63
Weighted Spectrum Kernel   94.10%     0.91

Result Drosophila Acceptor data
Kernel                     Accuracy   AuROC
Fixed Degree               98.39%     0.797
Spectrum Kernel            97.55%     0.529
Weighted String Kernel     94.79%     0.570
Weighted Spectrum Kernel   98.39%     0.881

Result Hg16 TSS data
Kernel                     Accuracy   AuROC
Fixed Degree               96.94%     0.66
Spectrum Kernel            97.33%     0.59
Weighted String Kernel     97.10%     0.71
Weighted Spectrum Kernel   98.94%     0.75

Conclusions The performance of the new kernel was found to be better in terms of both classification accuracy and auroc. The kernel also outperformed the existing string kernels by allowing classification with motif lengths of more than 8. The new kernel yielded better performance than the other kernels, with a considerable improvement of the auroc from 0.449 to 0.81. Improved performance of the SVM classifier using the new kernel was also observed on the data sets containing acceptor sites of Drosophila and C. elegans. The SVM was able to classify sequences with mutated motifs using the same training data. The accuracy of classification did not depend on the training motif or the mutations and, to some extent, on the length.

Conclusion Cross classification of sequences for prediction of unknown motifs: the SVM trained on the acceptor sites of Drosophila was used to predict the C. elegans acceptor sites and showed an accuracy of 94%, similar to the result obtained when it was trained on C. elegans samples.

References
Leslie, C., Kuang, R., and Eskin, E., Inexact matching string kernels for protein classification, Kernel Methods in Computational Biology, pp. 95-112, MIT Press, 2003.
Rätsch, G., Sonnenburg, S., and Schölkopf, B., RASE: Recognition of alternatively spliced exons in C. elegans, ISMB 2005.
Rätsch, G., and Sonnenburg, S., Accurate splice site prediction for Caenorhabditis elegans, pp. 277-298, MIT Press series on Computational Molecular Biology, MIT Press, 2004.
Sonnenburg, S., Rätsch, G., Schäfer, C., and Schölkopf, B., Large Scale Multiple Kernel Learning, Journal of Machine Learning Research, 7, pp. 1531-1565, 2006.
Styczynski, M.P., Jensen, K.L., Rigoutsos, I., and Stephanopoulos, G.N., An extension and novel solution to the (l,d)-motif challenge problem, Genome Informatics, Vol. 15, pp. 63-71, 2004.

THANK YOU