graph kernel approach to the identification and characterisation of structured noncoding RNs using multiple sequence alignment information Mariam lshaikh lbert Ludwigs niversity Freiburg, Department of omputer Science Feb 18th, 016 Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 1 / 1
Motivation Explosion in the discovery of nonidentified ncrns efficient automated approaches. Lack of automated classification tools done manually. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 / 1
pproach onservation important for unpaired region. ovariation important for base pairs. In this work onsider both conservation and covariation. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 3 / 1
Multiple lignment raph enerator (M) What is M M is a graph encoder tool. M can encode the evolutionary conservation of sequences and structures. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 4 / 1
Multiple lignment raph enerator (M) What is M M is a graph encoder tool. M can encode the evolutionary conservation of sequences and structures. Why M raph formalism flexible encoding. raphs powerful machine learning techniques (graph kernels). Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 4 / 1
Multiple lignment raph enerator (M) What is M M is a graph encoder tool. M can encode the evolutionary conservation of sequences and structures. Why M raph formalism flexible encoding. raphs powerful machine learning techniques (graph kernels). M aim Simulate experts on identifying interesting alignments for further investigation. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 4 / 1
Nerest Neighbourhood subgraph pairwise Distance kernal EDeN EDeN graph kernel tool. EDeN Extend the notion of kmears from string to graphs. It counts the fraction of identical pairs of neighborhood subgraphs. r=1, d=5 r=, d=5 r=3, d=5 Figure : Pairs of neighbourhood graphs for radius=1,,3 and distance=5 Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 5 / 1
Information in the input alignment files 1 The alignments are generated using Mfinder. Mfinder is an alignment tool, produces sequences that have consensus structure. Every alignment contains information about: Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 6 / 1
Information in the input alignment files The alignments are generated using Mfinder. Information contained in alignment files: 1 Secondary structure prediction. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 7 / 1
Information in the input alignment files The alignments are generated using Mfinder. Information contained in alignment files: 1 Secondary structure prediction. Nucleotides conservation. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 8 / 1
Information in the input alignment files The alignments are generated using Mfinder. Information contained in alignment files: 1 Secondary structure prediction. Nucleotides conservation. 3 Strength of conservation. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 9 / 1
Information in the input alignment files The alignments are generated using Mfinder. Information contained in alignment files: 1 Secondary structure prediction. Nucleotides conservation. 3 Strength of conservation. 4 Entropy of the nucleotides. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 10 / 1
Information in the input alignment files The alignments are generated using Mfinder. Information contained in alignment files: 1 Secondary structure prediction. Nucleotides conservation. 3 Strength of conservation. 4 Entropy of the nucleotides. 5 ovariation of the secondary structure. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 11 / 1
raphs created by M M produces two different graph representations. 1 Node based graphs N. Summary based graphs S. Each representation can encode the information in: 1 One node:. List of nodes: L. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 1 / 1
Node based graphs 1 N encodes the information in one node. N L encodes the information in set of nodes forming a list. 1 1 Figure : N : onservation information encoded in single nodes. Figure : N L : onservation and covariation information in multiple nodes forming a list. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 13 / 1
Summary based graphs 1 S same as N but summary information about the structure is encoded. S L same as N L but more summary information about the structure is encoded. NS : NS : S :3 1 1 S :3 1.8 Figure : S : onservation Figure : S L : onservation and covariation This extra information can be the vg, Max, Min, of occurrence of a specific nucleotide or the conservation of the alignment information. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 14 / 1
Data description The motif sequences are from bacteria, archaea [Weinberg 010]. Z.Weinberg has manually annotated the alignments in functional and nonfunctional. They are binary classified. Data Num. files Num. classes vg seqs num. vg. seq. length Positive 308 classes 70 seqs 150 nucleotides Negative 160 10 classes 70 seqs 130 nucleotides Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 15 / 1
Evaluation The experiment data sets were balanced. Same number of files in pos and neg. Testing each pos class against the 10 neg classes. In total we have 0 experiments. The Receiver Operator haracteristic RO is the performance measurement. RO computes the true positive rate against the false positive rate. The final RO score is averaged over the different experiments. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 16 / 1
Results (RO) M S N S L N L N 1 = cons 0.86 N 1 = cons 0.85 N 1 = cov N = cons 0.69 N 1 = cons N = sscons 0.68 NS : NS :,,,,,, < >, < > S :3 1 1 S :3 1.8 < < > > < > < > Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 17 / 1
onclusion M can identify interesting ncrns up to RO 86%. Take home message The best graph representation is 1 Summary based S. Labelled with the conservation information. The tool can be used as: powerful prefiltering method for large amounts of alignments. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 18 / 1
urrent work Integrating M into ipython environment. Integrate automated alignment of input sequences into M. Encoding finer structural information as hairpins, bulges, and loops to improve the classification. Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 19 / 1
cknowkegment Prof.Dr. Rolf Backofen Dr. Fabrizio osta Dr.Zasha Weinberg Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 0 / 1
Questions Thank you for your attention Mariam lshaikh (S) M, niversity Freiburg Feb 18th, 016 1 / 1