SA-REPC - Sequence Alignment with a Regular Expression Path Constraint


Nimrod Milo, Tamar Pinhas, Michal Ziv-Ukelson
Ben-Gurion University of the Negev, Be'er Sheva, Israel
Graduate Seminar, BGU 2010
Milo, Pinhas & Ziv-Ukelson (BGU) SA-REPC November 2010 1 / 54

Outline
1 About Michal's Group
2 Sequence Alignment with a Regular Expression Path Constraint
3 Applying SA-REPC to microRNA target prediction
4 Summary

Michal's group
Michal Ziv-Ukelson, Tamar Pinhas, Isana Vaksler, Noa Mussa, Sivan Yogev, Shay Zakov, Erez Katzenelson

Topics of interest in our group
Sequence and tree alignments and similarity
Indexing, searching and compression
Secondary structure prediction of RNA: folding and co-folding
microRNA-mRNA target prediction
Sequence/structure motifs involved in localization and post-transcriptional regulation
Post-transcriptional regulation: virus-host microRNA-mRNA behavior
Protein motif discovery (common signals within family)
Algorithms on Strings and Trees
More...

Outline
1 About Michal's Group
2 Sequence Alignment with a Regular Expression Path Constraint
   Manhattan Tourist Problem
   Sequence Alignment
   Constraint Sequence Alignment
   SA-REPC definition
   Algorithm for the SA-REPC
   Complexity analysis
3 Applying SA-REPC to microRNA target prediction
   Background
   microRNAs
   Modifying the SA-REPC for microRNA target prediction
   Results
4 Summary

MTP: Manhattan Tourist Problem
Imagine seeking a path from source (s) to sink (t), going only eastward and southward, with the highest number of attractions on it. The attractions are marked by weights on the streets (edges) of a Manhattan grid.
[Figure: a Manhattan grid of intersections from the source s to the sink t, with weights such as 1, 10, 2, 3, 4 and 5 on some of the edges.]

Manhattan Tourist Problem: Formulation
Goal: Find the highest-scoring path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled sink.
Output: A longest path in G from source to sink.

MTP solution using Dynamic programming
Each vertex's score is the maximum, over its prior vertices, of the prior vertex's score plus the weight of the edge in between.
The score for a point (i, j) is computed by the recurrence relation:
$$S_{0,0} = 0$$
$$S_{i,j} = \max \left\{ \begin{array}{l} S_{i-1,j} + \text{score of the edge between } (i-1,j) \text{ and } (i,j) \\ S_{i,j-1} + \text{score of the edge between } (i,j-1) \text{ and } (i,j) \end{array} \right\}$$
Running time: for a grid of size $n \times m$, evaluating the above formula takes $O(n \cdot m)$.

Example
[Figure: the weighted grid from s to t; the vertex currently being computed is marked with *.]
$S_{1,0} = S_{0,0} + 2 = 0 + 2 = 2$
$S_{1,1} = \max(S_{0,1} + 0,\ S_{1,0} + 3) = \max(1 + 0,\ 2 + 3) = 5$
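The recurrence and the two worked cells can be checked with a short script. This is a minimal sketch, not the authors' code: the function name `manhattan_tourist` and the `down`/`right` edge-weight layout are assumptions made for illustration, and the tiny grid below contains only the four vertices used in the example (edge weights 1, 2, 0 and 3).

```python
from typing import List


def manhattan_tourist(down: List[List[float]], right: List[List[float]]) -> float:
    """Best score of a path from the source (0, 0) to the sink (n, m).

    down[i][j]  : weight of the edge (i, j) -> (i + 1, j); n rows, m + 1 columns
    right[i][j] : weight of the edge (i, j) -> (i, j + 1); n + 1 rows, m columns
    (this edge layout is a hypothetical convention chosen for the sketch)
    """
    n = len(down)
    m = len(right[0])
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    # First column and first row have a single predecessor each.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + right[0][j - 1]
    # Every other vertex takes the better of "from above" and "from the left".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j] + down[i - 1][j],
                          S[i][j - 1] + right[i][j - 1])
    return S[n][m]


# The four vertices of the worked example: S(0,1) = 1, S(1,0) = 2, S(1,1) = 5.
example_down = [[2, 0]]     # (0,0)->(1,0) weighs 2, (0,1)->(1,1) weighs 0
example_right = [[1], [3]]  # (0,0)->(0,1) weighs 1, (1,0)->(1,1) weighs 3
best = manhattan_tourist(example_down, example_right)
```

Each vertex is visited once and inspects two predecessors, which is where the $O(n \cdot m)$ running time comes from.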

Extending the MTP problem
Changing the scores to real numbers.
Adding diagonal movement (edges in the graph).
[Figure: the grid from s to t with real-valued edge weights such as 3.5, 4.6 and 3.12.]
$$S_{i,j} = \max \left\{ \begin{array}{l} S_{i-1,j} + \text{score of the edge between } (i-1,j) \text{ and } (i,j) \\ S_{i,j-1} + \text{score of the edge between } (i,j-1) \text{ and } (i,j) \\ S_{i-1,j-1} + \text{score of the edge between } (i-1,j-1) \text{ and } (i,j) \end{array} \right\}$$

Sequence alignment
Definition (Global sequence alignment problem)
Let $S_1$ and $S_2$ be two strings over an alphabet $\Sigma$, and let $s$ be a scoring matrix.
A sequence alignment is obtained by inserting gaps into $S_1$ and $S_2$ so that the symbols can be placed in one-to-one correspondence with each other.
The optimal global sequence alignment is a sequence alignment that has the optimal sum of scores, according to $s$, over the pairs of symbols that correspond to each other in the alignment.

Sequence Alignment example
$S_1$ = AGCGCGUU
$S_2$ = GUCAGACG
The scoring matrix $s$ is -1 for a mismatch or indel (space) and 1 for a match.
Example
  S_1:     A   G   -   C   -   G   -   C   G   U   U
  S_2:     -   G   U   C   A   G   A   C   G   -   -
  score:  -1   1  -1   1  -1   1  -1   1   1  -1  -1
An optimal alignment of $S_1$ and $S_2$ is scored -1.

Adding some sequences to the grid
We can extend the grid to represent an alignment between two sequences in the following way: we create a grid of $(|S_1| + 1) \times (|S_2| + 1)$ vertices. The additional row and column are for the gap sign ('-'). The scores on the edges are as follows:
horizontal edge from $(i, j)$ to $(i, j+1)$: $s[\text{-}, S_2[j]]$
vertical edge from $(i, j)$ to $(i+1, j)$: $s[S_1[i], \text{-}]$
diagonal edge from $(i, j)$ to $(i+1, j+1)$: $s[S_1[i], S_2[j]]$

[Figure: the alignment grid for AGCGCGUU and GUCAGACG, from the source s to the sink t, with a score of 1 or -1 on every edge.]

The alignment table
$S_1$ = AGCGCGUU (rows), $S_2$ = GUCAGACG (columns)

        -    G    U    C    A    G    A    C    G
   -    0   -1   -2   -3   -4   -5   -6   -7   -8
   A   -1   -1   -2   -3   -2   -3   -4   -5   -6
   G   -2    0   -1   -2   -3   -1   -2   -3   -4
   C   -3   -1   -1    0   -1   -2   -2   -1   -2
   G   -4   -2   -2   -1   -1    0   -1   -2    0
   C   -5   -3   -3   -1   -2   -1   -1    0   -1
   G   -6   -4   -4   -2   -2   -1   -2   -1    1
   U   -7   -5   -3   -3   -3   -2   -2   -2    0
   U   -8   -6   -4   -4   -4   -3   -3   -3   -1

The optimal alignment read off the table:
  S_1:  A  G  -  C  -  G  -  C  G  U  U
  S_2:  -  G  U  C  A  G  A  C  G  -  -
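The table can be reproduced with a few lines of code. The sketch below assumes the scoring used in the example (match +1, mismatch or gap -1); the function name `global_alignment` and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made here for illustration, not something the slides specify.

```python
from typing import List, Tuple

GAP = "-"


def global_alignment(s1: str, s2: str, match: int = 1,
                     other: int = -1) -> Tuple[List[List[int]], str, str]:
    """Fill the (len(s1)+1) x (len(s2)+1) alignment table and trace one optimal path."""
    n, m = len(s1), len(s2)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: s1 prefix against gaps
        T[i][0] = T[i - 1][0] + other
    for j in range(1, m + 1):          # first row: s2 prefix against gaps
        T[0][j] = T[0][j - 1] + other
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = T[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            T[i][j] = max(diag, T[i - 1][j] + other, T[i][j - 1] + other)

    # Traceback from the sink to the source.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + (
                match if s1[i - 1] == s2[j - 1] else other):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + other:
            a1.append(s1[i - 1]); a2.append(GAP); i -= 1
        else:
            a1.append(GAP); a2.append(s2[j - 1]); j -= 1
    return T, "".join(reversed(a1)), "".join(reversed(a2))


table, row1, row2 = global_alignment("AGCGCGUU", "GUCAGACG")
print(table[8][8])  # bottom-right cell of the table
print(row1)
print(row2)
```

The bottom-right cell holds the optimal global score, and the two printed rows are the aligned sequences with '-' marking gaps.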

Constraint Sequence Alignment
Numerous studies suggest applying additional constraints to sequence alignment in order to improve speed or accuracy.
The additional constraints can reflect a priori knowledge of the alignment and therefore narrow the problem search space or guide the search towards a preferred solution.

Related Work
Position anchoring [Myers-96, Sammeth-03]: demands that the path pass through certain cells in the table.
Spaced seeds [Ma-02, Kucherov-05, Benson-06]: a constraint on the path in the form of a partial word. Partial words are alignments based on the letters 1 (match) and * (don't-care). For example, 11*11* allows 111110 and also 110111.
Regular Expression Constraint Sequence Alignment (RECSA) [Arslan-05]: each string should satisfy a regular expression constraint.
SA-REPC: a constraint on the path in the form of a regular expression.
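The spaced-seed example can be made concrete with a small check. This is only an illustrative sketch: the helper name `matches_partial_word` is hypothetical, and it assumes the hit string encodes an ungapped stretch of the alignment with 1 for a match and 0 for a mismatch.

```python
def matches_partial_word(seed: str, hits: str) -> bool:
    """True if `hits` (a string over {0, 1}) satisfies the partial word `seed`.

    A '1' in the seed demands a match at that position;
    a '*' (don't-care) accepts either a match or a mismatch.
    """
    if len(seed) != len(hits):
        return False
    return all(s == "*" or h == "1" for s, h in zip(seed, hits))


allowed_a = matches_partial_word("11*11*", "111110")
allowed_b = matches_partial_word("11*11*", "110111")
rejected = matches_partial_word("11*11*", "101110")  # position 2 must be a match
```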

Preliminaries
An extended definition of sequence alignment with alignment-path constraints.
Example
The constraint is in the form of a regular expression.
$S_1$ = AGCGCGUU
$S_2$ = GUCAGACG
$R$ = 11011 (1 - match, 0 - everything else)
[Figure: an alignment of $S_1$ and $S_2$ in which a region of five columns is labeled 1 1 0 1 1.]

Preliminaries - Alignment alphabet examples
$\Sigma_r = \{1, 0\}$, where 1 is a match and 0 is anything else.
Example: the letters A and A are mapped to 1. U and '-' are mapped to 0.

$\Sigma_r = \{m, s, i, d\}$, where m is a match, s a substitution, i an insertion and d a deletion.
Example: the letters A and A are mapped to m. U and '-' are mapped to d. '-' and A are mapped to i.

$\Sigma_r = \left\{ \binom{\sigma_1}{\sigma_2} \mid \sigma_1, \sigma_2 \in \Sigma \cup \{\text{-}\} \right\} \setminus \left\{ \binom{\text{-}}{\text{-}} \right\}$: each pair of letters is mapped to a different symbol in the alignment alphabet.
Example: the letters A and U are mapped to $\binom{A}{U}$ in the alignment alphabet, and A, '-' to $\binom{A}{\text{-}}$.

Because some $\Sigma_r$ symbols can be mapped from different symbols in $\Sigma$, we need a mapping function $f$ defined as:
$$f : \Sigma \times \Sigma \to P(\Sigma_r)$$
Example: in $\Sigma_r = \{0, 1\} \cup \left\{ \binom{\sigma_1}{\sigma_2} \mid \sigma_1, \sigma_2 \in \Sigma \cup \{\text{-}\} \right\} \setminus \left\{ \binom{\text{-}}{\text{-}} \right\}$:
$f(A, A) = \{1, \binom{A}{A}\}$, $f(A, U) = \{0, \binom{A}{U}\}$
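As a rough illustration of the mapping function for the last alphabet, the sketch below returns, for one alignment column, both its coarse symbol (1 or 0) and its exact pair symbol. Representing the pair symbol as a Python tuple and the gap as "-" are assumptions of this sketch, as is the function name.

```python
from typing import Set, Tuple, Union

GAP = "-"
AlignmentSymbol = Union[str, Tuple[str, str]]


def f(a: str, b: str) -> Set[AlignmentSymbol]:
    """Map an alignment column (a over b) to the set of alignment-alphabet symbols it stands for.

    The column stands for the coarse symbol "1" (match) or "0" (anything else)
    and for the exact pair symbol, represented here as the tuple (a, b).
    """
    if a == GAP and b == GAP:
        raise ValueError("a column cannot pair a gap with a gap")
    coarse = "1" if a == b else "0"
    return {coarse, (a, b)}


same = f("A", "A")     # {"1", ("A", "A")}
differ = f("A", "U")   # {"0", ("A", "U")}
gapped = f("A", GAP)   # {"0", ("A", "-")}
```

Returning a set is what lets a regular expression mix coarse symbols and exact pair symbols: a transition fires if its label is any member of the set.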

Sequence Alignment with a Regular Expression Path Constraint
Definition (Global SA-REPC)
Let $S_1$ and $S_2$ be two strings over an alphabet $\Sigma$, $s$ a scoring matrix over the alphabet $\Sigma$, and $R$ a regular expression over an alignment alphabet $\Sigma_r$.
Find an alignment of $S_1$ and $S_2$ such that two conditions hold:
1 There exists an accepted region in the alignment belonging to $L_R$.
2 The overall score of the alignment, computed according to $s$, is optimal among all such alignments.

Sequence Alignment vs. SA-REPC
Example (input)
$S_1$ = AGCGCGUU
$S_2$ = GUCAGACG
$s$ is a scoring matrix: match +1, all other -1.
$R = 1\,0^{3}\,1\,0$
Example (Sequence Alignment): optimal alignment value = -1.
Example (SA-REPC): the alignment contains a region labeled 1 0 0 0 1 0; optimal alignment value = -3.
[Figure: the two alignments of $S_1$ and $S_2$ shown side by side.]

Modifications in the automaton
Regular expression: $R = 1^{*}(1 \mid 0)\,1^{2}\,0^{*}$
Automaton $A_R$: [Figure: states $q_0$ (start), $q_1$, $q_2$, $q_3$. $q_0$ has a self-loop on 1, $q_0 \to q_1$ on 0/1, $q_1 \to q_2$ on 1, $q_2 \to q_3$ on 1, and $q_3$ has a self-loop on 0.]
Built automaton $A$: [Figure: $A_R$ with two added states. The new start state $q_{init}$ has a self-loop on $\Sigma$ and an $\epsilon$-transition to $q_0$; $q_3$ has an $\epsilon$-transition to the new state $q_{final}$, which has a self-loop on $\Sigma$.]

Dynamic Programming solution
The DP solution:
We calculate a dynamic programming table $M$.
Each cell $M[i, j]$ holds $|Q|$ entries.
Entry $M[i, j](q)$ holds the optimal score of aligning $S_1[1, i]$ with $S_2[1, j]$ such that there is a run on $A$ which reaches $q$.
If no such alignment exists, the value of the entry $M[i, j](q)$ is null.
The answer is in $M[n, m](q_{final})$.

Dynamic programming recurrence formula
The recurrence formula for the problem is as follows:
1 $$M[0, 0](q) = \begin{cases} 0 & q = q_{init} \\ \text{null} & \text{otherwise} \end{cases}$$
2 $$M[i, j](q) = \max \left\{ \begin{array}{l} \max \{ M[i-1, j-1](p) + s[S_1[i], S_2[j]] \mid q \in \delta(p, f(S_1[i], S_2[j])) \} \\ \max \{ M[i-1, j](p) + s[S_1[i], \text{-}] \mid q \in \delta(p, f(S_1[i], \text{-})) \} \\ \max \{ M[i, j-1](p) + s[\text{-}, S_2[j]] \mid q \in \delta(p, f(\text{-}, S_2[j])) \} \end{array} \right.$$
3 If $i = 0$ (or $j = 0$), the sets above corresponding to $i-1$ (or to $j-1$) are ignored.

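Single cell calculation
The calculation of a single cell $M[i, j]$ under the assumptions:
$S_1[i] = S_2[j] = C$
$s[C, C] = 1$
$s[C, \text{-}] = s[\text{-}, C] = 0$
$A$ = [Figure: the built automaton, with $q_{init}$ (start, self-loop on $\Sigma$), an $\epsilon$-transition to $q_0$ (self-loop on 1), $q_0 \to q_1$ on 0/1, $q_1 \to q_2$ on 1, $q_2 \to q_3$ on 1, $q_3$ (self-loop on 0), and an $\epsilon$-transition to $q_{final}$ (self-loop on $\Sigma$).]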

Dynamic programming recurrence formula: example
Under the assumptions of the single cell calculation, the entry $M[i, j](q_1)$ is the maximum of three candidates, one per predecessor cell:
$M[i-1, j-1](q_0) + s[C, C] = M[i-1, j-1](q_0) + 1$
$M[i-1, j](q_0) + s[C, \text{-}] = M[i-1, j](q_0) + 0$
$M[i, j-1](q_0) + s[\text{-}, C] = M[i, j-1](q_0) + 0$
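To make the recurrence concrete, here is a minimal sketch of the table computation, not the authors' implementation. It makes several simplifying assumptions: the alignment alphabet is {1, 0} (1 for a match, 0 for anything else); the constraint is a plain string over that alphabet, such as 100010 for $1\,0^{3}\,1\,0$, rather than a general regular expression; scores are +1 for a match and -1 otherwise; and the $\epsilon$-transitions are folded into the state set, so state 0 plays the role of both $q_{init}$ and $q_0$, and the last state plays the role of $q_{final}$. The names `encode`, `score`, `delta` and `sa_repc` are hypothetical.

```python
from typing import List, Optional, Set

GAP = "-"


def encode(a: str, b: str) -> str:
    """Alignment-alphabet symbol of a column: '1' for a match, '0' otherwise."""
    return "1" if a == b else "0"


def score(a: str, b: str) -> int:
    """Assumed scoring: match +1, mismatch or gap -1."""
    return 1 if a == b else -1


def delta(p: int, c: str, pattern: str) -> Set[int]:
    """NFA for (any symbols) pattern (any symbols); states are 0..len(pattern)."""
    k = len(pattern)
    nxt: Set[int] = set()
    if p == 0:
        nxt.add(0)          # still before the accepted region
    if p == k:
        nxt.add(k)          # region already accepted; anything may follow
    if p < k and pattern[p] == c:
        nxt.add(p + 1)      # advance inside the region
    return nxt


def sa_repc(s1: str, s2: str, pattern: str) -> Optional[int]:
    """Optimal global score over alignments containing a region equal to `pattern`.

    Returns None (the 'null' of the recurrence) if no such alignment exists.
    """
    n, m, k = len(s1), len(s2), len(pattern)
    M: List[List[List[Optional[int]]]] = [
        [[None] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            preds = []
            if i > 0 and j > 0:
                preds.append((i - 1, j - 1, s1[i - 1], s2[j - 1]))
            if i > 0:
                preds.append((i - 1, j, s1[i - 1], GAP))
            if j > 0:
                preds.append((i, j - 1, GAP, s2[j - 1]))
            for pi, pj, a, b in preds:
                c, w = encode(a, b), score(a, b)
                for p, val in enumerate(M[pi][pj]):
                    if val is None:
                        continue
                    for q in delta(p, c, pattern):
                        if M[i][j][q] is None or val + w > M[i][j][q]:
                            M[i][j][q] = val + w
    return M[n][m][k]


unconstrained = sa_repc("AGCGCGUU", "GUCAGACG", "")
constrained = sa_repc("AGCGCGUU", "GUCAGACG", "100010")
print(unconstrained, constrained)
```

With an empty pattern the constraint is vacuous and the function reduces to ordinary global alignment. Each of the three predecessor cells contributes every one of its non-null states, which mirrors the three inner maxima of the recurrence.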

Complexity analysis
We denote:
$t = |Q|$, the number of states in $A$
$n$, the length of $S_1$
$m$, the length of $S_2$

Method       Trace   Time (NFA)    Time (DFA)   Memory
naïve        yes     O(mnt^2)      O(mnt)       O(mnt)
naïve        no      O(mnt^2)      O(mnt)       O(min{m, n} t)
Hirschberg   yes     O(mnt^2)      O(mnt)       O(min{m, n} t)

The Cell

The central dogma

A short movie

mRNA regions

microRNAs
1 microRNAs are short sequences of RNA (approximately 22 bases).
2 They function as specific gene regulators. A cell's function at any given time is determined by the composition of proteins in it. microRNAs suppress the translation of RNA to protein (DNA is transcribed to RNA, and RNA is translated to protein).
3 They operate by binding to complementary sequences on their mRNA target (this interaction is called hybridization). Hybridization is chemical bonding of bases (also called base pairing): A:U, G:C, G:U.
4 The complex created by hybridization of the microRNA to its mRNA target is called a duplex.
Figure: picture from Lin et al. 2003

Another short movie

Hybridization and sequence alignment
Hybridization of two sequences can be computed within the standard sequence alignment framework. The only difference is the scoring scheme:
- In sequence alignment, a match is when both symbols are the same.
- In hybridization, a match is when the two symbols are complementary. The matching pairs are A:U, G:C and G:U.
Example: C U C G U G A U A C A C U U U G U U
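The change of scoring scheme can be illustrated with a minimal sketch of global alignment in which a "match" is a complementary pair. The function names and the score values (+1 for a pair, -1 for a non-pair or a gap) are assumptions made for illustration; the slides do not give the actual scores. The sketch also assumes that one strand has already been reversed, so that both inputs are read in the same direction.

```python
# Complementary pairs named on the slide: A:U, G:C and the wobble pair G:U.
PAIRS = {("A", "U"), ("U", "A"),
         ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}


def pair_score(x, y, match=1, mismatch=-1):
    """Score one aligned column; the values are hypothetical."""
    return match if (x, y) in PAIRS else mismatch


def hybridization_score(s, t, gap=-1):
    """Global alignment score of s and t, rewarding complementary pairs.

    This is the classic quadratic dynamic program; only the column
    score differs from ordinary sequence alignment.
    """
    m = len(t)
    prev = [j * gap for j in range(m + 1)]          # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + pair_score(s[i - 1], t[j - 1]),  # pair column
                         prev[j] + gap,                                  # gap in t
                         cur[j - 1] + gap)                               # gap in s
        prev = cur
    return prev[m]
```

Swapping `pair_score` for an identity test (`x == y`) turns the same routine back into ordinary sequence alignment.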

Duplex properties
Several properties of the microRNA-target duplex have been observed, some of which serve as a basis for current microRNA target prediction algorithms.
1. 5'-end dominant seed: Several studies suggest the existence of a region of 6-8 nucleotides at the 5' end of the microRNA (the "seed"). The 5' end of the seed is unpaired or starts with U, and the seed doesn't contain wobble pairs (G:U). (Figure: the 5' seed.)
2. 3'-end compensatory seed: There is significant evidence that a 3'-end seed of the microRNA can compensate for a non-perfect 5' seed. (Figure: the 3' seed.)
3. Multiplicity: microRNAs have been shown to be capable of functioning in a collaborative manner. There are two types of multiplicity. (Figure: one panel labeled "microRNA" and "Target", one panel labeled "microRNA1", "microRNA2" and "Target".)
4. Accessibility and thermodynamics: The thermodynamics and accessibility of the duplex and its surrounding area are very important properties.

Using the current dogma on duplexes

Utilizing SA-REPC for microRNA target prediction
Some properties of the duplex can be written as a regular expression constraint.
5'-end dominant seed: ( i A G A A A C ) WCB 5 7 ii (WCB) 6
Where: WCB =
3'-end compensatory seed: ( G C C G A U U A ) 1 4 5 s 0 2
Inner bulge of the duplex: ( i 1 4 d 1 6)? ( 11 + ( i 1 4 d 1 6))
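To make the idea of a regular expression over an alignment path concrete, the sketch below encodes a given alignment as a string of edit symbols and tests it against a pattern. Everything here is an assumption made for illustration: the symbol alphabet (`W` for a Watson-Crick pair, presumably what the slide abbreviates as WCB, `g` for a wobble pair, `s` for a mismatch, `i` and `d` for gaps), the convention of which gap is `i` and which is `d`, and the simplified pattern, which only asks for 6 to 8 Watson-Crick pairs at the start of the path. It is not the expression from the slide, and it only checks one given path, whereas SA-REPC searches for the best-scoring path that satisfies the constraint.

```python
import re

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_column(x, y):
    """Map one alignment column to a hypothetical edit symbol."""
    if x == "-":
        return "i"          # gap in the microRNA row (assumed to be an insertion)
    if y == "-":
        return "d"          # gap in the target row (assumed to be a deletion)
    if (x, y) in WATSON_CRICK:
        return "W"
    if (x, y) in WOBBLE:
        return "g"
    return "s"              # any other pair is a mismatch


def encode_path(mirna_row, target_row):
    """Encode two gapped rows of equal length as a string of edit symbols."""
    if len(mirna_row) != len(target_row):
        raise ValueError("alignment rows must have equal length")
    return "".join(encode_column(x, y) for x, y in zip(mirna_row, target_row))


# Simplified, hypothetical seed constraint: the path starts with 6-8
# Watson-Crick pairs (no wobble), followed by any edit operations.
SEED_PATTERN = re.compile(r"W{6,8}[Wgsid]*")


def satisfies_seed(path):
    return SEED_PATTERN.fullmatch(path) is not None
```

For instance, `encode_path("UGAGGUAG-U", "ACUCCAUCGG")` yields a path that begins with eight `W` symbols, so it passes `satisfies_seed`, while a path with a wobble pair inside its first six columns does not.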

More complex duplex properties
Thermodynamics and accessibility of the duplex site and its surroundings are more complex properties.
- Thermodynamics: microRNA-target hybridization tends to have low free energy.
- Accessibility: Target site accessibility plays an important role in the formation of the duplex.
Both properties are computational bottlenecks. The complexity of such computations ranges from $O(nm^2)$ [Stadler-06] (with restrictions) up to $O(nm^5)$ [Hofacker-08].
We suggest using our method as an initial filter for target prediction tools that rely on energy computation.

Target prediction
Implementation
- Implemented the tool in a Java package named calign.
- A web version is available at: http://www.cs.bgu.ac.il/~negevcb/calign
Our data set
- 99 microRNAs.
- 640 3' UTRs of human genes (2183 transcripts).
- 873 verified duplexes from miRecords.

Comparative Results

Results

Tool        # of predicted pairs   # of true positives   Sensitivity
miRanda     22,857                 309                   35.3%
PITA        28,032                 661                   75.7%
RNAhybrid   43,693                 731                   83.7%
calign      43,210                 626                   71.7%

Table: Results on all 63,360 pairs
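The figures in the table can be cross-checked with a short sketch. It rests on two assumptions that the slides do not state explicitly but that the numbers are consistent with: sensitivity is the number of true positives divided by the 873 verified duplexes, and the 63,360 candidate pairs are all combinations of the 99 microRNAs with the 640 3' UTRs. The variable and function names are hypothetical.

```python
VERIFIED_DUPLEXES = 873          # verified duplexes in the data set
CANDIDATE_PAIRS = 99 * 640       # assumed: every microRNA against every 3' UTR

# tool -> (predicted pairs, true positives), copied from the results table
RESULTS = {
    "miRanda": (22857, 309),
    "PITA": (28032, 661),
    "RNAhybrid": (43693, 731),
    "calign": (43210, 626),
}


def sensitivity(true_positives, verified=VERIFIED_DUPLEXES):
    """Percentage of verified duplexes that a tool recovers (assumed definition)."""
    return 100.0 * true_positives / verified


def predicted_fraction(predicted, total=CANDIDATE_PAIRS):
    """Percentage of all candidate pairs that a tool reports as targets."""
    return 100.0 * predicted / total


if __name__ == "__main__":
    for tool, (predicted, tp) in RESULTS.items():
        print(f"{tool:10s} sensitivity={sensitivity(tp):5.1f}%  "
              f"predicted={predicted_fraction(predicted):5.1f}% of all pairs")
```

Reading sensitivity next to the fraction of pairs predicted shows the trade-off a filter faces: a tool can raise sensitivity simply by predicting more pairs.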

Conclusions
- Extended sequence alignment to support a path constraint (SA-REPC).
- Presented an application for our algorithm.
- Implemented the algorithm (calign).
- Showed preliminary comparative results.

Future work
- Find more properties of duplexes that can be used in SA-REPC.
- Find more applications for SA-REPC.
- SA-REPC may be extended to more general language classes, such as grammars.
- An interesting open problem is whether techniques previously used to obtain sub-quadratic sequence alignment, such as Four Russians and acceleration by compression, can be applied to SA-REPC.

Acknowledgements
Special thanks
- Tamar Pinhas, co-author
- Dr. Michal Ziv-Ukelson, my advisor
- The rest of Michal's group at BGU: Erez Katznelson, Isana Vaksler, Sivan Yogev, Shay Zakov, Noa Mussa