Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015


11 Motif Finding

Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sampling. J. Buhler, M. Tompa: Finding motifs using random projections, RECOMB 2001, 69-75. P.A. Pevzner and S.-H. Sze, Combinatorial approaches to finding subtle signals in DNA sequences, ISMB 2000, 269-278.

The goal of motif finding is the detection of new, unknown signals in a set of sequences. For example, assume we have a collection of genes that appear to be co-regulated and have the same pattern of expression. You would like to search for a transcription factor binding site that occurs in the upstream regions of all the genes. The aim is to compute something like this, where we highlight a motif that occurs in all given DNA sequences:

[Figure: a set of DNA sequences with a common motif highlighted in each; caption: "search for common motif here"]

11.1 The Gibbs sampling method

We will first describe an algorithm based on the Gibbs-sampler. Basic assumptions:

Input is a set of DNA sequences s_1, ..., s_m. Each sequence contains a copy or instance of the motif. The motif has a fixed length l. The motif is generated by a position weight matrix.

11.1.1 Sequence generating models

We describe two models for randomly generating DNA sequences:

1. The background model B is given by a vector that specifies a probability for each of the four character states.
2. The motif model M specifies for each position i = 1, 2, ..., l a probability M(i, c) for each of the four states c for a word of length l:

[Table: a 4 x l matrix with rows A, C, G, T and columns 1, 2, 3, ..., l]

We will refer to this as a position weight matrix, although usually it is required that a position weight matrix contains the logarithms of the probabilities rather than the probabilities themselves.

11.1.2 Simple computations

If we know both M and B, then we can score any word w of length l. In more detail, with

P(w) = \prod_{i=1}^{l} B(w[i])   and   Q(w) = \prod_{i=1}^{l} M(i, w[i]),

we can define the log-odds ratio as

R(w) = \log \frac{Q(w)}{P(w)}.

If R(w) > 0, it is more likely that the word w was generated by the motif model. If R(w) < 0, then it is more likely that w was generated by the background model. If both M and B are given, then we can make a maximum likelihood prediction of the locations of the motif in each of the input sequences:
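The log-odds computation above is easy to sketch in code. The following is a minimal illustration, not part of the lecture notes; the toy values for B and M are hypothetical:

```python
import math

def log_odds(w, M, B):
    """R(w) = log(Q(w)/P(w)) for a word w of length l.

    B maps each base to its background probability; M[i] maps each
    base to its probability at motif position i (0-based here)."""
    log_p = sum(math.log(B[c]) for c in w)                   # log P(w), background model
    log_q = sum(math.log(M[i][c]) for i, c in enumerate(w))  # log Q(w), motif model
    return log_q - log_p

# Toy example (hypothetical values): uniform background, a length-3
# motif that strongly prefers 'A' at every position.
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 3
print(log_odds("AAA", M, B))   # positive: better explained by the motif model
print(log_odds("CGT", M, B))   # negative: better explained by the background
```

Note that summing logs rather than multiplying probabilities avoids underflow for long words, which is also why position weight matrices usually store log-probabilities.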

[Figure: sequences s_1, ..., s_m with chosen motif locations a_1, ..., a_m]

In each sequence s_i, choose a location a_i such that the word w starting at a_i maximizes R(w).

Vice versa, if we are given the locations a_1, ..., a_m of the instances of the motif in s_1, ..., s_m, then we can set up M and B for all states c and positions i as follows:

B(c) = (# occurrences of state c outside of the motif) / (# all bases outside of the motif)

and

M(i, c) = (# occurrences of c at the i-th position of the motif) / m.

For example, consider the following sequences and motif instances:

[Table: example sequences with highlighted motif instances, together with the resulting background vector B and motif matrix M (rows A, C, G, T, columns 1, ..., l); the values do not survive in this transcription]

Note: to avoid zeros in either distribution, pseudo-counts are added to all counts.

The maximum likelihood score of a choice of motif locations a_1, ..., a_m is defined as follows:

L(a_1, ..., a_m) = L(a_1, ..., a_m, M, B) = P(s_1, ..., s_m | a_1, ..., a_m, M, B)
= \prod_{i=1}^{m} \left( \prod_{j=1}^{a_i - 1} B(s_i[j]) \right) \left( \prod_{j=a_i}^{a_i + l - 1} M(j - a_i + 1, s_i[j]) \right) \left( \prod_{j=a_i + l}^{|s_i|} B(s_i[j]) \right),

where |s_i| = length of s_i. In this formula, we use either the background model or the motif model to assign a probability to a given position in a given sequence, depending on whether the position is outside of, or inside of, the current placement of the motif, respectively.

11.1.3 Motif-finding dilemma

Motif-finding dilemma: If we know M and B, then we can determine the most likely motif locations a_1, ..., a_m in s_1, ..., s_m.
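The estimation of B and M from given locations can be sketched as follows. This is an illustrative implementation, with a flat pseudo-count rather than the 10%-of-background scheme used in the example later in this section; function and parameter names are my own:

```python
from collections import Counter

def estimate_models(seqs, locs, l, pseudo=1.0):
    """Estimate motif matrix M and background B from motif locations.

    seqs: list of DNA strings; locs: 0-based start of the motif in each
    sequence; l: motif length; pseudo: pseudo-count added to every count
    to avoid zero probabilities."""
    bases = "ACGT"
    bg = Counter()                        # counts outside the motif
    pos = [Counter() for _ in range(l)]   # counts per motif column
    for s, a in zip(seqs, locs):
        for j, c in enumerate(s):
            if a <= j < a + l:
                pos[j - a][c] += 1        # inside the motif placement
            else:
                bg[c] += 1                # background
    total_bg = sum(bg[c] + pseudo for c in bases)
    B = {c: (bg[c] + pseudo) / total_bg for c in bases}
    M = [{c: (pos[i][c] + pseudo) / (len(seqs) + 4 * pseudo) for c in bases}
         for i in range(l)]
    return M, B

# Toy data (hypothetical): motif "ACG" planted at position 1 in each sequence.
seqs = ["TACGT", "GACGA", "CACGC"]
M, B = estimate_models(seqs, [1, 1, 1], 3)
print(M[0]["A"])   # (3 + 1) / (3 + 4) = 4/7
```

Each column of M and the vector B sum to 1, so both are proper distributions, as the formulas above require.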

If we know a_1, ..., a_m, then we can determine M and B. In motif-finding we know neither and therefore want to infer both.

Idea: Construct M and B from the a_i's of m - 1 sequences and then use M and B to determine a value of a_i for the remaining sequence s_i. Repeat. The Gibbs-sampler is based on this idea. This method can be interpreted as an MCMC approach in which proposed new solutions are always accepted.

11.1.4 Gibbs sampling algorithm

Algorithm 11.1.1 (Gibbs-sampler-based motif finding)
Input: Sequences s_1, ..., s_m
Output: Motif M, background B, locations a_1, ..., a_m
Initialization: Choose a_1, ..., a_m either randomly or using an algorithm, as described later.
repeat
    for h = 1, ..., m do
        Compute M and B from a_1, ..., a_{h-1}, a_{h+1}, ..., a_m
            (i.e., from the data with the h-th sequence and motif removed)
        Using M and B, compute a new location a_h in s_h   (*)
        Compute M and B from a_1, ..., a_m
            (i.e., from the data with the new choice of the h-th motif location inserted)
        Compute score L(a_1, ..., a_m)
until score L(a_1, ..., a_m) has converged
Return M, B and a_1, ..., a_m

In line (*), choose the new value a for a_h probabilistically according to the distribution of normalized log-odds scores:

R(a) / \sum_{a'=1}^{|s_h| - l + 1} R(a').

Here, R(a) is the score obtained for M and B applied to sequence s_h and motif location a.

11.1.5 Example

We will now discuss an example taken from Rouchka (1997). The motif length is l = 6. Here we show the input sequences and the initial choice of locations (in red):

[Figure: the input sequences with the initial motif locations highlighted in red; not reproduced in this transcription]
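Algorithm 11.1.1 can be sketched compactly as follows. This is an illustrative sketch, not the lecture's reference implementation: it weights each candidate location by the likelihood ratio exp(R(a)) rather than by the raw log-odds score (log-odds can be negative, so they cannot be used directly as sampling weights), and all names are my own:

```python
import math
import random

def gibbs_motif(seqs, l, iters=200, pseudo=0.5, seed=0):
    """Sketch of Gibbs-sampler motif finding (cf. Algorithm 11.1.1).

    Repeatedly holds out one sequence, re-estimates the motif matrix M
    and background B from the rest, and resamples the held-out motif
    location with probability proportional to exp(R(a))."""
    rng = random.Random(seed)
    bases = "ACGT"
    a = [rng.randrange(len(s) - l + 1) for s in seqs]   # random initialization

    def models(skip):
        # Estimate M and B with sequence `skip` (and its motif) removed.
        bg = {c: pseudo for c in bases}
        pos = [{c: pseudo for c in bases} for _ in range(l)]
        for i, s in enumerate(seqs):
            if i == skip:
                continue
            for j, c in enumerate(s):
                if a[i] <= j < a[i] + l:
                    pos[j - a[i]][c] += 1
                else:
                    bg[c] += 1
        tot = sum(bg.values())
        B = {c: bg[c] / tot for c in bases}
        M = [{c: p[c] / sum(p.values()) for c in bases} for p in pos]
        return M, B

    for _ in range(iters):
        for h in range(len(seqs)):
            M, B = models(skip=h)
            s = seqs[h]
            weights = []
            for start in range(len(s) - l + 1):
                w = s[start:start + l]
                r = sum(math.log(M[i][c] / B[c]) for i, c in enumerate(w))
                weights.append(math.exp(r))             # exp of log-odds R
            a[h] = rng.choices(range(len(weights)), weights=weights)[0]
    return a

# Toy run (hypothetical data): motif "ACGTA" planted at position 2 everywhere.
seqs = ["TTACGTATT", "CCACGTACC", "GGACGTAGG"]
print(gibbs_motif(seqs, l=5, iters=50))
```

Because new locations are always sampled (never rejected), this matches the text's observation that the Gibbs-sampler is an MCMC method in which every proposal is accepted.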

The number of A's in all of the sequences combined is 327, the number of C's is 317, the number of G's is 272, and the number of T's is 304. To avoid zero entries in the motif matrix M, we will use 10% of each of the background counts as pseudo-counts that are always added to the different counts obtained for different entries of M when calculating the frequencies. The pseudo-counts to be added to values in M thus equal 32.7, 31.7, 27.2 and 30.4 for A, C, G and T, respectively. We will not look at the details of this, but this explains why the tables containing observed counts don't translate directly into the tables containing frequencies.

In the first iteration of the main loop we remove the first sequence s_1 = TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT. Here are M and B computed with the first sequence removed:

[Table: the count and frequency tables for M and B; not reproduced in this transcription]

We then score all 36 = 41 - 6 + 1 possible words of length l in s_1 = TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT.

The column headed "normalized A_x" contains the normalized log-odds scores for each choice of motif location in s_1. The algorithm chooses one of the locations according to the given distribution of log-odds scores. We then repeat the main loop and remove the second sequence, etc. This is repeated until the score converges or until a set maximum number of iterations has been performed. Here is the final choice of motif:

[Figure: the final motif locations in the input sequences; not reproduced in this transcription]

11.2 The Projection Method

We will now describe the projection method. Again, the computational problem is to determine a motif by analyzing a set of sequences that contain instances of the motif. Here is another way to formalize the motif-finding problem (Pevzner and Sze):

Planted (l, d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given m sequences, each of length n and each containing a planted variant of M. More precisely, each such planted variant is a substring of length l that differs from M in at most d positions.

In the Planted (l, d)-Motif Problem, assume the motif is ACAGGATCA. The following 4 sequences each contain a planted version of this motif:

AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACAGGATCACT
GGATATACAGGATCACGGTGGGAAAACCCTGAC

Notice that some variants fully agree on a subset of the positions of the full motif:

ACAGGcTCc
AtAGcATCA
ACAGGATCA
ACAGGATCA

The key idea is to repeatedly choose k random positions from the l positions of the motif and then to hash words from the sequences using only those positions, in effect projecting to those k positions:

ACAGGATCA --project--> AAGTC

If we happen to choose a set of k positions that are well conserved by the motif, then we will obtain a hash bucket that contains a higher-than-expected number of hits. In other words, to address the Planted (l, d)-Motif Problem, the key idea is to choose k of the l positions at random, then to use the k selected positions of each l-mer w in a hash function h(w).
When a larger-than-expected number of l-mers hash to the same bucket, then it is likely that the bucket is enriched for instances of the planted motif M:

[Figure: four sequences S_1, ..., S_4, each containing an instance of the planted motif M; x's mark the d = 3 substituted positions and o's mark the k = 2 positions used in hashing; the instances in S_1, S_2 and S_3 hash to the same bucket]

Like many probabilistic algorithms, the projection algorithm performs a number of independent runs of a basic iteration. In each such trial, it chooses a random projection h and hashes each l-mer x in the input sequences to its bucket h(x). Any hash bucket with a high number of entries is explored as a source of the planted motif, in a refinement step, as described below.
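One such hashing trial can be sketched as follows. This is an illustrative sketch with names of my own choosing; the k projection positions are drawn at random, as in the algorithm, rather than fixed:

```python
import random
from collections import defaultdict

def projection_buckets(seqs, l, k, seed=0):
    """One trial of the projection method: pick k of the l motif
    positions at random and hash every l-mer by those k positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))   # the random projection
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        for start in range(len(s) - l + 1):
            w = s[start:start + l]
            key = "".join(w[p] for p in positions)
            buckets[key].append((i, start))       # remember (sequence, offset)
    return positions, buckets

# Toy run on the three short sequences of the worked example below.
seqs = ["cagtaat", "ggaactt", "aagcaca"]
positions, buckets = projection_buckets(seqs, l=3, k=2)
print(positions)
print(max(buckets.items(), key=lambda kv: len(kv[1])))   # fullest bucket
```

Every l-mer lands in exactly one bucket, so a bucket that collects noticeably more entries than the average is a candidate planted bucket.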

If k < l - d, then there is a good chance that some of the planted instances of M will be hashed to the same bucket (the planted bucket), namely all planted instances for which the k hash positions and the d substituted positions are disjoint. So, there is a good chance that the planted bucket will be enriched for the planted motif, and will contain more entries than an average bucket.

11.2.1 Example

Assume we are given the sequences

      1234567
s_1 = cagtaat
s_2 = ggaactt
s_3 = aagcaca

Let M = aaa be the (unknown) (3, 1)-motif. Hashing using the first two positions of every word of length 3, we get the following hash table:

h(x)  positions              h(x)  positions    h(x)  positions
aa    (1,5), (2,3), (3,1)    cg    -            gt    (1,3)
ac    (2,4), (3,5)           ct    (2,5)        ta    (1,4)
ag    (1,2), (3,2)           ga    (2,2)        tc    -
at    (1,6)                  gc    (3,3)        tg    -
ca    (1,1), (3,4), (3,6)    gg    (2,1)        tt    (2,6)
cc    -

The motif M is planted at positions (1,5), (2,3), (3,1) and (3,5), and in this example, three of the four instances hash to the planted bucket h(M) = aa.

11.2.2 Choosing the parameters

Obviously, the algorithm does not know which bucket is the actual planted bucket. So it considers every bucket that contains at least s elements, where s is a suitable threshold. The algorithm has three main parameters: the projection size k, the bucket (inspection) threshold s, and the number of independent runs t. In the following, we will discuss how to choose each of these parameters.

Projection size: Ideally, the algorithm should hash a significant number of instances of the motif into the planted bucket, while avoiding contamination of the planted bucket by random background l-mers. To minimize the contamination of the planted bucket, we must choose k large enough. What size must we choose for k so that the average bucket will contain less than one random l-mer? Since we are hashing m(n - l + 1) l-mers into 4^k buckets, if we choose k such that 4^k > m(n - l + 1), then the average bucket will contain less than one random l-mer.
For example, in a Planted (15, 4)-Problem with m = 20 and n = 600, we must choose k to satisfy:

k < l - d = 15 - 4 = 11   and   k > \frac{\log(m(n - l + 1))}{\log 4} = \frac{\log(20 \cdot (600 - 15 + 1))}{\log 4} \approx 6.76.
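The two bounds on k can be checked numerically; a minimal sketch (function name my own):

```python
import math

def projection_size_bounds(m, n, l, d):
    """Bounds on the projection size k: k < l - d keeps the hash positions
    clear of the substituted positions, and 4**k > m*(n - l + 1) keeps the
    average bucket below one random l-mer.  Both bounds are exclusive."""
    upper = l - d                                     # k must be strictly below this
    lower = math.log(m * (n - l + 1)) / math.log(4)   # k must be strictly above this
    return lower, upper

# The (15, 4)-problem from the text, with m = 20 sequences of length n = 600.
lo, hi = projection_size_bounds(m=20, n=600, l=15, d=4)
print(round(lo, 2), hi)   # any integer k with lo < k < hi works, e.g. k = 7
```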

If the total number of sequences is very large, then it may be that one cannot choose k to satisfy both k < l - d and 4^k > m(n - l + 1). In this case, set k = l - d - 1, as large as possible.

Bucket threshold: A bucket size of s = 3 or 4 is practical, as we should not expect too many instances to hash to the same bucket in a reasonable number of runs.

Number of independent runs: We want to choose t so that the probability is at least q = 0.95 that the planted bucket contains s or more planted motif instances in at least one of the t runs. Let \hat{p}(l, d, k) be the probability that a given planted motif instance hashes to the planted bucket, that is:

\hat{p}(l, d, k) = \binom{l - d}{k} / \binom{l}{k}.

Then the probability that fewer than s planted instances hash to the planted bucket in a given trial is B_{m, \hat{p}(l,d,k)}(s). Here, B_{m,p}(s) is the probability that there are fewer than s successes in m independent Bernoulli trials, each having probability p of success. If the algorithm is run for t runs, the probability that s or more planted instances hash to the planted bucket in at least one run is

1 - (B_{m, \hat{p}(l,d,k)}(s))^t.

For this to be \geq q, choose

t \geq \frac{\log(1 - q)}{\log(B_{m, \hat{p}(l,d,k)}(s))}.   (11.1)

Using this criterion for t, the choices for k and s above require at most thousands of runs, and usually many fewer, to produce a bucket containing sufficiently many instances of the planted motif.

11.2.3 Motif refinement

The main loop of the projection algorithm finds a set of buckets of size at least s. In the subsequent refinement step, each such bucket is explored in an attempt to recover the planted motif. The idea is that, if the current bucket is the planted bucket, then we have already found k of the planted motif residues. These, together with the remaining l - k residues, should provide a strong signal that makes it easy to obtain the motif in only a few iterations of refinement. We will process each bucket of size at least s to obtain a candidate motif.
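Equation (11.1) is straightforward to evaluate directly, since the binomial tail B_{m,p}(s) is a short finite sum. A minimal sketch (function name my own):

```python
import math

def trials_needed(m, l, d, k, s, q=0.95):
    """Number of runs t so that with probability >= q at least one run
    puts >= s planted instances in the planted bucket (Eq. 11.1)."""
    # p_hat: a planted instance hashes to the planted bucket iff the k
    # hash positions avoid all d substituted positions.
    p_hat = math.comb(l - d, k) / math.comb(l, k)
    # B_{m,p}(s): probability of fewer than s successes in m Bernoulli trials.
    b = sum(math.comb(m, j) * p_hat**j * (1 - p_hat)**(m - j)
            for j in range(s))
    return math.ceil(math.log(1 - q) / math.log(b))

# The (15, 4)-problem with m = 20 sequences, projection size k = 7,
# bucket threshold s = 3 (parameter values from the discussion above).
print(trials_needed(m=20, l=15, d=4, k=7, s=3))
```

A larger threshold s makes the per-run success rarer, so t grows with s, in line with the remark that s = 3 or 4 keeps the number of runs practical.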
Each of these candidates will be refined, and the best refinement will be returned as the final solution. Candidate motifs are refined either using the Gibbs-sampler or a related approach called the EM algorithm. In essence, the projection method is used to compute a good starting point and thus greatly speed up the convergence of a method such as the Gibbs-sampler algorithm.