Likelihood vs. Information in Aligning Biopolymer Sequences. UCSD Technical Report CS Timothy L. Bailey


Likelihood vs. Information in Aligning Biopolymer Sequences

UCSD Technical Report CS93-318

Timothy L. Bailey
Department of Computer Science and Engineering
University of California, San Diego
1 February, 1993

For correspondence: Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0114, (619) 534-8187, tbailey@cs.ucsd.edu.

ABSTRACT: Biopolymer sequences often contain regions of similarity with other sequences due to homology or common function. A common method of discovering patterns in biopolymer sequences is to align a set of sequences so that certain columns of the alignment have highly non-random residue frequency distributions. The pattern can then be described in terms of a consensus pattern, motif, profile, specificity matrix or regular expression. This research note shows that a commonly used method of measuring the "goodness" of an alignment based on information theory is actually equivalent to maximizing the likelihood ratio of two hypotheses when the assumed probability distribution is multinomial. In addition, a method which has been used by other workers for determining whether a new sequence contains the pattern is shown to be essentially equivalent to a likelihood ratio. This offers a new, uniform way of thinking about the information contained in a set of aligned sequences which is more intuitive, and may aid the development of improved algorithms.

1 Introduction

It is useful to discover patterns in biopolymer sequences such as DNA, RNA or proteins for numerous reasons. The patterns may shed light on the structure and function of the sequences. The patterns may also be used for classifying new sequences as containing or not containing the pattern. Patterns in biopolymer sequences are known to exist that reflect common evolutionary origins of the sequences, common functions and common secondary and tertiary structure.

Patterns in biopolymer sequences can be discovered either by laboratory experiments or by examining sets of sequences known to share a common function, structure or evolutionary origin. Laboratory experiments are expensive, so looking for patterns in sets of sequences is an attractive alternative or supplement to lab work. The patterns discovered can be directly informative about the biopolymers, can be used to classify new biopolymers and can be used to direct future laboratory experiments on biopolymers which appear to contain a pattern of interest.

One method of examining sets of biopolymer sequences for potential common patterns is to align them either by hand or by computer and then look for columns which contain "unlikely" distributions of residues. Each sequence is treated as a string of letters over the appropriate alphabet (e.g., A, C, G, T for DNA sequences). The sequences are written horizontally and aligned so that the common regions in each string start in the same columns of the alignment. It may be necessary to insert gaps in some of the sequences in order to accomplish the alignment. Usually, the sequences are longer than the common pattern they share. For the purposes of this research note, it is assumed that the pattern being searched for is known or assumed to be W residues long.

The patterns discovered by aligning a set of sequences can be described as consensus patterns [Chappey et al., 1991], motifs [Staden, 1990], profiles [Gribskov et al., 1990], specificity matrices [Hertz et al., 1990] or regular expressions.

Hertz, Hartzell and Stormo [Hertz et al., 1990] describe a successful program which automatically aligns sets of sequences, produces a specificity matrix describing the discovered pattern and determines how well new sequences match the pattern. The program must score various possible alignments to determine which alignment is best. It uses what I will call an "alignment score". It also must determine if a new sequence matches the pattern represented by the optimum alignment. It computes what I call a "match score" and compares it to a threshold. The threshold is computed by computing the match score for many sequences believed not to match the pattern, and choosing a number larger than the maximum match score thus found.

Hertz, Hartzell and Stormo's program uses an alignment score based on information theory. This score was first described in [Schneider et al., 1986]. The total alignment score is the sum of the score for each of the columns in the alignment window, where W, the length of the window, is chosen in advance.
$$\text{Alignment Score} = \sum_{col=1}^{W} I(col)$$

The column alignment score I(col) is a measure of how unlikely the observed distribution of residues in a given column of the alignment window is. The alignment score for a single column is calculated as

$$I(col) = \sum_{\alpha=1}^{M} f_\alpha \log_2 \frac{f_\alpha}{p_\alpha} \qquad (1)$$

where $\alpha$ is a residue, $M$ is the number of different types of residues (i.e., M = 4 for DNA, M = 20 for proteins), $p_\alpha$ is the genomic frequency of residue $\alpha$ (i.e., the a priori estimate of the frequency of residue $\alpha$), and $f_\alpha$ is the frequency of residue $\alpha$ in column col of the aligned sequences.

No derivation or motivation for the column alignment score I(col) is given in either [Schneider et al., 1986] or [Hertz et al., 1990]. Presumably its motivation is based on information theoretic arguments. It can be noted that I(col) is related to the relative entropy of two probability distributions for the residues in a column. In particular, if the residues are assumed to be equiprobable, that is, $p_\alpha = 1/M$ for $1 \le \alpha \le M$, then, because $\sum_{\alpha} f_\alpha = 1$,

$$
\begin{aligned}
I(col) &= \sum_{\alpha=1}^{M} f_\alpha \log_2 \frac{f_\alpha}{p_\alpha} \\
       &= \sum_{\alpha=1}^{M} f_\alpha \log_2 f_\alpha - \sum_{\alpha=1}^{M} f_\alpha \log_2 p_\alpha \\
       &= \sum_{\alpha=1}^{M} f_\alpha \log_2 f_\alpha - \log_2 (1/M) \\
       &= \sum_{\alpha=1}^{M} f_\alpha \log_2 f_\alpha + \log_2 M \\
       &= H(1/M) - H(f)
\end{aligned}
$$

where $H(1/M) = \log_2 M$ is the entropy of a message with M equiprobable results and $H(f) = -\sum_{\alpha} f_\alpha \log_2 f_\alpha$ is the entropy of a message with M results with probabilities $f_\alpha$ for $1 \le \alpha \le M$. It is not clear to this author what the meaning of I(col) is in terms of information theory when the a priori distribution is skewed (i.e., not $p_\alpha = 1/M$ for $1 \le \alpha \le M$). An attempt to reconstruct the motivation for I(col) led to the research for this note.

To evaluate the strength of the match between the pattern defined by an alignment and a new sequence of length W, [Hertz et al., 1990] use the sum of a match score for each column in the alignment window.

$$\text{Match Score} = \sum_{col=1}^{W} Score(col)$$

The column match score Score(col) measures how well the residue in a column of the new sequence matches the prediction made by the pattern discovered in the aligned sequences. The column match score for a new sequence which has residue $\alpha$ in column col is

$$Score(col) = \log_2 \frac{n_\alpha + 1}{(N + 1)\, p_\alpha} \qquad (2)$$

where $N$ is the number of sequences being aligned, $n_\alpha$ is the number of times residue $\alpha$ occurred in column col of the alignment, and $p_\alpha$ is the same as for I(col).

The motivation for the column match score Score(col) is given in [Hertz et al., 1990] in terms of how much the probability of the observed frequency distribution would change if the new sequence were added to the alignment. The probability of observing residue $\alpha$ exactly $n_\alpha$ times for $1 \le \alpha \le M$ was assumed to be given by the multinomial distribution

$$P = \frac{N!}{\prod_{\alpha=1}^{M} (n_\alpha)!} \prod_{\alpha=1}^{M} p_\alpha^{\,n_\alpha}$$

It will be shown that maximizing I(col) (resp. Score(col)) is equivalent to maximizing the log-likelihood ratio of two hypotheses given that the probability model is the multinomial distribution over N (resp. 1) independent trials. For I(col), the equivalence with a log-likelihood ratio maximization is exact; for Score(col) the equivalence is approximate, with the discrepancy becoming smaller as N, the number of sequences in the alignment, increases.

Section 2 will demonstrate the equivalence of the information-based and likelihood-based alignment and match scores. Section 3 discusses why the likelihood-based scores make intuitive sense and the implications for future research on algorithms for aligning biopolymer sequences.
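The column score I(col) of equation (1) and the alignment score that sums it over the window can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy four-sequence window and the uniform DNA background are made up for the example and do not come from the program of [Hertz et al., 1990].

```python
import math
from collections import Counter


def column_information(column, background):
    """I(col): sum over observed residues of f * log2(f / p), equation (1)."""
    counts = Counter(column)
    n_seqs = len(column)
    total = 0.0
    for residue, count in counts.items():
        f = count / n_seqs  # observed frequency of this residue in the column
        total += f * math.log2(f / background[residue])
    return total


def alignment_score(window, background):
    """Sum of I(col) over the W columns of an ungapped alignment window."""
    columns = zip(*window)  # transpose: rows are sequences, columns are positions
    return sum(column_information(col, background) for col in columns)


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical data: four aligned DNA sequences, W = 4, uniform background.
uniform = {residue: 0.25 for residue in "ACGT"}
window = ["ACGT", "ACGA", "ACTT", "AGGT"]

print(alignment_score(window, uniform))
# With an equiprobable background, I(col) equals log2(M) - H(f):
print(column_information("CCCG", uniform), math.log2(4) - entropy([0.75, 0.25]))
```

Residues that do not occur in a column are simply skipped, which matches the usual convention that a term with zero frequency contributes zero to the sum.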

2 Equivalence of Scores Based on Information and Likelihood

One method of choosing between two hypotheses given some observed data uses the concept of the likelihood ratio [Edwards, 1972]. In this method, you first choose a probability model that is assumed to describe the process that generates the data. Competing hypotheses are described in terms of parameters of the probability model. The object is to find the values of the parameters which are best supported by the data. The method is to choose the values of the parameters which would more frequently generate the observed data. The first hypothesis is preferred over the second when the value of the likelihood ratio is greater than 1.

The likelihood ratio is defined in terms of the likelihood function. The likelihood function for the multinomial distribution given some observed data R is

$$L(\theta \mid R) = k \prod_{\alpha=1}^{M} P_\alpha(\theta)^{\,n_\alpha}$$

where $k$ is an arbitrary constant, $M$ is the number of classes, $P_\alpha(\theta)$ is the probability of a sample being in class $\alpha$ on any given trial, and $n_\alpha$ is the number of samples during N independent trials that belonged to class $\alpha$. The likelihood ratio for hypothesis $\theta_1$ versus $\theta_2$ given data R is defined as

$$L(\theta_1; \theta_2 \mid R) = \frac{L(\theta_1 \mid R)}{L(\theta_2 \mid R)}$$

For the multinomial probability model and hypotheses $\theta_1$ and $\theta_2$ such that

$$P_\alpha(\theta_1) = f_\alpha, \quad 1 \le \alpha \le M$$
$$P_\alpha(\theta_2) = p_\alpha, \quad 1 \le \alpha \le M$$

the likelihood ratio can be written as

$$L(\theta_1; \theta_2 \mid R) = \frac{\prod_{\alpha=1}^{M} f_\alpha^{\,n_\alpha}}{\prod_{\alpha=1}^{M} p_\alpha^{\,n_\alpha}}$$
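The likelihood ratio just defined is easiest to compute in log space, where the products become sums and the arbitrary constant k cancels. The following is a minimal sketch, assuming a hypothetical column of eight residues and an illustrative (not measured) background distribution; the function name is invented for the example.

```python
import math
from collections import Counter


def log2_likelihood_ratio(column, background):
    """log2 of prod(f^n) / prod(p^n) for one alignment column.

    Hypothesis 1 sets the residue probabilities to the observed
    frequencies f; hypothesis 2 sets them to the background p.
    """
    counts = Counter(column)
    n_seqs = len(column)
    return sum(
        n * (math.log2(n / n_seqs) - math.log2(background[residue]))
        for residue, n in counts.items()
    )


# Illustrative background frequencies (assumed, not taken from any genome).
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
column = "AAAGAATA"  # hypothetical column from N = 8 aligned sequences

llr = log2_likelihood_ratio(column, background)
print(llr)                # log2 of the likelihood ratio
print(llr / len(column))  # the same quantity per sequence
```

Dividing the log-likelihood ratio by N gives a per-sequence value that can be compared directly with I(col) from equation (1) for the same column. A column whose observed frequencies equal the background gives a log ratio of zero, that is, a likelihood ratio of 1.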

Theorem 1: Maximizing $L(\theta_1; \theta_2 \mid R)$ is equivalent to maximizing I(col), where the observed data R are the residues in the given column of the aligned sequences. (Notice that this data is also used to compute the values $n_\alpha$ and $f_\alpha$.)

Proof: Since $g(x) = x^{1/N}$ is monotonically increasing for $x \ge 0$, and $L(\theta_1; \theta_2 \mid R) \ge 0$, maximizing $L(\theta_1; \theta_2 \mid R)$ is equivalent to maximizing

$$L(\theta_1; \theta_2 \mid R)^{1/N} = \left( \frac{\prod_{\alpha=1}^{M} f_\alpha^{\,n_\alpha}}{\prod_{\alpha=1}^{M} p_\alpha^{\,n_\alpha}} \right)^{1/N} = \frac{\prod_{\alpha=1}^{M} f_\alpha^{\,f_\alpha}}{\prod_{\alpha=1}^{M} p_\alpha^{\,f_\alpha}}$$

where the last step uses $n_\alpha / N = f_\alpha$. Since $\log(x)$ is monotonically increasing, this is equivalent to maximizing

$$\log_2 \frac{\prod_{\alpha=1}^{M} f_\alpha^{\,f_\alpha}}{\prod_{\alpha=1}^{M} p_\alpha^{\,f_\alpha}} = \sum_{\alpha=1}^{M} f_\alpha \log_2 \frac{f_\alpha}{p_\alpha} = I(col)$$

Theorem 2: Maximizing $L(\theta_1; \theta_2 \mid R)$ is essentially equivalent to maximizing Score(col), where the observed data R is the single residue that the new sequence has in the given column.

Proof: Once again, we take the logarithm of the likelihood ratio and note that maximizing the log-likelihood ratio is equivalent to maximizing the likelihood ratio. With a single trial in which residue $\alpha$ is observed, the count is 1 for $\alpha$ and 0 for every other residue, so the ratio reduces to $f_\alpha / p_\alpha$ and

$$\log_2 L(\theta_1; \theta_2 \mid R) = \log_2 \frac{f_\alpha}{p_\alpha} \approx \log_2 \frac{n_\alpha + 1}{(N + 1)\, p_\alpha} = Score(col)$$

The approximation becomes better as N increases, since as $N \to \infty$, $\log_2 L(\theta_1; \theta_2 \mid R) \to Score(col)$ because $(n_\alpha + 1)/(N + 1) \to f_\alpha$.

3 Discussion

This research note has shown that a successful scoring system [Hertz et al., 1990] for alignments of related biopolymer sequences based on information theory is equivalent to a likelihood ratio method. Also, a method of using a set of aligned sequences to evaluate whether new sequences contain the same pattern based on a probabilistic argument is equivalent to the same likelihood ratio method.

The likelihood ratio method has the advantages of making explicit all of the assumptions upon which an inference is based, and of being intuitively pleasing (at least to some). It requires that the probability model and the alternative hypotheses to be tested be clearly defined. The data is then used to determine which hypothesis is better supported. The likelihood ratio can be interpreted operationally as the relative frequency with which the observed data would be generated by the two hypotheses [Edwards, 1972]. Using a single method for justifying both the alignment and matching processes seems simpler and more open to analysis than a combination of information theory and probability.

The probability model used in this note for the frequencies of residues in a column of a set of aligned sequences is the multinomial distribution. This is a sensible model since each of the N sequences being aligned can be thought of as an independent sample. The two hypotheses which are compared in the likelihood ratio in this note are the hypothesis that the residue probabilities in the columns of the correctly aligned sequences are the observed residue frequencies, versus the hypothesis that the correct probabilities are the a priori frequencies. Searching for the alignment with the highest likelihood ratio can be viewed as looking for the alignment such that the second hypothesis is rejected most strongly. This is a reasonable way of determining if a pattern really exists in the data that can be found by trying various alignments.

The method of aligning sequences described in [Hertz et al., 1990] summed the scores for all the columns in the alignment window. Since the scores are equivalent to log-likelihood ratios, this is equivalent to multiplying likelihood ratios together. Future research should examine the assumptions of independence among the columns of the alignment underlying this algorithm. It might also be useful to replace Score(col) with the log-likelihood ratio in cases where there are few sequences, since that is when they will differ the most.

It would be interesting to analytically define the distributions of I(col) and Score(col) in order to set thresholds for alignments and matches without resorting to large sets of supposed negative examples. This might be easier using the likelihood formalism than using the information theoretic and probabilistic formalisms of [Hertz et al., 1990].
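The remark that Score(col) and the log-likelihood ratio differ most when there are few sequences can be checked numerically. The sketch below is an illustration only: it assumes a uniform DNA background and a residue seen in exactly half of the aligned sequences, and the function names are hypothetical.

```python
import math


def match_score(n_residue, n_seqs, p_residue):
    """Equation (2): log2((n + 1) / ((N + 1) * p))."""
    return math.log2((n_residue + 1) / ((n_seqs + 1) * p_residue))


def log2_ratio(n_residue, n_seqs, p_residue):
    """Single-trial log-likelihood ratio log2(f / p), with f = n / N."""
    return math.log2((n_residue / n_seqs) / p_residue)


p = 0.25  # assumed uniform DNA background
for n_seqs in (4, 40, 400):
    n = n_seqs // 2  # residue observed in half of the sequences, so f = 0.5
    gap = match_score(n, n_seqs, p) - log2_ratio(n, n_seqs, p)
    print(n_seqs, gap)  # the gap shrinks as the number of sequences grows
```

One practical difference is visible in the code: when a residue never occurs in the column (n = 0), `log2_ratio` is undefined because it takes the logarithm of zero, whereas `match_score` stays finite thanks to the added counts.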

References

[Chappey et al., 1991] C. Chappey, A. Danckaert, P. Dessen, and S. Hazout. MASH: An interactive program for multiple alignment and consensus sequence construction for biological sequences. Computer Applications in Biosciences, 7(2):195-202, 1991.

[Edwards, 1972] A. W. F. Edwards. Likelihood. Cambridge University Press, Cambridge, England, 1972.

[Gribskov et al., 1990] Michael Gribskov, Roland Luthy, and David Eisenberg. Profile analysis. Methods in Enzymology, 183:146-159, 1990.

[Hertz et al., 1990] Gerald Z. Hertz, George W. Hartzell, III, and Gary D. Stormo. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in Biosciences, 6(2):81-92, 1990.

[Schneider et al., 1986] Thomas D. Schneider, Gary D. Stormo, Larry Gold, and Andrzej Ehrenfeucht. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 188:415-431, 1986.

[Staden, 1990] Rodger Staden. Searching for patterns in protein and nucleic acid sequences. Methods in Enzymology, 183:193-210, 1990.