CS 4491/CS 7990 Special Topics in Bioinformatics
Mingon Kang, Ph.D., Computer Science, Kennesaw State University
* The contents are adapted from Dr. Jean Gao at UT Arlington
Primer on Probability: Random Variables, Probability Distributions, Independence, Bayes Theorem, Means and Variances
Definition of Probability
Frequentist interpretation: the probability of an event is the proportion of the time that events of the same kind will occur in the long run.
Examples:
- the probability my flight to Chicago will be on time
- the probability this ticket will win the lottery
- the probability it will rain tomorrow
A probability is always a number in the interval [0, 1]: 0 means the event never occurs; 1 means it always occurs.
Sample Spaces
Sample space: the set of possible outcomes for some event.
Examples:
- flight to Chicago: {on-time, late, cancelled, accident}
- lottery: {ticket 1 wins, ticket 2 wins, ..., ticket n wins}
- weather tomorrow: {rain, not rain}, or {sun, rain, snow}, or {sun, clouds, rain, snow, sleet}, or ...
Random Variables
Random variable: a variable representing the outcome of an experiment.
Example: X represents the outcome of my flight to Chicago.
We write the probability of my flight being on time as Pr(X = on-time), or, when it is clear which variable we are referring to, we may use the shorthand Pr(on-time).
Notation
Uppercase letters and capitalized words denote random variables; lowercase letters and uncapitalized words denote values.
We denote a particular value for a variable as follows: Pr(X = x), e.g. Pr(Fever = true).
We also use the shorthand form Pr(x) for Pr(X = x).
For Boolean random variables, we use the shorthand Pr(fever) for Pr(Fever = true) and Pr(¬fever) for Pr(Fever = false).
Probability Distributions
If X is a random variable, the function given by p(x) = Pr(X = x) for each x is the probability distribution of X (also called the probability mass function, pmf).
Requirements:
- Pr(x) ≥ 0 for every x
- Σ_x Pr(x) = 1
Joint Distributions
The joint probability distribution of two random variables is the function given by Pr(X = x, Y = y), read "X equals x and Y equals y".
Example:
x, y            Pr(X = x, Y = y)
sunny, on-time  0.20
rain, on-time   0.20
snow, on-time   0.05
sunny, late     0.10
rain, late      0.30
snow, late      0.15
e.g. Pr(X = sunny, Y = on-time) = 0.20 is the probability that it is sunny and my flight is on time.
Marginal Distributions
The marginal distribution of X is defined by
Pr(x) = Σ_y Pr(x, y)
i.e. the distribution of X ignoring other variables.
This definition generalizes to more than two variables, e.g.
Pr(x) = Σ_y Σ_z Pr(x, y, z)
Marginal Distribution Example
Joint distribution:
x, y            Pr(X = x, Y = y)
sunny, on-time  0.20
rain, on-time   0.20
snow, on-time   0.05
sunny, late     0.10
rain, late      0.30
snow, late      0.15
Marginal distribution for X:
x      Pr(X = x)
sunny  0.3
rain   0.5
snow   0.2
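The marginalization above can be checked with a short Python sketch; the dictionary encoding of the weather/flight joint table is my own:

```python
# Marginalize the joint weather/flight-status distribution over Y.
joint = {
    ("sunny", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sunny", "late"):    0.10, ("rain", "late"):    0.30, ("snow", "late"):    0.15,
}

marginal_x = {}
for (x, y), p in joint.items():
    # Pr(x) = sum over y of Pr(x, y)
    marginal_x[x] = marginal_x.get(x, 0.0) + p
# marginal_x: sunny -> 0.3, rain -> 0.5, snow -> 0.2 (up to float rounding)
```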
Joint/Marginal Distributions
Two-variable case: the values of the joint distribution lie in the 4 × 4 square, and the values of the marginal distributions lie along the right and bottom margins.
Ref: https://en.wikipedia.org/wiki/marginal_distribution
Conditional Distributions
The conditional distribution of X given Y is defined as:
Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
i.e. the distribution of X given that we know the value of Y.
Conditional Distribution Example
Joint distribution:
x, y            Pr(X = x, Y = y)
sunny, on-time  0.20
rain, on-time   0.20
snow, on-time   0.05
sunny, late     0.10
rain, late      0.30
snow, late      0.15
Conditional distribution for X given Y = on-time:
x      Pr(X = x | Y = on-time)
sunny  0.20/0.45 ≈ 0.444
rain   0.20/0.45 ≈ 0.444
snow   0.05/0.45 ≈ 0.111
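The same conditioning can be done mechanically; a minimal sketch (the function name and dictionary encoding are my own):

```python
# Conditional distribution Pr(X = x | Y = "on-time") from the joint table.
joint = {
    ("sunny", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sunny", "late"):    0.10, ("rain", "late"):    0.30, ("snow", "late"):    0.15,
}

def conditional_x_given_y(joint, y):
    # Pr(Y = y): marginalize the joint over x
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    # Pr(x | y) = Pr(x, y) / Pr(y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_x_given_y(joint, "on-time")
# cond: sunny -> 0.20/0.45, rain -> 0.20/0.45, snow -> 0.05/0.45
```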
Independence
Two random variables, X and Y, are independent if
Pr(x, y) = Pr(x) Pr(y) for all x and y
Independence Example #1
Joint distribution:
x, y            Pr(X = x, Y = y)
sunny, on-time  0.20
rain, on-time   0.20
snow, on-time   0.05
sunny, late     0.10
rain, late      0.30
snow, late      0.15
Marginal distributions:
x      Pr(X = x)
sunny  0.3
rain   0.5
snow   0.2
y        Pr(Y = y)
on-time  0.45
late     0.55
Are X and Y independent here? NO: e.g. Pr(sunny, on-time) = 0.20 ≠ 0.3 × 0.45.
Independence Example #2
Joint distribution:
x, y                  Pr(X = x, Y = y)
sunny, fly-united     0.27
rain, fly-united      0.45
snow, fly-united      0.18
sunny, fly-northwest  0.03
rain, fly-northwest   0.05
snow, fly-northwest   0.02
Marginal distributions:
x      Pr(X = x)
sunny  0.3
rain   0.5
snow   0.2
y              Pr(Y = y)
fly-united     0.9
fly-northwest  0.1
Are X and Y independent here? YES: every joint entry equals the product of its marginals, e.g. Pr(sunny, fly-united) = 0.27 = 0.3 × 0.9.
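Both examples can be verified in code: a sketch that checks whether every joint entry factors into the product of its marginals (the helper name is my own):

```python
# Test Pr(x, y) == Pr(x) * Pr(y) for every (x, y) in a joint table.
def is_independent(joint, tol=1e-9):
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x in xs for y in ys)

weather_flight = {("sunny", "on-time"): 0.20, ("rain", "on-time"): 0.20,
                  ("snow", "on-time"): 0.05, ("sunny", "late"): 0.10,
                  ("rain", "late"): 0.30, ("snow", "late"): 0.15}
weather_airline = {("sunny", "fly-united"): 0.27, ("rain", "fly-united"): 0.45,
                   ("snow", "fly-united"): 0.18, ("sunny", "fly-northwest"): 0.03,
                   ("rain", "fly-northwest"): 0.05, ("snow", "fly-northwest"): 0.02}
# Example #1 is dependent; Example #2 is independent.
```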
Conditional Independence
Two random variables X and Y are conditionally independent given Z if
Pr(X | Y, Z) = Pr(X | Z)
i.e. once you know the value of Z, knowing Y doesn't tell you anything more about X.
Equivalently,
Pr(x, y | z) = Pr(x | z) Pr(y | z) for all x, y, z
Conditional Independence Example
Flu    Fever  Vomit  Pr
true   true   true   0.04
true   true   false  0.04
true   false  true   0.01
true   false  false  0.01
false  true   true   0.009
false  true   false  0.081
false  false  true   0.081
false  false  false  0.729
Fever and Vomit are not independent: e.g. Pr(fever, vomit) ≠ Pr(fever) Pr(vomit).
Fever and Vomit are conditionally independent given Flu:
Pr(fever, vomit | flu) = Pr(fever | flu) Pr(vomit | flu)
Pr(fever, vomit | ¬flu) = Pr(fever | ¬flu) Pr(vomit | ¬flu)
etc.
Bayes Theorem
Pr(x | y) = Pr(y | x) Pr(x) / Pr(y)
where Pr(y) = Σ_x Pr(y | x) Pr(x).
This theorem is extremely useful: there are many cases where it is hard to estimate Pr(x | y) directly, but not too hard to estimate Pr(y | x) and Pr(x).
Bayes Theorem Example
MDs usually aren't good at estimating Pr(Disorder | Symptom); they're usually better at estimating Pr(Symptom | Disorder).
If we can estimate Pr(Fever | Flu) and Pr(Flu), we can use Bayes Theorem to do diagnosis:
Pr(flu | fever) = Pr(fever | flu) Pr(flu) / [ Pr(fever | flu) Pr(flu) + Pr(fever | ¬flu) Pr(¬flu) ]
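As a sketch, the diagnosis formula in Python; the probability values below are invented for illustration and are not from the slides:

```python
# Bayes-rule diagnosis: Pr(flu | fever) from Pr(fever | flu), Pr(fever | ~flu),
# and the prior Pr(flu). All numeric inputs here are made-up examples.
def posterior_flu(p_fever_given_flu, p_fever_given_not_flu, p_flu):
    num = p_fever_given_flu * p_flu
    den = num + p_fever_given_not_flu * (1.0 - p_flu)   # Pr(fever), by total probability
    return num / den

p = posterior_flu(0.9, 0.1, 0.05)   # hypothetical numbers
```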
Expected Values
The expected value of a random variable that takes on numerical values is defined as
E[X] = Σ_x x Pr(x)
This is the same thing as the mean.
We can also talk about the expected value of a function of a random variable:
E[g(X)] = Σ_x g(x) Pr(x)
Expected Value Examples
E[Shoesize] = 5 × Pr(Shoesize = 5) + ... + 14 × Pr(Shoesize = 14)
E[gain(Lottery)] = gain(winning) Pr(winning) + gain(losing) Pr(losing)
                 = $100 × 0.001 − $1 × 0.999 = −$0.899
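The lottery computation as code, assuming (as the example suggests) a $100 prize, a $1 loss on a losing ticket, and win probability 0.001:

```python
# E[g(X)] = sum over x of g(x) * Pr(x), applied to the lottery example.
def expected_value(dist, g=lambda x: x):
    return sum(g(x) * p for x, p in dist.items())

lottery = {"win": 0.001, "lose": 0.999}
gain = {"win": 100, "lose": -1}   # dollars; assumed values for illustration
e_gain = expected_value(lottery, g=lambda x: gain[x])
# e_gain is about -0.899: on average you lose about 90 cents per ticket
```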
Markov Models in Computational Biology
There are many cases in which we would like to represent the statistical regularities of some class of sequences:
- genes
- various regulatory sites in DNA (e.g. where RNA polymerase and transcription factors bind)
- proteins in a given family
Markov models are well suited to this type of task.
References
Bioinformatics classic: Krogh et al. (1998) "An Introduction to Hidden Markov Models for Biological Sequences," Computational Methods in Mol. Biol., 45-63.
Books: Eddy & Durbin, Chapter 3; Jones and Pevzner, Chapter 11.
Tutorial: Rabiner, L. (1989) "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(2), 257-286.
Example
X: a 35-year-old customer with an income of $40,000 and a fair credit rating.
H: the hypothesis that the customer will buy a computer.
P(H | X): probability that the customer will buy a computer given that we know his age, credit rating and income (posterior)
P(H): probability that the customer will buy a computer regardless of age, credit rating, and income (prior)
P(X | H): probability that the customer is 35 years old, has a fair credit rating, and earns $40,000, given that he has bought our computer (likelihood)
P(X): probability that a person from our set of customers is 35 years old, has a fair credit rating, and earns $40,000
Maximum Likelihood Estimation (MLE)
A strategy for estimating the parameters of a statistical model given observations: choose the value that maximizes the probability of the observed data.
θ = argmax_θ P(x | θ), where x is the observation and θ is a parameter.
Maximum a Posteriori Estimation (MAP)
Estimating an unobserved quantity (or making a decision) by choosing the value that is most probable given the observed data and a prior belief:
θ = argmax_θ P(θ | x)
A Markov Chain Model
States A, C, G, T, plus a silent begin state; arrows between states are transitions.
Transition probabilities out of state g, for example:
Pr(x_i = a | x_{i-1} = g) = 0.16
Pr(x_i = c | x_{i-1} = g) = 0.34
Pr(x_i = g | x_{i-1} = g) = 0.38
Pr(x_i = t | x_{i-1} = g) = 0.12
Markov Chain Models
A Markov chain model is defined by a triple:
- a set of states: some states emit symbols; other states (e.g. the begin state) are silent
- a set of transitions with associated probabilities: the transitions emanating from a given state define a distribution over the possible next states
- initial state probabilities
Markov Chain Models
Given some sequence x of length L, we can ask: how probable is the sequence under our model?
For any probabilistic model of sequences, we can write this probability via the chain rule:
Pr(x) = Pr(x_L, x_{L-1}, ..., x_1)
      = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)
The key property of a (1st order) Markov chain is that the probability of each x_i depends only on the value of x_{i-1}:
Pr(x_i | x_{i-1}, ..., x_1) = Pr(x_i | x_{i-1})
so that
Pr(x) = Pr(x_1) ∏_{i=2}^{L} Pr(x_i | x_{i-1})
The Probability of a Sequence for a Given Markov Chain Model
(Model with states A, C, G, T and a begin state.)
Pr(cggt) = Pr(c) Pr(g | c) Pr(g | g) Pr(t | g)
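A sketch of this computation in Python; the uniform initial and transition probabilities below are placeholders, not the values from the figure:

```python
# Pr(x) = Pr(x1) * prod_{i>=2} Pr(x_i | x_{i-1}) for a 1st-order Markov chain.
# The uniform probabilities here are toy values for illustration.
init = {ch: 0.25 for ch in "acgt"}
trans = {s: {t: 0.25 for t in "acgt"} for s in "acgt"}

def sequence_prob(seq, init, trans):
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):   # consecutive pairs (x_{i-1}, x_i)
        p *= trans[prev][cur]
    return p

p_cggt = sequence_prob("cggt", init, trans)   # 0.25**4 under this toy model
```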
Markov Chain Models
A model can also have an end state; this allows the model to represent:
- a distribution over sequences of different lengths
- preferences for ending sequences with certain symbols
Markov Chain Notation
The transition parameters can be denoted by a_{x_{i-1} x_i}, where
a_{x_{i-1} x_i} = Pr(x_i | x_{i-1})
Similarly we can denote the probability of a sequence as
Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}
where a_{B x_1} represents the transition from the begin state.
Example Application: CpG Islands
CpG is a pair of nucleotides C and G, appearing successively in this order along one DNA strand.
CpG dinucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G, but the regions upstream of genes (promoter regions) are richer in CpG dinucleotides than elsewhere; such CpG islands are useful evidence for finding genes.
We could predict CpG islands with two Markov chains: one to represent CpG islands, one to represent the rest of the genome.
Estimating the Model Parameters
Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model?
One approach is maximum likelihood estimation: given a set of data D, set the parameters θ to maximize Pr(D | θ), i.e. make the data D look as likely as possible under the model.
Maximum Likelihood Estimation
Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t), and we're given the sequences:
accgcgctta
gcttagtgac
tagccgttac
Then the maximum likelihood estimates are:
Pr(a) = 6/30 = 0.2
Pr(c) = 9/30 = 0.3
Pr(g) = 7/30 ≈ 0.233
Pr(t) = 8/30 ≈ 0.267
Maximum Likelihood Estimation
Suppose instead we saw the following sequences:
gccgcgcttg
gcttggtggc
tggccgttgc
Then the maximum likelihood estimates are:
Pr(a) = 0/30 = 0
Pr(c) = 9/30 = 0.3
Pr(g) = 13/30 ≈ 0.433
Pr(t) = 8/30 ≈ 0.267
Do we really want to set Pr(a) to 0?
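The counting behind these estimates, as a short Python sketch:

```python
# Maximum likelihood estimates of Pr(a), Pr(c), Pr(g), Pr(t) by counting
# character frequencies in the given sequences.
from collections import Counter

seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
counts = Counter("".join(seqs))
total = sum(counts.values())                     # 30 characters in all
mle = {ch: counts.get(ch, 0) / total for ch in "acgt"}
# mle: a -> 0/30, c -> 9/30, g -> 13/30, t -> 8/30
```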
A Bayesian Approach
Instead of estimating parameters strictly from the data, we could start with some prior belief for each.
For example, we could use Laplace estimates (add-one pseudocounts):
Pr(a) = (n_a + 1) / Σ_i (n_i + 1)
where n_i represents the number of occurrences of character i.
Using Laplace estimates with the sequences
gccgcgcttg
gcttggtggc
tggccgttgc
Pr(a) = (0 + 1)/34
Pr(c) = (9 + 1)/34
A Bayesian Approach
A more general form, the m-estimate:
Pr(a) = (n_a + p_a m) / (Σ_i n_i + m)
where p_a is the prior probability of a and m is the number of "virtual" instances.
With m = 8 and uniform priors, for the sequences
gccgcgcttg
gcttggtggc
tggccgttgc
Pr(c) = (9 + 0.25 × 8) / (30 + 8) = 11/38 ≈ 0.289
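A sketch of the m-estimate (the function name is my own); with uniform priors and m equal to the alphabet size it reduces to the Laplace add-one estimate:

```python
# m-estimate of base probabilities: Pr(a) = (n_a + p_a * m) / (N + m).
from collections import Counter

def m_estimate(seqs, alphabet="acgt", m=8, priors=None):
    counts = Counter("".join(seqs))
    n = sum(counts.values())
    if priors is None:                            # uniform prior over the alphabet
        priors = {ch: 1.0 / len(alphabet) for ch in alphabet}
    return {ch: (counts.get(ch, 0) + priors[ch] * m) / (n + m) for ch in alphabet}

seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
est = m_estimate(seqs, m=8)       # est["c"] == (9 + 0.25*8) / (30 + 8) == 11/38
laplace = m_estimate(seqs, m=4)   # add-one smoothing: Pr(a) == (0 + 1) / 34
```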
Estimation for 1st Order Probabilities
To estimate a 1st-order parameter, such as Pr(c | g), we count the number of times that c follows the history g in our given sequences.
Using Laplace estimates with the sequences
gccgcgcttg
gcttggtggc
tggccgttgc
Pr(a | g) = (0 + 1)/(12 + 4)
Pr(c | g) = (7 + 1)/(12 + 4)
Pr(g | g) = (3 + 1)/(12 + 4)
Pr(t | g) = (2 + 1)/(12 + 4)
and similarly for other histories, e.g. Pr(a | c) = (0 + 1)/(7 + 4).
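The dinucleotide counting can be sketched as follows:

```python
# Laplace-smoothed 1st-order estimates Pr(t | s): count each occurrence of
# s followed by t, then add one pseudocount per possible successor.
from collections import Counter

seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
alphabet = "acgt"

pair_counts = Counter()
for seq in seqs:
    for s, t in zip(seq, seq[1:]):       # all adjacent pairs (x_{i-1}, x_i)
        pair_counts[(s, t)] += 1

def pr_given(s):
    n_s = sum(pair_counts[(s, t)] for t in alphabet)   # times s has a successor
    return {t: (pair_counts[(s, t)] + 1) / (n_s + len(alphabet)) for t in alphabet}

pr_g = pr_given("g")
# pr_g: a -> 1/16, c -> 8/16, g -> 4/16, t -> 3/16
```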
Markov Chains for Discrimination
Question: given a short stretch of genomic sequence, how could we decide whether it comes from a CpG island or not?
Given sequences from CpG islands, and sequences from other regions, we can construct:
- a model to represent CpG islands
- a null model to represent the other regions
We can then score a test sequence x by:
score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ]
Markov Chains for Discrimination
Why use score(x) = log [ Pr(x | CpG) / Pr(x | null) ]?
Bayes rule tells us
Pr(CpG | x) = Pr(x | CpG) Pr(CpG) / Pr(x)
Pr(null | x) = Pr(x | null) Pr(null) / Pr(x)
If we're not taking into account the prior probabilities of the two classes (Pr(CpG) and Pr(null)), then we just need to compare Pr(x | CpG) and Pr(x | null).
Pr(x | CpG model) = Pr(x_1) ∏_{i=2}^{L} a⁺_{x_{i-1} x_i}
where a⁺_st is the transition probability for the CpG island model from letter s to letter t, estimated as
a⁺_st = c⁺_st / Σ_t' c⁺_st'
and c⁺_st is the number of times letter s is followed by letter t. So the score can be further written as
score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ] = Σ_{i=1}^{L} log ( a⁺_{x_{i-1} x_i} / a⁻_{x_{i-1} x_i} )
Markov Chains for Discrimination
Parameters estimated for the CpG (+) and null (−) models from human sequences containing 48 CpG islands, about 60,000 nucleotides:
+   A    C    G    T
A  .18  .27  .43  .12
C  .17  .37  .27  .19
G  .16  .34  .38  .12
T  .08  .36  .38  .18
−   A    C    G    T
A  .30  .21  .28  .21
C  .32  .30  .08  .30
G  .25  .24  .30  .21
T  .18  .24  .29  .29
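Using these two transition tables, the log-odds score can be computed directly; this sketch sums over transitions only, ignoring the begin-state term as a simplification:

```python
# Log-odds score: score(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ),
# using the CpG (+) and null (-) transition tables from the slides.
import math

plus = {"a": {"a": .18, "c": .27, "g": .43, "t": .12},
        "c": {"a": .17, "c": .37, "g": .27, "t": .19},
        "g": {"a": .16, "c": .34, "g": .38, "t": .12},
        "t": {"a": .08, "c": .36, "g": .38, "t": .18}}
null = {"a": {"a": .30, "c": .21, "g": .28, "t": .21},
        "c": {"a": .32, "c": .30, "g": .08, "t": .30},
        "g": {"a": .25, "c": .24, "g": .30, "t": .21},
        "t": {"a": .18, "c": .24, "g": .29, "t": .29}}

def log_odds(seq):
    return sum(math.log(plus[s][t] / null[s][t]) for s, t in zip(seq, seq[1:]))

score = log_odds("cggt")   # positive score: "cggt" looks more CpG-island-like
```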
Markov Chains for Discrimination
Light bars represent negative sequences (non-CpG-island); dark bars represent positive sequences.
Higher Order Markov Chains
The Markov property specifies that the probability of a state depends only on the previous state, but we can build more memory into our states by using a higher-order Markov model.
In an nth-order Markov model:
Pr(x_i | x_{i-1}, ..., x_1) = Pr(x_i | x_{i-1}, ..., x_{i-n})
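As a minimal sketch of the idea, 2nd-order probabilities Pr(x_i | x_{i-2} x_{i-1}) can be estimated by counting two-character histories (no smoothing here, for illustration only; the function name is my own):

```python
# Estimate 2nd-order conditional probabilities by counting each
# two-character history and the symbol that follows it.
from collections import Counter

def second_order_estimates(seqs):
    counts = Counter()
    for seq in seqs:
        for i in range(2, len(seq)):
            counts[(seq[i - 2:i], seq[i])] += 1     # (history, next symbol)
    totals = Counter()
    for (hist, _), c in counts.items():
        totals[hist] += c
    return {(hist, ch): c / totals[hist] for (hist, ch), c in counts.items()}

est = second_order_estimates(["gccgcgcttg"])
# e.g. est[("cg", "c")] == 1.0: in this sequence, "cg" is always followed by c
```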