Probability and Estimation Alan Moses
Random variables and probability
- A random variable is like a variable in algebra (e.g., $y = e^x$), but where at least part of the variability is taken to be stochastic.
- Describes events (e.g., coin toss is heads)
- In practice, all observed data can be thought of as events (e.g., expression level of p53 is 10748.42)
- Probability theory is a mathematical way of understanding (and predicting) this stochastic variability
Why did probability theory originate?
Laws of probability
- Say X is a random variable and P(X=A) is the probability of event A (often written simply as P(X)): $\sum_A P(X=A) = 1$
- Say Y is another random variable; P(X=A, Y=32.7) is the joint probability of events A and 32.7 (often written simply as P(X,Y)): $\sum_X \sum_Y P(X,Y) = 1$
- If X and Y are independent, the joint distribution can be factored: $P(X=A, Y=32.7) = P(X=A)\,P(Y=32.7)$, or $P(X,Y) = P(X)\,P(Y)$
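To make these laws concrete, here is a minimal sketch (with made-up toy probabilities) that checks normalization, marginalization, and the independence factorization for a small discrete joint distribution:

```python
import numpy as np

# Hypothetical marginal distributions for two independent discrete variables.
p_x = np.array([0.2, 0.8])          # P(X): sums to 1
p_y = np.array([0.1, 0.3, 0.6])     # P(Y): sums to 1
joint = np.outer(p_x, p_y)          # independence: P(X, Y) = P(X) P(Y)

print(joint.sum())                  # sum over X and Y of P(X, Y) = 1
print(joint.sum(axis=1))            # marginalizing over Y recovers P(X)
print(np.allclose(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0))))  # factorization holds
```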
More laws of probability
- More generally, if $X_1 \ldots X_N$ are a series of independent random variables, $P(X_1 \ldots X_N) = \prod_{i=1}^{N} P(X_i)$
- $P(X=A \mid Y=32.7)$ is the conditional probability of event A given that event 32.7 already happened
- $P(X=A, Y=32.7) = P(X=A \mid Y=32.7)\,P(Y=32.7)$, or $P(X,Y) = P(X \mid Y)\,P(Y)$
Exercises:
- Prove Bayes' theorem: $P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y)}$
- Solve: $\sum_X P(X \mid Y) = \,?$ and $\sum_Y P(X \mid Y) = \,?$
- Show that the Poisson is correctly normalized: if $P(X \mid \lambda)$ is the Poisson pdf, $\sum_{X=0}^{\infty} P(X \mid \lambda) = 1$
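Not a proof, but a quick numerical sanity check of the last exercise (a sketch using scipy's Poisson pmf with an arbitrary rate):

```python
from scipy.stats import poisson

lam = 3.7  # arbitrary rate parameter for illustration
# Sum the Poisson pmf over enough terms that the tail is negligible;
# the analytic proof uses the series expansion of exp(lambda).
total = sum(poisson.pmf(x, lam) for x in range(200))
print(total)  # numerically 1.0
```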
Probabilistic models
- In biology, the Truth is usually unknown and very complicated
- We only consider the data we have and our ability to make a model of it
- We don't ask whether our model is True or Correct, only how well it fits the data
- We accept that a better model will always be possible
- Realistic (= complicated) models need more data than simple ones, so we usually try to choose a simple one
Probabilistic models
- Probability distributions are mathematical objects that include several functions:
  - pdf (probability distribution function)
  - cdf (cumulative distribution function)
  - probability generating function
  - moment generating function
- Distributions can be characterized by their moments (1st and 2nd moments are the mean and variance)
- What distribution describes my data? Continuous vs. discrete vs. ordinal distributions
Bernoulli, binomial, multinomial
- Major family of distributions for discrete events
- Bernoulli describes binary outcomes (heads/tails) in a single trial, based on a single parameter, say f: $P(X \mid f)$ with $P(X=1) = f$ and $P(X=0) = 1 - f$
- Binomial describes the number of positive outcomes in a series of N Bernoulli trials: say $Y = \sum_{i=1}^{N} X_i$
- Y is another random variable, whose distribution is $P(Y \mid f, N) = \binom{N}{Y} f^{Y} (1-f)^{N-Y}$
- Let's derive the Binomial pdf
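As a quick illustration (a sketch with an arbitrary f and N), summing simulated Bernoulli trials reproduces the binomial pmf:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
f, N = 0.3, 10                       # arbitrary success probability and number of trials

# Simulate many sets of N Bernoulli trials and sum each set to get Y.
trials = rng.random((100_000, N)) < f
y = trials.sum(axis=1)

# Compare the empirical frequency of Y = k with the binomial pmf.
for k in range(N + 1):
    print(k, np.mean(y == k), binom.pmf(k, N, f))
```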
Moments
- 1st moment is the mean or expectation: $E[X] = \sum_A P(X=A)\,A$, or $E[X] = \sum_X P(X)\,X$
- 2nd moment is the variance: $V[X] = \sum_X P(X)\,(X - E[X])^2 = E[(X - E[X])^2]$
- What are the mean and variance of the Bernoulli?
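A quick check of the Bernoulli answer (mean f, variance f(1 − f)), a sketch with an arbitrary f:

```python
f = 0.3                                 # arbitrary Bernoulli parameter
values, probs = [0, 1], [1 - f, f]      # the two outcomes and their probabilities

mean = sum(p * x for x, p in zip(values, probs))               # E[X] = f
var = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))  # V[X] = f(1 - f)
print(mean, var)
```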
Multivariate generalization
- What if there are more than 2 possibilities?
- Multinomial is the generalization of the binomial to the case of more than 2 possibilities, e.g., for DNA. The sequence ACGT might be written as: $X_1 = (1, 0, 0, 0)$, $X_2 = (0, 1, 0, 0)$, $X_3 = (0, 0, 1, 0)$, $X_4 = (0, 0, 0, 1)$
- A distribution $P(X \mid f)$ now needs $f = (f_A, f_C, f_G, f_T)$
- Dimensions are not independent; this can be quantified by correlation or covariance
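A minimal sketch of this one-hot encoding and of drawing counts from a multinomial (the nucleotide frequencies here are made up):

```python
import numpy as np

bases = ['A', 'C', 'G', 'T']
onehot = {b: np.eye(4, dtype=int)[i] for i, b in enumerate(bases)}
print([onehot[b] for b in "ACGT"])    # the four one-hot vectors X_1 .. X_4

f = np.array([0.3, 0.2, 0.2, 0.3])    # made-up f = (f_A, f_C, f_G, f_T); must sum to 1
rng = np.random.default_rng(0)
counts = rng.multinomial(100, f)      # counts of A, C, G, T in 100 multinomial draws
print(dict(zip(bases, counts)))
```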
Gaussian/Normal
- Major distribution for continuous events
- Gaussian describes the probability of observing real numbers between $-\infty$ and $\infty$, in terms of two parameters, $\mu$ and $\sigma$: $P(X \mid \mu, \sigma) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$
- $\int_{-\infty}^{\infty} P(X \mid \mu, \sigma)\,dx = 1$ because of the very special Gaussian integral
- May be due to Laplace in 1782
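A quick numerical check of the normalization (a sketch with arbitrary µ and σ):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 2.5                 # arbitrary parameters for illustration

def gaussian_pdf(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

area, _ = quad(gaussian_pdf, -np.inf, np.inf)
print(area)                          # numerically 1.0
```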
Why the Gaussian? What's so normal about it?
- Why do so many measurements in the real world follow such an obscure mathematical formula?
- E.g., the binomial converges to a Gaussian as N becomes large
  [Figure: binomial pmf for N=20, f=0.5 compared to a Gaussian]
- Why are random errors usually Gaussian?
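A sketch of the convergence claim, using the slide's N=20, f=0.5 and comparing the binomial pmf to a Gaussian with the matching mean Nf and variance Nf(1 − f):

```python
import numpy as np
from scipy.stats import binom, norm

N, f = 20, 0.5
mu, sigma = N * f, np.sqrt(N * f * (1 - f))   # matching mean and standard deviation

for k in range(N + 1):
    # The binomial pmf is already close to the Gaussian density at integer points.
    print(k, round(binom.pmf(k, N, f), 4), round(norm.pdf(k, mu, sigma), 4))
```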
Gaussian distribution has very special moments
- 1st moment is the mean or expectation: $E[X] = \int X\, P(X \mid \mu, \sigma)\,dx = \mu$
- 2nd moment is the variance: $V[X] = \int (X - \mu)^2\, P(X \mid \mu, \sigma)\,dx = \sigma^2$
- For this reason, the parameters of the Gaussian are named the mean and standard deviation.
Moments of continuous distributions
- In general, the mean and variance are functions of the parameters of the distribution, but not necessarily simple ones
- E.g., the Gumbel or Extreme Value Distribution (used for BLAST statistics) has pdf $P(X \mid \mu, \sigma) = \dfrac{1}{\sigma}\, e^{-\frac{X-\mu}{\sigma}}\, e^{-e^{-\frac{X-\mu}{\sigma}}}$
- $E[X] = \mu + \sigma\gamma$, where $\gamma \approx 0.577$ (Euler's constant)
- $V[X] = \dfrac{\pi^2 \sigma^2}{6}$
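A sketch checking those moment formulas by sampling from scipy's Gumbel distribution (with arbitrary µ and σ):

```python
import numpy as np
from scipy.stats import gumbel_r

mu, sigma = 2.0, 1.5                       # arbitrary location and scale
gamma = 0.5772156649                       # Euler's constant

samples = gumbel_r.rvs(loc=mu, scale=sigma, size=1_000_000, random_state=0)
print(samples.mean(), mu + sigma * gamma)            # E[X] = mu + sigma*gamma
print(samples.var(), np.pi ** 2 * sigma ** 2 / 6)    # V[X] = pi^2 sigma^2 / 6
```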
Multivariate Gaussian
- Now each observation is a vector, say $X_1 = (1.3, 4.6)$
- The mean is a vector, $\mu = (\mu_1, \mu_2)$
- The variance is a matrix: $\Sigma = \begin{pmatrix} V_{11} & V_{12} \\ V_{12} & V_{22} \end{pmatrix}$
- The diagonal elements are the variances in each dimension
- The off-diagonal elements are the covariances, which summarize the dependence between the dimensions
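A sketch that builds a 2-D Gaussian with a chosen covariance matrix (the numbers are made up) and checks the sample mean and covariance:

```python
import numpy as np

mu = np.array([1.0, -2.0])                 # made-up mean vector
Sigma = np.array([[2.0, 0.8],              # variances on the diagonal,
                  [0.8, 1.0]])             # covariance on the off-diagonal

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=100_000)
print(X.mean(axis=0))                      # close to mu
print(np.cov(X, rowvar=False))             # close to Sigma
```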
[Figures: multivariate Gaussian examples]
Parameter estimation
- Given some data and some probabilistic model, how do I infer the parameters? The technical name for this is estimation.
- Several methods exist, each yielding estimators:
  - Least squares methods (NWLS, MMSE)
  - Maximum likelihood (ML)
  - Maximum a posteriori probability (MAP)
- How are estimators evaluated? Consistency, bias, efficiency
Likelihood and MLEs
- Likelihood is the probability of the data (say X) given certain parameters (say $\theta$): $L = P(X \mid \theta)$
- Maximum likelihood estimation says: choose $\theta$ so that the data are most probable: $\dfrac{\partial L}{\partial \theta} = 0$
- In practice there are many ways to maximize the likelihood.
Example of ML estimation
Data:
  X_i:                       5.2          9.1          8.2          7.3          7.8
  P(X_i | µ=6.5, σ=1.5):     0.182737304  0.059227322  0.13996368   0.230761096  0.182737304

$L = P(X \mid \theta) = P(X_1 \ldots X_N \mid \theta) = \prod_{i=1}^{5} P(X_i \mid \mu=6.5, \sigma=1.5) = 6.39 \times 10^{-5}$
[Figure: likelihood L plotted against the mean, µ]
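A sketch reproducing the numbers in this table with scipy's normal pdf:

```python
import numpy as np
from scipy.stats import norm

data = np.array([5.2, 9.1, 8.2, 7.3, 7.8])
mu, sigma = 6.5, 1.5

densities = norm.pdf(data, loc=mu, scale=sigma)
print(densities)             # the per-point probabilities in the table
print(densities.prod())      # the likelihood L, about 6.39e-5
```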
Example of ML estimation
- In practice, we almost always use the log likelihood, which becomes a very large negative number when there is a lot of data
  [Figure: log likelihood, log(L), plotted against the mean, µ]
ML Estimation
- In general, the likelihood is a function of multiple variables, so the derivatives with respect to all of these should be zero at a maximum
- In the example of the Gaussian, we have two parameters, so that $\dfrac{\partial L}{\partial \mu} = 0$ and $\dfrac{\partial L}{\partial \sigma} = 0$
- In general, finding MLEs means solving a set of coupled equations, which usually have to be solved numerically for complex models.
MLEs for the Gaussian
- $\mu_{ML} = \dfrac{1}{N} \sum_X X$ and $V_{ML} = \dfrac{1}{N} \sum_X (X - \mu_{ML})^2$
- The Gaussian is the symmetric continuous distribution whose centre is a parameter given by what we consider the average.
- The MLE for the variance of the Gaussian is like the squared error from the mean, but it is actually a biased (though still consistent) estimator
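A sketch contrasting the ML (divide-by-N) variance with the unbiased (divide-by-N−1) version on simulated Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=6.5, scale=1.5, size=20)    # simulated data; true sigma^2 = 2.25

mu_ml = x.mean()                               # MLE of the mean
v_ml = np.mean((x - mu_ml) ** 2)               # MLE of the variance (divides by N; biased)
v_unbiased = x.var(ddof=1)                     # unbiased estimator (divides by N - 1)
print(mu_ml, v_ml, v_unbiased)
```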
Let s derive the MLEs for the Binomial distribution
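One way the derivation might go (a sketch, starting from the log likelihood of the binomial pmf given above):

```latex
\begin{align*}
\log L &= \log \binom{N}{Y} + Y \log f + (N - Y)\log(1 - f) \\
\frac{\partial \log L}{\partial f} &= \frac{Y}{f} - \frac{N - Y}{1 - f} = 0 \\
Y(1 - f) &= (N - Y)\,f \quad\Rightarrow\quad f_{ML} = \frac{Y}{N}
\end{align*}
```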
Properties of MLEs
- MLEs are asymptotically normal
- The mean of the MLE is the parameter you are trying to estimate, i.e., $E[\theta_{ML}] = \theta$
- The variance of the MLE is given by: $V[\theta_{ML}] = \left( E\!\left[ -\dfrac{\partial^2 \log L}{\partial \theta^2} \right] \right)^{-1}$, evaluated at $\theta_{ML}$
- Often written as $V[\theta_{ML}] = I^{-1}$, where the $E[\cdot]$ term is called the Fisher information
Likelihood and MLEs
- In general, the likelihood function might be too complicated to calculate the MLEs analytically
- MLEs can still be obtained by maximizing the likelihood using numerical methods:
  - Greedy: gradient ascent, Newton's method
  - Sampling methods: Gibbs, Metropolis
- The likelihood could have multiple maxima that make optimization difficult
- The more parameters the likelihood has, the more difficult the optimization problem
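A sketch of numerical maximum likelihood: minimize the negative log likelihood of the five data points from the earlier example with a general-purpose optimizer (gradient ascent or Newton's method would be alternatives):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

data = np.array([5.2, 9.1, 8.2, 7.3, 7.8])

def neg_log_likelihood(params):
    mu, log_sigma = params                    # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)                        # close to the sample mean and (biased) std
```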
MAP estimation
- We can write the posterior probability of the model, given some data, and compute it using Bayes' theorem: posterior = $P(\theta \mid X) = \dfrac{P(X \mid \theta)\,P(\theta)}{P(X)}$, where $P(X) = \sum_\theta P(X \mid \theta)\,P(\theta)$
- $P(\theta)$ is a prior distribution
- The MAP estimate says: maximize the posterior, i.e., weight the likelihood $L = P(X \mid \theta)$ by prior beliefs $P(\theta)$
- We don't usually need to integrate over all the parameters: $P(X)$ does not depend on $\theta$
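A sketch of MAP estimation on a grid, for a Bernoulli parameter f with an assumed Beta prior (all numbers are made up); note that the normalizing constant P(X) is not needed to find the maximum:

```python
import numpy as np
from scipy.stats import beta

# Made-up data: 7 successes out of 10 Bernoulli trials.
successes, trials = 7, 10

grid = np.linspace(0.001, 0.999, 999)                 # candidate values of f
likelihood = grid ** successes * (1 - grid) ** (trials - successes)
prior = beta.pdf(grid, 2, 2)                          # assumed Beta(2, 2) prior

posterior_unnorm = likelihood * prior                 # proportional to the posterior
f_map = grid[np.argmax(posterior_unnorm)]
f_ml = grid[np.argmax(likelihood)]
print(f_ml, f_map)                                    # 0.7 vs (7+1)/(10+2) = 0.667
```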
Bayesian methods
- Do we really need estimators for all parameters? Nuisance parameters
- Instead, use the whole posterior distribution: use its moments, quantiles, etc.
- Integrate the function you care about (e.g., squared error) over the posterior directly
- E.g., the mean of the posterior distribution: $E[\theta] = \sum_\theta P(\theta \mid X)\,\theta$
- Now we do need to integrate over the prior distribution
Conjugate priors
- Special, mathematically convenient prior distributions that make integrations possible/easier
- Priors are distributions on the parameters, often exotic distributions rarely needed for real data
- The parameters of these distributions are called hyperparameters and also have to be estimated or integrated away
- E.g., for a Bernoulli, the parameter is a continuous value between 0 and 1 (the prior is a Beta)
- E.g., for a Gaussian, the prior on the mean is a Gaussian, but the prior on the standard deviation (which is always > 0) is exotic (inverse Gamma?)
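A sketch of the Beta–Bernoulli conjugacy: with a Beta(α, β) prior, the posterior after observing Y successes in N trials is Beta(α + Y, β + N − Y), so no numerical integration is needed (the hyperparameters and data below are made up):

```python
from scipy.stats import beta

alpha, beta_hp = 2.0, 2.0          # made-up Beta hyperparameters (the prior)
successes, trials = 7, 10          # made-up Bernoulli data

# Conjugacy: the posterior is again a Beta, with updated hyperparameters.
post = beta(alpha + successes, beta_hp + trials - successes)
print(post.mean())                 # posterior mean of f
print(post.interval(0.95))         # a 95% credible interval for f
```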