Likelihood

Let P(D | H) be the probability that an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D as variable. Before the experiment the data D are unknown, and the probability measures how probable it is that D will occur rather than some other data value. After the experiment the data D are known, and P(D | H), with D fixed and H varying, can be regarded as a measure of how likely it is that H is the true hypothesis, rather than another, after observing data D.

When P(D | H) is regarded as a function of H rather than D we call P(D | H) the likelihood of H (if considering a single H) or the likelihood function (if considering all possible H). We regard H1 as more compatible with the data than H2 if P(D | H1) > P(D | H2). Usually H specifies the value of a parameter, or set of parameters, and the likelihood is the probability (or probability density) of the data values, considered as a function of the parameters of the distribution.

It is more convenient to work with the (natural) log of the likelihood function than with the likelihood itself. When comparing hypotheses or parameter values, only relative values of the likelihood matter (ratios of likelihoods, or differences between log-likelihoods). Any multiplicative factor in the likelihood (or additive term in the log-likelihood) that does not involve the parameter can be ignored. See the examples below.

The laws of probability apply when we regard P(D | H) as a function of D. They do not apply when P(D | H) is regarded as a function of H.

Binomial distribution

Imagine we perform a simple experiment (e.g. tossing a coin) n times, and each trial results in success (with probability p) or failure (with probability 1 − p). Each success contributes a factor p to the likelihood, and each failure contributes a factor 1 − p. The corresponding (additive) contributions to the log-likelihood are log p and log(1 − p).
If there are Y successes and n − Y failures, the log-likelihood is

    Y log p + (n − Y) log(1 − p).

The maximum value of the (log-)likelihood occurs when p = Y/n, which we call the maximum likelihood or ML estimate of p.

Poisson distribution

If Y is Poisson with mean m, the log-likelihood for m is Y log m − m. The ML estimate of m is Y. If we have a sample of size n from this distn, the log-likelihood is the sum of the contributions from each sample value, (ΣYi) log m − nm, and the ML estimate of m is Ȳ = (ΣYi)/n.

Multinomial distribution

This is a generalization of the binomial distribution in which each trial has k possible outcomes, with probabilities p1, ..., pk. In n independent trials the first outcome occurs Y1 times, the second Y2 times, etc. The binomial distn is the case k = 2. By an argument analogous to that used above for the binomial distribution, the log-likelihood is

    Y1 log p1 + Y2 log p2 + ... + Yk log pk.

With no restriction on the probabilities apart from p1 + ... + pk = 1, the ML estimate of pi is Yi/n, for i = 1 to k. Multinomial applications of ML are more interesting when a null hypothesis imposes constraints on the probabilities. See below.
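The binomial result p̂ = Y/n is easy to check numerically. The following is an illustrative sketch (in Python, with made-up counts, not part of the original notes): maximizing the binomial log-likelihood over a fine grid of p values recovers Y/n.

```python
import numpy as np

# Illustrative check with made-up counts: Y = 30 successes in n = 100 trials.
Y, n = 30, 100

# Binomial log-likelihood over a grid of p values.
# The additive constant log C(n, Y) is dropped, as the notes allow.
p = np.linspace(0.001, 0.999, 9999)
loglik = Y * np.log(p) + (n - Y) * np.log(1 - p)

p_hat = p[np.argmax(loglik)]
print(p_hat)    # very close to Y/n = 0.3
```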
Normal distribution

We have a random sample from N(m, v). The sample mean is Ȳ and the corrected sum of squares is Syy. After some simplification,

    −2 log L = n log v + (1/v)[Syy + n(Ȳ − m)²].

The final term is minimized when m = Ȳ, so the ML estimate of m is the sample mean. Substituting this value for m leads to the profile likelihood for v,

    −2 log L = n log v + Syy/v,

which is minimized when v = Syy/n. The ML estimator in this case is biased, and this is generally true of ML estimators of variance components.

The following argument is often used when estimating variance components. The value of Syy gives us information about v, and, if m is known, so does (Ȳ − m)². But if, as is usually the case, m is unknown, the information in (Ȳ − m)² is not available. Maximum likelihood estimation of v should therefore be based on the probability distn of Syy alone (rather than the joint distn of Syy and Ȳ). This argument leads to a modified (restricted, or residual) likelihood with

    −2 log L = (n − 1) log v + Syy/v

(based on the fact that the distn of Syy/v is chi-squared with n − 1 d.f.). The estimate obtained by maximizing the modified likelihood is called the REML estimate. Here the REML estimate is the sample variance, Syy/(n − 1).

In the previous examples the estimators could have been guessed without invoking ML. In the following examples the estimator is less obvious.

Genetic linkage

The progeny of two maize plants are categorized as starchy or sugary, with leaves that are either green or white. The combinations starchy-green, starchy-white, sugary-green and sugary-white occur with probabilities (2 + θ)/4, (1 − θ)/4, (1 − θ)/4, θ/4. (Starchiness and leaf colour are determined at two genetic loci. The parameter θ is a measure of linkage between the two loci.) With frequencies Y1, Y2, Y3, Y4 in the four categories, the log-likelihood is

    const + Y1 log(2 + θ) + (Y2 + Y3) log(1 − θ) + Y4 log θ.

The log-likelihood shown in slide...
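The contrast between the ML and REML variance estimates amounts to dividing Syy by n or by n − 1. A quick Python sketch (with an invented sample) makes the point concrete:

```python
import numpy as np

# Hypothetical sample: ML divides the corrected sum of squares by n,
# REML divides it by n - 1 (the usual unbiased sample variance).
y = np.array([4.1, 5.3, 6.0, 5.5, 4.8, 5.9])
n = len(y)
S_yy = np.sum((y - y.mean())**2)    # corrected sum of squares

v_ml = S_yy / n                     # ML estimate (biased downwards)
v_reml = S_yy / (n - 1)             # REML estimate = sample variance
print(v_ml, v_reml)
```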
is based on the following data for 3839 progeny:

    Starchy         Sugary
    Green   White   Green   White   Total
    1997    906     904     32      3839

The ML estimate of θ is 0.0358 ± 0.0060. See Q3 on the problem sheet. The standard error is calculated from the curvature of the log-likelihood at the maximum.

Usually there is no explicit expression for the ML estimate, and it has to be found by numerical optimization. This is an iterative process, based on successive refinements of an initial guess.
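The numerical optimization can be sketched as follows (a Python illustration rather than the R used later in these notes): maximize the linkage log-likelihood over a fine grid of θ values, and approximate the curvature at the maximum by a finite difference to obtain the standard error.

```python
import numpy as np

# Maize linkage data from the notes: 1997, 906, 904, 32 (total 3839).
y1, y23, y4 = 1997, 906 + 904, 32

def loglik(t):
    # const + Y1 log(2+t) + (Y2+Y3) log(1-t) + Y4 log t, constant dropped
    return y1 * np.log(2 + t) + y23 * np.log(1 - t) + y4 * np.log(t)

t = np.linspace(1e-4, 0.2, 200001)
theta_hat = t[np.argmax(loglik(t))]

# Standard error from the curvature of the log-likelihood at the maximum,
# approximated by a second central difference:
h = 1e-5
curv = (loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2
se = 1 / np.sqrt(-curv)
print(theta_hat, se)    # approximately 0.036 and 0.006
```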
ABO blood group system

In the ABO system, three alleles (A, B and O) at a single locus give rise to six genotypes and four phenotypes. The unknown frequencies of the A, B and O alleles are p, q and r (p + q + r = 1).

    Phenotype   Genotype(s)   Probability   Observed frequency
    A           AA, AO        p² + 2pr      nA
    B           BB, BO        q² + 2qr      nB
    AB          AB            2pq           nAB
    O           OO            r²            nO

After simplification, the log-likelihood is

    (nA + nAB) log p + (nB + nAB) log q + nA log(p + 2r) + nB log(q + 2r) + 2 nO log r.

This can be expressed as a function of just two parameters (p and q, for example) by setting r = 1 − p − q. See the question on the problem sheet.

Likelihood ratio test

The likelihood gives us a general method for finding estimators of unknown parameters. It also provides a statistic for hypothesis testing. Suppose a model specifies that the parameter vector ψ belongs to (lives in) a d-dimensional space, so that there are d free parameters. For example, if we have multinomial data with k probabilities which sum to 1, then d = k − 1. The null hypothesis H0 specifies that ψ is restricted to a subspace of dimension s. In other words, H0 expresses ψ in terms of s < d free parameters (s = 0 when the vector of probabilities is completely specified by H0).

Denote by lc and lu the maximized log-likelihood with and without the constraints. The likelihood ratio test statistic is 2(lu − lc). In large samples, the null distn of the LRT statistic is approximately chi-squared with d − s d.f.

For example, suppose we have sample counts of genotypes A1A1, A1A2 and A2A2. In the population, the genotype probabilities (or proportions) are p1, p2 and p3. The vector parameter is ψ = (p1, p2, p3). Because p1 + p2 + p3 = 1, the number of free parameters is d = 3 − 1 = 2. The hypothesis of Hardy-Weinberg equilibrium states that p1 = θ², p2 = 2θ(1 − θ), p3 = (1 − θ)², for some unspecified value of θ. Here d = 2, s = 1, d − s = 1.
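The Hardy-Weinberg test can be sketched numerically. In the Python illustration below the genotype counts are invented; the constrained estimate θ̂ is the sample A1 allele frequency (2Y1 + Y2)/(2n), a standard result assumed here rather than derived in the notes.

```python
import numpy as np

# Hypothetical genotype counts for A1A1, A1A2, A2A2 (invented for illustration).
y = np.array([30, 52, 18])
n = y.sum()

# Unconstrained ML: p_i = y_i / n  (d = 2 free parameters).
p_u = y / n
l_u = np.sum(y * np.log(p_u))

# Under Hardy-Weinberg (s = 1), theta_hat is the A1 allele frequency:
theta = (2 * y[0] + y[1]) / (2 * n)
p_c = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])
l_c = np.sum(y * np.log(p_c))

lrt = 2 * (l_u - l_c)   # approximately chi-squared with d - s = 1 d.f.
print(lrt)
```

A value well below 3.84 (as here) gives no evidence against Hardy-Weinberg equilibrium.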
A trivial example

Take the unconstrained model to be that of the genetic linkage example above, and the constrained model to be H0: θ = θ0, where θ0 is a single specified value. Here d = 1, s = 0. In the figure on slide..., lu is the maximum value of the log-likelihood, and lc is the value at θ0. The LRT statistic is twice the difference between these values. The null distn is chi-squared with d − s = 1 d.f. H0 is rejected at the 5% level if 2(lu − lc) > 3.84.

The confidence interval shown in slide... is obtained as the set of all θ0 values not rejected by this test. These satisfy the condition lc > lu − 1.92. In the figure, the arbitrary constant in the log-likelihood has been chosen so that the maximum value is zero.
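The confidence-interval construction can be sketched directly for the linkage example (a Python illustration using the counts from the notes): keep every θ0 whose log-likelihood is within 1.92 of the maximum.

```python
import numpy as np

# Linkage log-likelihood from the notes (counts 1997, 906 + 904 = 1810, 32).
def loglik(t):
    return 1997 * np.log(2 + t) + 1810 * np.log(1 - t) + 32 * np.log(t)

t = np.linspace(1e-4, 0.2, 200001)
l = loglik(t)

# theta0 values not rejected at the 5% level: l(theta0) > l_max - 1.92
keep = l > l.max() - 1.92
lo, hi = t[keep].min(), t[keep].max()
print(lo, hi)   # approximate 95% confidence interval for theta
```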
Binomial regression

For i = 1, ..., k, Yi is binomial with index ni and parameter pi. Associated with each observation Yi is the value of an explanatory variable Xi. The generalized regression model for pi is

    log( pi / (1 − pi) ) = b0 + b1 Xi        (logit transform)

which resembles a standard regression equation. The log-likelihood for the parameters b0 and b1 (details omitted) is

    Σ [ Yi log pi + (ni − Yi) log(1 − pi) ]

where pi is given by the regression equation above,

    pi = exp(b0 + b1 Xi) / (1 + exp(b0 + b1 Xi))        (logistic function).

Example: samples of 50 people were taken at five different ages, and the number in each group affected by a disease was counted. Incidence of the disease obviously increases with age.

    Age      20   35    5   55   70
    Number    6   17   26   37

The slope estimate is b̂1 = 0.081 ± 0.0108. The fitted logistic curve is shown in slide... The value of the LRT statistic for testing b1 = 0 is 81.83 with 1 d.f.

Log-linear models

Frequencies Y1, ..., Yk are assumed independently Poisson distributed with means m1, ..., mk. In a log-linear model we assume that log mi = b0 + b1 Xi + ..., or, equivalently, that mi = exp(b0 + b1 Xi + ...), but this assumption is not used in deriving the LRT. The log-likelihood is

    Σ (Yi log mi − mi).

With no restrictions on the means, the ML estimate of mi is Yi, and the unconstrained maximum log-likelihood is

    Σ (Yi log Yi − Yi)        (unconstrained).

Now consider a null hypothesis H0 which specifies a model with s < k parameters. Denote the ML estimates under H0 by m̂1, ..., m̂k. The maximum log-likelihood under H0 is

    Σ (Yi log m̂i − m̂i)        (constrained).
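The binomial-regression log-likelihood above can be maximized by Newton-Raphson (equivalently, iteratively reweighted least squares). The Python sketch below is illustrative only: the table of ages and counts in the notes is incomplete, so the data here are invented, and the fitted values will not match the slide.

```python
import numpy as np

# Invented grouped data (50 people per age group); NOT the data from the notes.
X = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
N = np.full(5, 50.0)
Y = np.array([5.0, 14.0, 27.0, 38.0, 46.0])

A = np.column_stack([np.ones_like(X), X])   # design matrix for b0 + b1*X

b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(A @ b)))          # logistic function
    W = N * p * (1 - p)                     # binomial weights
    grad = A.T @ (Y - N * p)                # score vector
    H = A.T @ (A * W[:, None])              # Fisher information
    b = b + np.linalg.solve(H, grad)        # Newton step

print(b)    # slope b[1] is positive: incidence increases with age
```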
The LRT statistic is twice the difference between the unconstrained and constrained maxima:

    2 Σ Yi log(Yi/m̂i) − 2 Σ (Yi − m̂i).

(The second term is zero if the model includes an intercept, which it nearly always does.) In large samples, the null distn of the LRT is approximately chi-squared with k − s d.f.

When used with the multinomial, binomial or Poisson distn, the LRT statistic is called the residual deviance and denoted G². It is an alternative to the chi-squared goodness-of-fit statistic:

    X² = Σ (O − E)²/E,        G² = 2 Σ O log(O/E).

The two statistics have the same d.f., and are usually similar in magnitude.

Generalized linear models

The binomial regression and log-linear models are examples of a generalized linear model (GLM). In the linear (multiple regression) model, the response variable Y is assumed to be normally distributed with constant variance. The mean value E(Y) is assumed to be related to predictor variables X1, X2, ...:

    E(Y) = b0 + b1 X1 + b2 X2 + ...

The expression on the right-hand side is the linear predictor. With a GLM,

a) other distns are allowed for Y (binomial, Poisson, exponential);
b) var(Y) is allowed to depend on E(Y);
c) the linear predictor is related to a function of E(Y) (the link function).

Estimation of the parameters is by ML. The linear predictor can include any mix of covariates and factors, just as for the multiple regression model. E.g. with Poisson data categorized by two factors we can analyse an association table. This provides an alternative to the chi-squared association test.

Example: test for association

The test for association is a test of proportionality of the frequencies in the two-way table, and this is equivalent to additivity on the log scale.
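A minimal numerical illustration of G² and X² (a Python sketch with invented Poisson counts, testing the constrained model "all means equal"):

```python
import numpy as np

# Invented Poisson counts; H0: all means equal (s = 1 parameter).
Y = np.array([12.0, 18.0, 9.0, 15.0])
m_hat = np.full(4, Y.mean())        # ML estimate of the common mean

# G2 = 2 * sum(Y log(Y/m_hat)) - 2 * sum(Y - m_hat);
# the second term vanishes because the model contains an intercept.
G2 = 2 * np.sum(Y * np.log(Y / m_hat)) - 2 * np.sum(Y - m_hat)
X2 = np.sum((Y - m_hat)**2 / m_hat) # Pearson chi-squared
print(G2, X2)                       # similar in magnitude, k - s = 3 d.f.
```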
An alternative to the chi-squared test assumes the table counts are Poisson distributed and fits the log-linear model

    log m = (Intercept) + Row effect + Column effect.

For a table with r rows and c columns, the unrestricted number of parameters is rc, and the null hypothesis specifies an additive model with 1 + (r − 1) + (c − 1) parameters. The LRT statistic therefore has rc − (r + c − 1) = (r − 1)(c − 1) d.f.

Using R

In R, simple ML calculations are performed with mle(). Create a function to evaluate minus the log-likelihood, then pass this function, together with starting values, to mle(). For the genetic linkage example,
    minuslogl <- function(x = 0.1) {
        if (x^2 < x)  # TRUE exactly when 0 < x < 1
            -1997 * log(2 + x) - 1810 * log(1 - x) - 32 * log(x)
        else Inf
    }
    library(stats4)
    fit <- mle(minuslogl)
    summary(fit)

For the ABO example, the function has two arguments:

    na <- 212; nb <- 103; nab <- 39; no <- 19
    minuslogl <- function(p, q) {
        r <- 1 - p - q
        -(na + nab) * log(p) - ...
    }

The mle() function is part of the stats4 package. Use library() just once to load the package; mle() is then available for the rest of your R session.

    # Binomial regression
    X <- c(20, 35, 5, 55, 70)
    R <- c(6, 17, 26, 37, )
    N <- rep(50, 5)
    fit <- glm(R/N ~ X, weights = N, family = binomial(link = "logit"))
    summary(fit)

The output from summary(fit) gives two deviances: the null deviance, 82.1 with 4 d.f., and the residual deviance, 0.32 with 3 d.f. The difference between these, 81.82 with 1 d.f., is the LRT statistic for the hypothesis b1 = 0 (very significant, off the scale).

An example of a 2 × 2 association table (tonsils data from week 3):

    freqs <- c(19, 53, 97, 829)
    carrier <- gl(2, 2, labels = c("yes", "no"))
    enlarged <- gl(2, 1, 4, labels = c("no", "yes"))
    fit <- glm(freqs ~ carrier + enlarged, family = poisson(link = "log"))
    summary(fit)

The residual deviance is the LRT statistic equivalent to the chi-squared statistic. When using family = poisson, the chi-squared statistic can be obtained with

    sum(resid(fit, type = "pearson")^2)

and the fitted values (the same as for the chi-squared test) are given by fitted(fit).
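The log-linear fit for the 2 × 2 table can be cross-checked by hand, since under the additive (independence) model the fitted value in each cell is row total × column total / grand total. A Python sketch using the tonsils counts from the notes:

```python
import numpy as np

# Tonsils table from the notes: rows = carrier yes/no, cols = enlarged no/yes.
O = np.array([[19.0, 53.0], [97.0, 829.0]])

# Expected counts under independence (row total x column total / grand total),
# i.e. the fitted values of the additive log-linear model.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

G2 = 2 * np.sum(O * np.log(O / E))  # residual deviance
X2 = np.sum((O - E)**2 / E)         # Pearson chi-squared
print(G2, X2)                       # both on (2-1)(2-1) = 1 d.f.
```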