The probability density function of a continuous random variable Y (or the probability mass function if Y is discrete) is referred to simply as a probability distribution and denoted by f(y; θ), where θ represents the parameters of the distribution.

We use dot (•) subscripts for summation and bars (¯) for means; thus

    \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{1}{N} y_\bullet .

The expected value and variance of a random variable Y are denoted by E(Y) and var(Y) respectively. Suppose the random variables Y_1, ..., Y_n are independent with E(Y_i) = µ_i and var(Y_i) = σ_i² for i = 1, ..., n. Let the random variable W be a linear combination of the Y_i's,

    W = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n,    (1.1)

where the a_i's are constants. Then the expected value of W is

    E(W) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n    (1.2)

and its variance is

    var(W) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2.    (1.3)
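As an aside, equations (1.2) and (1.3) are easy to check by simulation. The sketch below assumes Python with numpy; the values of a, µ, and σ² are arbitrary illustrations, not taken from the text.

```python
import numpy as np

# Minimal Monte Carlo check of equations (1.2) and (1.3).
# The coefficients, means and variances below are arbitrary choices.
rng = np.random.default_rng(0)
a = np.array([2.0, -1.0, 0.5])
mu = np.array([1.0, 3.0, -2.0])
sigma2 = np.array([4.0, 1.0, 9.0])

EW = a @ mu             # equation (1.2)
varW = a**2 @ sigma2    # equation (1.3)

# Independent Y_i (Normal is a convenient choice; (1.2)-(1.3) need only independence).
Y = rng.normal(mu, np.sqrt(sigma2), size=(100_000, 3))
W = Y @ a
print(EW, W.mean())     # the two numbers in each pair should agree closely
print(varW, W.var())
```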

1.4 Distributions related to the Normal distribution

The sampling distributions of many of the estimators and test statistics used in this book depend on the Normal distribution. They do so either directly, because they are derived from Normally distributed random variables, or asymptotically, via the Central Limit Theorem for large samples. This section gives definitions and notation for these distributions and summarizes the relationships between them. The exercises at the end of the chapter provide practice in using these results, which are employed extensively in subsequent chapters.

1.4.1 Normal distributions

1. If the random variable Y has the Normal distribution with mean µ and variance σ², its probability density function is

    f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right].

We denote this by Y ~ N(µ, σ²).

2. The Normal distribution with µ = 0 and σ² = 1, Y ~ N(0, 1), is called the standard Normal distribution.

3. Let Y_1, ..., Y_n denote Normally distributed random variables with Y_i ~ N(µ_i, σ_i²) for i = 1, ..., n, and let the covariance of Y_i and Y_j be denoted by cov(Y_i, Y_j) = ρ_{ij} σ_i σ_j, where ρ_{ij} is the correlation coefficient for Y_i and Y_j. Then the joint distribution of the Y_i's is the multivariate Normal distribution with mean vector µ = [µ_1, ..., µ_n]^T and variance-covariance matrix V with diagonal elements σ_i² and off-diagonal elements ρ_{ij} σ_i σ_j for i ≠ j. We write this as y ~ N(µ, V), where y = [Y_1, ..., Y_n]^T.

4. Suppose the random variables Y_1, ..., Y_n are independent and Normally distributed, Y_i ~ N(µ_i, σ_i²) for i = 1, ..., n, and let W = a_1 Y_1 + a_2 Y_2 + ... + a_n Y_n, where the a_i's are constants. Then W is also Normally distributed, so that

    W = \sum_{i=1}^{n} a_i Y_i \sim N\left( \sum_{i=1}^{n} a_i \mu_i, \; \sum_{i=1}^{n} a_i^2 \sigma_i^2 \right)

by equations (1.2) and (1.3). (A numerical illustration follows this list.)
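The Normality claim in item 4 (not just the moments) can be checked empirically. A minimal sketch, assuming Python with numpy and scipy and arbitrary example values: it compares simulated values of W against the exact Normal distribution implied by (1.2) and (1.3) via a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = np.array([1.0, -2.0, 0.5])       # arbitrary coefficients
mu = np.array([0.0, 1.0, 2.0])       # arbitrary means
sigma = np.array([1.0, 2.0, 0.5])    # arbitrary standard deviations

W = rng.normal(mu, sigma, size=(50_000, 3)) @ a
EW = a @ mu                          # mean from (1.2)
sdW = np.sqrt(a**2 @ sigma**2)       # standard deviation from (1.3)

# KS test against the claimed N(EW, sdW^2): a large p-value means
# no evidence against exact Normality of W.
print(stats.kstest(W, 'norm', args=(EW, sdW)))
```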

1.4.2 Chi-squared distribution

1. The central chi-squared distribution with n degrees of freedom is defined as the sum of squares of n independent random variables Z_1, ..., Z_n, each with the standard Normal distribution. It is denoted by

    X^2 = \sum_{i=1}^{n} Z_i^2 \sim \chi^2(n).

In matrix notation, if z = [Z_1, ..., Z_n]^T, then z^T z = Σ Z_i², so that X² = z^T z ~ χ²(n).

2. If X² has the distribution χ²(n), then its expected value is E(X²) = n and its variance is var(X²) = 2n.

3. If Y_1, ..., Y_n are independent Normally distributed random variables, each with the distribution Y_i ~ N(µ_i, σ_i²), then

    X^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \mu_i}{\sigma_i} \right)^2 \sim \chi^2(n)    (1.4)

because each of the variables Z_i = (Y_i − µ_i)/σ_i has the standard Normal distribution N(0, 1).

4. Let Z_1, ..., Z_n be independent random variables, each with the distribution N(0, 1), and let Y_i = Z_i + µ_i, where at least one of the µ_i's is non-zero. Then the distribution of

    \sum Y_i^2 = \sum (Z_i + \mu_i)^2 = \sum Z_i^2 + 2 \sum Z_i \mu_i + \sum \mu_i^2

has larger mean n + λ and larger variance 2n + 4λ than χ²(n), where λ = Σ µ_i². This is called the non-central chi-squared distribution with n degrees of freedom and non-centrality parameter λ. It is denoted by χ²(n, λ). (Items 2 and 4 are checked by simulation in the sketch after this list.)

5. Suppose that the Y_i's are not necessarily independent and that the vector y = [Y_1, ..., Y_n]^T has the multivariate Normal distribution y ~ N(µ, V), where the variance-covariance matrix V is non-singular with inverse V⁻¹. Then

    X^2 = (y - \mu)^T V^{-1} (y - \mu) \sim \chi^2(n).    (1.5)

6. More generally, if y ~ N(µ, V), then the random variable y^T V⁻¹ y has the non-central chi-squared distribution χ²(n, λ), where λ = µ^T V⁻¹ µ.

7. If X_1², ..., X_m² are m independent random variables with the chi-squared distributions X_i² ~ χ²(n_i, λ_i), which may or may not be central, then their sum also has a chi-squared distribution with Σ n_i degrees of freedom and non-centrality parameter Σ λ_i, i.e.,

    \sum_{i=1}^{m} X_i^2 \sim \chi^2\left( \sum_{i=1}^{m} n_i, \; \sum_{i=1}^{m} \lambda_i \right).

This is called the reproductive property of the chi-squared distribution.

8. Let y ~ N(µ, V), where y has n elements but the Y_i's are not independent, so that V is singular with rank k < n and the inverse of V is not uniquely defined. Let V⁻ denote a generalized inverse of V. Then the random variable y^T V⁻ y has the non-central chi-squared distribution with k degrees of freedom and non-centrality parameter λ = µ^T V⁻ µ.

For further details about properties of the chi-squared distribution see Rao (1973, Chapter 3).
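A short simulation sketch of items 2 and 4, assuming numpy (the µ vector is an arbitrary choice): the sample moments of central and non-central chi-squared variables are compared with n, 2n, n + λ, and 2n + 4λ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
mu = np.array([1.0, 0.0, 2.0, 0.0, -1.0])   # arbitrary shift vector
lam = np.sum(mu**2)                          # non-centrality parameter

Z = rng.standard_normal((200_000, n))
X2 = (Z**2).sum(axis=1)                      # ~ chi2(n), item 1
X2nc = ((Z + mu)**2).sum(axis=1)             # ~ chi2(n, lam), item 4

print(X2.mean(), n)                          # ~ n
print(X2.var(), 2 * n)                       # ~ 2n
print(X2nc.mean(), n + lam)                  # ~ n + lambda
print(X2nc.var(), 2 * n + 4 * lam)           # ~ 2n + 4*lambda
```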

1.4.3 t-distribution

The t-distribution with n degrees of freedom is defined as the ratio of two independent random variables. The numerator has the standard Normal distribution and the denominator is the square root of a central chi-squared random variable divided by its degrees of freedom; that is,

    T = \frac{Z}{(X^2 / n)^{1/2}}    (1.6)

where Z ~ N(0, 1), X² ~ χ²(n), and Z and X² are independent. This is denoted by T ~ t(n).

1.4.4 F-distribution

1. The central F-distribution with n and m degrees of freedom is defined as the ratio of two independent central chi-squared random variables, each divided by its degrees of freedom,

    F = \frac{X_1^2}{n} \bigg/ \frac{X_2^2}{m}    (1.7)

where X_1² ~ χ²(n), X_2² ~ χ²(m), and X_1² and X_2² are independent. This is denoted by F ~ F(n, m).

2. The relationship between the t-distribution and the F-distribution can be derived by squaring the terms in equation (1.6) and using definition (1.7) to obtain

    T^2 = \frac{Z^2}{1} \bigg/ \frac{X^2}{n} \sim F(1, n);    (1.8)

that is, the square of a random variable with the t-distribution t(n) has the F-distribution F(1, n). (The sketch after this list illustrates this numerically.)

3. The non-central F-distribution is defined as the ratio of two independent random variables, each divided by its degrees of freedom, where the numerator has a non-central chi-squared distribution and the denominator has a central chi-squared distribution, i.e.,

    F = \frac{X_1^2}{n} \bigg/ \frac{X_2^2}{m}

where X_1² ~ χ²(n, λ) with λ = µ^T V⁻¹ µ, X_2² ~ χ²(m), and X_1² and X_2² are independent. The mean of a non-central F-distribution is larger than the mean of a central F-distribution with the same degrees of freedom.
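Relationship (1.8) can be illustrated by building T from definition (1.6) and F from definition (1.7), then comparing quantiles of T² with those of F(1, n). A minimal sketch, assuming numpy; n is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 8, 200_000

Z = rng.standard_normal(N)
T = Z / np.sqrt(rng.chisquare(n, N) / n)                   # definition (1.6)
F = (rng.chisquare(1, N) / 1) / (rng.chisquare(n, N) / n)  # definition (1.7)

# The sample quantiles of T^2 and of F(1, n) should agree closely.
qs = [0.50, 0.90, 0.95, 0.99]
print(np.quantile(T**2, qs))
print(np.quantile(F, qs))
```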

1.5 Quadratic forms

1. A quadratic form is a polynomial expression in which each term has degree 2. Thus y_1² + y_2² and 2y_1² + y_2² + 3y_1 y_2 are quadratic forms in y_1 and y_2, but y_1² + y_2² + 2y_1 and y_1² + 3y_2 are not.

2. Let A be a symmetric matrix

    A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}

where a_{ij} = a_{ji}; then the expression y^T A y = Σ_i Σ_j a_{ij} y_i y_j is a quadratic form in the y_i's. The expression (y − µ)^T V⁻¹ (y − µ) is a quadratic form in the terms (y_i − µ_i), but not in the y_i's.

3. The quadratic form y^T A y and the matrix A are said to be positive definite if y^T A y > 0 whenever the elements of y are not all zero. A necessary and sufficient condition for positive definiteness is that all the determinants

    A_1 = a_{11}, \quad A_2 = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}, \quad A_3 = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}, \quad \ldots, \quad A_n = \det A

are all positive.

4. The rank of the matrix A is also called the degrees of freedom of the quadratic form Q = y^T A y.

5. Suppose Y_1, ..., Y_n are independent random variables each with the Normal distribution N(0, σ²). Let Q = Σ Y_i² and let Q_1, ..., Q_k be quadratic forms in the Y_i's such that

    Q = Q_1 + Q_2 + \cdots + Q_k

where Q_i has m_i degrees of freedom (i = 1, ..., k). Then Q_1, ..., Q_k are independent random variables with Q_1/σ² ~ χ²(m_1), Q_2/σ² ~ χ²(m_2), ..., Q_k/σ² ~ χ²(m_k) if and only if m_1 + m_2 + ... + m_k = n. This is Cochran's theorem; for a proof see, for example, Hogg and Craig (1995). A similar result holds for non-central distributions; see Chapter 3 of Rao (1973). (A simulation illustrating the theorem follows this list.)

6. A consequence of Cochran's theorem is that the difference of two independent random variables, X_1² ~ χ²(m) and X_2² ~ χ²(k),

    X^2 = X_1^2 - X_2^2 \sim \chi^2(m - k),

also has a chi-squared distribution, provided that X² ≥ 0 and m > k.
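The case k = 2 of Cochran's theorem underlies the familiar split of a total sum of squares into a part about the sample mean (n − 1 degrees of freedom) and a part due to the mean itself (1 degree of freedom). A simulation sketch, assuming numpy; n and σ² are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 10, 4.0
Y = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))

Ybar = Y.mean(axis=1, keepdims=True)
Q1 = ((Y - Ybar)**2).sum(axis=1)     # about the mean: n - 1 df
Q2 = n * Ybar[:, 0]**2               # due to the mean: 1 df; Q = Q1 + Q2

print((Q1 / sigma2).mean(), n - 1)   # chi2(n-1) has mean n - 1
print((Q2 / sigma2).mean(), 1.0)     # chi2(1) has mean 1
print(np.corrcoef(Q1, Q2)[0, 1])     # near 0, consistent with independence
```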

1.6 Estimation

1.6.1 Maximum likelihood estimation

Let y = [Y_1, ..., Y_n]^T denote a random vector and let the joint probability density function of the Y_i's be f(y; θ), which depends on the vector of parameters θ = [θ_1, ..., θ_p]^T.

The likelihood function L(θ; y) is algebraically the same as the joint probability density function f(y; θ), but the change in notation reflects a shift of emphasis from the random variables y, with θ fixed, to the parameters θ, with y fixed. Since L is defined in terms of the random vector y, it is itself a random variable. Let Ω denote the set of all possible values of the parameter vector θ; Ω is called the parameter space. The maximum likelihood estimator of θ is the value θ̂ which maximizes the likelihood function, that is,

    L(\hat{\theta}; y) \geq L(\theta; y) for all θ in Ω.

Equivalently, θ̂ is the value which maximizes the log-likelihood function l(θ; y) = log L(θ; y), since the logarithmic function is monotonic. Thus

    l(\hat{\theta}; y) \geq l(\theta; y) for all θ in Ω.

Often it is easier to work with the log-likelihood function than with the likelihood function itself.

Usually the estimator θ̂ is obtained by differentiating the log-likelihood function with respect to each element θ_j of θ and solving the simultaneous equations

    \frac{\partial l(\theta; y)}{\partial \theta_j} = 0 for j = 1, ..., p.    (1.9)

It is necessary to check that the solutions do correspond to maxima of l(θ; y) by verifying that the matrix of second derivatives

    \frac{\partial^2 l(\theta; y)}{\partial \theta_j \, \partial \theta_k}

evaluated at θ = θ̂ is negative definite. For example, if θ has only one element θ, this means it is necessary to check that

    \left[ \frac{\partial^2 l(\theta; y)}{\partial \theta^2} \right]_{\theta = \hat{\theta}} < 0.

It is also necessary to check whether there are any values of θ at the edges of the parameter space Ω that give local maxima of l(θ; y). When all local maxima have been identified, the value of θ̂ corresponding to the largest one is the maximum likelihood estimator. (For most of the models considered in this book there is only one maximum and it corresponds to the solution of the equations ∂l/∂θ_j = 0, j = 1, ..., p.)

An important property of maximum likelihood estimators is that if g(θ) is any function of the parameters θ, then the maximum likelihood estimator of g(θ) is g(θ̂). This follows from the definition of θ̂. It is sometimes called the invariance property of maximum likelihood estimators. A consequence is that we can work with a function of the parameters that is convenient for maximum likelihood estimation and then use the invariance property to obtain maximum likelihood estimates for the required parameters.

In principle, it is not necessary to be able to find the derivatives of the likelihood or log-likelihood functions or to solve equation (1.9) if θ̂ can be found numerically. In practice, numerical approximations are very important for generalized linear models.

Other properties of maximum likelihood estimators include consistency, sufficiency, asymptotic efficiency and asymptotic normality. These are discussed in books such as Cox and Hinkley (1974) or Kalbfleisch (1985, Chapters 1 and 2).
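The numerical approach mentioned above can be sketched as follows, assuming Python with numpy and scipy; the Normal sample and starting values are arbitrary illustrations, not an example from the text. The sketch also uses the invariance property: the optimization is done in (µ, log σ), and σ̂ is recovered by exponentiating.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=200)   # hypothetical Normal sample

def negloglik(theta):
    # Negative Normal log-likelihood, up to an additive constant,
    # parameterized by (mu, log sigma) so the search is unconstrained.
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(((y - mu) / sigma)**2) + y.size * log_sigma

res = minimize(negloglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))    # close to y.mean() and y.std()
```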

1.6.2 Example: Poisson distribution

Let Y_1, ..., Y_n be independent random variables, each with the Poisson distribution

    f(y_i; \theta) = \frac{\theta^{y_i} e^{-\theta}}{y_i!}, \quad y_i = 0, 1, 2, \ldots

with the same parameter θ. Their joint distribution is

    f(y_1, \ldots, y_n; \theta) = \prod_{i=1}^{n} f(y_i; \theta) = \frac{\theta^{y_1} e^{-\theta}}{y_1!} \cdot \frac{\theta^{y_2} e^{-\theta}}{y_2!} \cdots \frac{\theta^{y_n} e^{-\theta}}{y_n!} = \frac{\theta^{\sum y_i} e^{-n\theta}}{y_1! \, y_2! \cdots y_n!}.

This is also the likelihood function L(θ; y_1, ..., y_n). It is easier to use the log-likelihood function

    l(\theta; y_1, \ldots, y_n) = \log L(\theta; y_1, \ldots, y_n) = \left( \sum y_i \right) \log \theta - n\theta - \sum \log y_i!.

To find the maximum likelihood estimate θ̂, use

    \frac{dl}{d\theta} = \frac{1}{\theta} \sum y_i - n.

Equate this to zero to obtain the solution θ̂ = Σ y_i / n = ȳ. Since d²l/dθ² = −Σ y_i/θ² < 0, l has its maximum value when θ = θ̂, confirming that ȳ is the maximum likelihood estimate. (A quick numerical check follows.)
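A minimal numerical check of this result, assuming numpy; the counts are hypothetical.

```python
import numpy as np

y = np.array([3, 1, 4, 2, 0, 5, 2])      # hypothetical Poisson counts
theta_hat = y.mean()                      # the MLE, ybar

score = y.sum() / theta_hat - y.size      # dl/dtheta: zero at theta_hat
d2l = -y.sum() / theta_hat**2             # second derivative: negative, so a maximum
print(theta_hat, score, d2l)
```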

1.6.3 Least Squares Estimation

Let Y_1, ..., Y_n be independent random variables with expected values µ_1, ..., µ_n, respectively. Suppose that the µ_i's are functions of the parameter vector that we want to estimate, β = [β_1, ..., β_p]^T, with p < n. Thus

    E(Y_i) = \mu_i(\beta).

The simplest form of the method of least squares consists of finding the estimator β̂ that minimizes the sum of squares of the differences between the Y_i's and their expected values,

    S = \sum [Y_i - \mu_i(\beta)]^2.

Usually β̂ is obtained by differentiating S with respect to each element β_j of β and solving the simultaneous equations

    \frac{\partial S}{\partial \beta_j} = 0, \quad j = 1, \ldots, p.

Of course it is necessary to check that the solutions correspond to minima (i.e., that the matrix of second derivatives is positive definite) and to identify the global minimum from among these solutions and any local minima at the boundary of the parameter space.

Now suppose that the Y_i's have variances σ_i² that are not all equal. Then it may be desirable to minimize the weighted sum of squared differences

    S = \sum w_i [Y_i - \mu_i(\beta)]^2,

where the weights are w_i = (σ_i²)⁻¹. In this way, the observations that are less reliable (that is, the Y_i's with the larger variances) have less influence on the estimates.

More generally, let y = [Y_1, ..., Y_n]^T denote a random vector with mean vector µ = [µ_1, ..., µ_n]^T and variance-covariance matrix V. Then the weighted least squares estimator is obtained by minimizing

    S = (y - \mu)^T V^{-1} (y - \mu).

(A sketch of this criterion in the linear case is given below.)
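For the linear case µ = Xβ, minimizing S = (y − Xβ)^T V⁻¹ (y − Xβ) has the standard closed-form generalized least squares solution β̂ = (X^T V⁻¹ X)⁻¹ X^T V⁻¹ y; this is a well-known result rather than something derived in the text above. A minimal sketch, assuming numpy and simulated data.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical straight-line mean mu_i = beta0 + beta1 * x_i with
# unequal variances, so the weights w_i = 1/sigma_i^2 matter.
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])
sigma2 = 0.5 + 0.3 * x                   # larger variance -> smaller weight
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2))

Vinv = np.diag(1.0 / sigma2)             # V is diagonal in this example
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(beta_hat)                          # close to the true [1.0, 2.0]
```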

1.6.4 Comments on estimation

1. An important distinction between the methods of maximum likelihood and least squares is that the method of least squares can be used without making assumptions about the distributions of the response variables Y_i beyond specifying their expected values and possibly their variance-covariance structure. In contrast, to obtain maximum likelihood estimators we need to specify the joint probability distribution of the Y_i's.

2. For many situations maximum likelihood and least squares estimators are identical.

3. Often numerical methods, rather than calculus, may be needed to obtain parameter estimates that maximize the likelihood or log-likelihood function or minimize the sum of squares. The following example illustrates this approach.

1.6.5 Example: Tropical cyclones

Table 1.2 shows the number of tropical cyclones in Northeastern Australia for thirteen successive seasons, a period of fairly consistent conditions for the definition and tracking of cyclones (Dobson and Stewart, 1974).

Table 1.2 Numbers of tropical cyclones in 13 successive seasons.

    Season:           1  2  3  4  5  6   7  8  9  10  11  12  13
    No. of cyclones:  6  5  4  6  6  3  12  7  4   2   6   7   4

Let Y_i denote the number of cyclones in season i, where i = 1, ..., 13. Suppose the Y_i's are independent random variables with the Poisson distribution with parameter θ. From Example 1.6.2, θ̂ = ȳ = 72/13 = 5.538.

An alternative approach would be to find numerically the value of θ that maximizes the log-likelihood function. The component of the log-likelihood function due to y_i is

    l_i = y_i \log \theta - \theta - \log y_i!.

The log-likelihood function is the sum of these terms,

    l = \sum_{i=1}^{13} l_i = \sum_{i=1}^{13} (y_i \log \theta - \theta - \log y_i!).

Only the first two terms in the brackets involve θ and so are relevant to the optimization calculation, because the term Σ log y_i! is a constant. To plot the log-likelihood function (without the constant term) against θ, calculate (y_i log θ − θ) for each y_i at various values of θ and add the results to obtain

    l^* = \sum (y_i \log \theta - \theta).

[Figure 1.1: graph showing the location of the maximum likelihood estimate for the data in Table 1.2 on tropical cyclones.]

Figure 1.1 shows l* plotted against θ. Clearly the maximum value is between θ = 5 and θ = 6. This can provide a starting point for an iterative procedure for obtaining θ̂. The results of a simple bisection calculation are shown in Table 1.3. The function l* is first calculated for the approximations θ^(1) = 5 and θ^(2) = 6. Then each subsequent approximation θ^(k), for k = 3, 4, ..., is the average of the two previous estimates of θ with the largest values of l* (for example, θ^(6) = ½(θ^(5) + θ^(3))). After seven steps this process gives θ̂ ≈ 5.54, which is correct to two decimal places. (A sketch of this calculation follows.)

Table 1.3 Successive approximations θ^(k), with the corresponding values of l*, for the maximum likelihood estimate of the mean number of cyclones per season. (Numerical entries not reproduced here.)
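A sketch of the bisection scheme just described, assuming numpy and using the counts from Table 1.2.

```python
import numpy as np

# Counts from Table 1.2 (13 seasons, 72 cyclones in total).
y = np.array([6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4])

def l_star(theta):
    # Log-likelihood without the constant term sum(log y_i!).
    return np.sum(y * np.log(theta) - theta)

approx = [5.0, 6.0]                      # theta(1) and theta(2)
for _ in range(7):
    best_two = sorted(approx, key=l_star)[-2:]
    approx.append(sum(best_two) / 2.0)   # average the two best so far

print(approx[-1])                        # approaches theta_hat = 72/13 = 5.54
```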

1.7 Exercises

1.1 Let Y_1 and Y_2 be independent random variables with Y_1 ~ N(1, 3) and Y_2 ~ N(2, 5). If W_1 = Y_1 + 2Y_2 and W_2 = 4Y_1 − Y_2, what is the joint distribution of W_1 and W_2?

1.2 Let Y_1 and Y_2 be independent random variables with Y_1 ~ N(0, 1) and Y_2 ~ N(3, 4).
(a) What is the distribution of Y_1²?
(b) If y = [Y_1, (Y_2 − 3)/2]^T, obtain an expression for y^T y. What is its distribution?
(c) If y = [Y_1, Y_2]^T and its distribution is y ~ N(µ, V), obtain an expression for y^T V⁻¹ y. What is its distribution?

1.3 Let the joint distribution of Y_1 and Y_2 be N(µ, V), with a given 2 × 1 mean vector µ and 2 × 2 variance-covariance matrix V (numerical entries not reproduced here).
(a) Obtain an expression for (y − µ)^T V⁻¹ (y − µ). What is its distribution?
(b) Obtain an expression for y^T V⁻¹ y. What is its distribution?

1.4 Let Y_1, ..., Y_n be independent random variables, each with the distribution N(µ, σ²). Let

    \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i \quad and \quad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2.

(a) What is the distribution of Ȳ?
(b) Show that S² = (1/(n−1)) [ Σ(Y_i − µ)² − n(Ȳ − µ)² ].
(c) From (b) it follows that Σ(Y_i − µ)²/σ² = (n−1)S²/σ² + n(Ȳ − µ)²/σ². How does this allow you to deduce that Ȳ and S² are independent?
(d) What is the distribution of (n−1)S²/σ²?
(e) What is the distribution of (Ȳ − µ)/(S/√n)?

1.5 This exercise is a continuation of the example in Section 1.6.2, in which Y_1, ..., Y_n are independent Poisson random variables with the parameter θ.
(a) Show that E(Y_i) = θ for i = 1, ..., n.
(b) Suppose θ = e^β. Find the maximum likelihood estimator of β.
(c) Minimize S = Σ(Y_i − e^β)² to obtain a least squares estimator of β.

1.6 The data in Table 1.4 are the numbers of females and males in the progeny of 16 female light brown apple moths in Muswellbrook, New South Wales, Australia (from Lewis, 1987).

Table 1.4 Progeny of light brown apple moths: progeny group, number of females, number of males. (Numerical entries not reproduced here.)

(a) Calculate the proportion of females in each of the 16 groups of progeny.
(b) Let Y_i denote the number of females and n_i the number of progeny in each group, i = 1, ..., 16. Suppose the Y_i's are independent random variables, each with the binomial distribution

    f(y_i; \theta) = \binom{n_i}{y_i} \theta^{y_i} (1 - \theta)^{n_i - y_i}.

Find the maximum likelihood estimator of θ using calculus and evaluate it for these data.
(c) Use a numerical method to estimate θ̂ and compare the answer with the one from (b).
