Lecture Stat 461-561: Information Criterion
Arnaud Doucet, February 2008
Review of Maximum Likelihood Approach

We have data $X_i$ i.i.d. $\sim g(x)$. We model the distribution of these data by a parametric model $\{f(x\mid\theta);\ \theta\in\Theta\subseteq\mathbb{R}^p\}$. We estimate $\theta$ by maximum likelihood; that is,
$$ \widehat{\theta} = \arg\max_{\theta\in\Theta} l(\theta), \quad\text{where}\quad l(\theta) := \sum_{k=1}^n \log f(X_k\mid\theta). $$
We know that as $n\to\infty$, $\widehat{\theta}$ converges towards
$$ \theta^* = \arg\min_{\theta\in\Theta} \mathrm{KL}(g(x); f(x\mid\theta)), $$
where
$$ \mathrm{KL}(g(x); f(x\mid\theta)) := \int g(x)\log\frac{g(x)}{f(x\mid\theta)}\,dx. $$
A Central Limit Theorem

Assuming the MLE is consistent, a Taylor expansion of the score around $\theta^*$ gives
$$ 0 = \left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\widehat{\theta}} = \left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\theta^*} + \left.\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*}\left(\widehat{\theta}-\theta^*\right) + \cdots $$
The law of large numbers yields
$$ -\frac{1}{n}\left.\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*} \to J(\theta^*) := -\int g(x)\left.\frac{\partial^2\log f(x\mid\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*}dx. $$
The CLT yields
$$ \frac{1}{\sqrt{n}}\left.\frac{\partial l(\theta)}{\partial\theta}\right|_{\theta^*} \xrightarrow{D} \mathcal{N}(0, I(\theta^*)), $$
where
$$ I(\theta^*) = \int g(x)\left.\frac{\partial\log f(x\mid\theta)}{\partial\theta}\,\frac{\partial\log f(x\mid\theta)}{\partial\theta^T}\right|_{\theta^*}dx. $$
Using Slutsky's theorem, we have
$$ \sqrt{n}\left(\widehat{\theta}-\theta^*\right) \xrightarrow{D} \mathcal{N}\left(0,\ J^{-1}(\theta^*)\, I(\theta^*)\, J^{-1}(\theta^*)\right). $$
This is sometimes known as the sandwich variance. When $g(x) = f(x\mid\theta^*)$, we have $J(\theta^*) = I(\theta^*)$ and the standard CLT follows:
$$ \sqrt{n}\left(\widehat{\theta}-\theta^*\right) \xrightarrow{D} \mathcal{N}\left(0,\ I^{-1}(\theta^*)\right). $$
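As a quick numerical illustration of the sandwich variance (a toy example of my own, not from the lecture): fit a Normal$(\theta, 1)$ model by ML, so $\widehat{\theta}$ is the sample mean, to data that are actually Exponential with mean $2$. The score is $x - \theta$, so $I(\theta^*) = \mathrm{Var}_g(X) = 4$ while $J(\theta^*) = 1$: the sandwich variance $J^{-1} I J^{-1} = 4$, whereas the naive $I^{-1}(\theta^*) = 1$.

```python
import numpy as np

# Toy sketch (assumed setup): empirical variance of sqrt(n)(theta_hat - theta*)
# under model misspecification matches the sandwich variance (4), not the
# naive variance (1).
rng = np.random.default_rng(0)
n, reps = 2_000, 2_000
theta_star = 2.0  # pseudo-true parameter = E_g[X]

# theta_hat = sample mean in each of `reps` independent replications
theta_hat = rng.exponential(scale=theta_star, size=(reps, n)).mean(axis=1)
empirical_var = np.var(np.sqrt(n) * (theta_hat - theta_star))
print(empirical_var)  # close to the sandwich variance 4, far from 1
```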
Evaluating the Statistical Model

We want to compare our model $f(x\mid\widehat{\theta})$ to $g(x)$. One way consists of evaluating
$$ \mathrm{KL}\left(g(x); f(x\mid\widehat{\theta})\right) = \int g(x)\log g(x)\,dx - \int g(x)\log f(x\mid\widehat{\theta})\,dx, $$
and the larger $\int g(x)\log f(x\mid\widehat{\theta})\,dx$ is, the closer the model is to the true one. The crucial issue is to obtain an estimate of
$$ \mathrm{E}_g\left[\log f(X\mid\widehat{\theta})\right] = \int g(x)\log f(x\mid\widehat{\theta})\,dx. $$
Clearly an estimate of $\mathrm{E}_g[\log f(X\mid\widehat{\theta})]$ is $n^{-1} l(\widehat{\theta})$, and an estimate of $n\,\mathrm{E}_g[\log f(X\mid\widehat{\theta})]$ is $l(\widehat{\theta})$.
Selecting a Model

In practice, it is difficult to precisely capture the true structure of a given phenomenon from a limited number of data. Often several candidate statistical models are considered and we want to select the one that most closely approximates the true distribution of the data. Given $m$ candidate models $\{f_i(x\mid\theta_i);\ i=1,\ldots,m\}$ and the associated ML estimates $\widehat{\theta}_i$, a simple solution would consist of selecting the model $j$ where
$$ j = \arg\max_{i\in\{1,\ldots,m\}} l_i(\widehat{\theta}_i); $$
i.e. picking the model for which $\mathrm{KL}(g(x); f_i(x\mid\widehat{\theta}_i))$ appears smallest.
Be Careful!

This approach does not provide a fair comparison of models. The quantity $l_i(\widehat{\theta}_i)$ contains a bias as an estimator of $n\,\mathrm{E}_g[\log f_i(X\mid\widehat{\theta}_i)]$. The reason is that the data $X_1,\ldots,X_n$ are used twice: to estimate $\widehat{\theta}_i$ and to approximate the expectation with respect to $g$. We will show that the size of the bias depends on the dimension $p$ of the parameter $\theta$.
Relationship between Log-likelihood and Expected Log-likelihood

Let us denote by $\theta_i^* = \arg\min_{\theta_i} \mathrm{KL}(g(x); f_i(x\mid\theta_i))$ the true parameter. We necessarily have
$$ \mathrm{E}_g\left[\log f_i(X\mid\widehat{\theta}_i)\right] \leq \mathrm{E}_g\left[\log f_i(X\mid\theta_i^*)\right], $$
whereas $l_i(\widehat{\theta}_i) \geq l_i(\theta_i^*)$. To correct for this bias, we want to estimate
$$ b_i(g) = \mathrm{E}_g\left[l_i(\widehat{\theta}_i) - n\,\mathrm{E}_g\left[\log f_i(X\mid\widehat{\theta}_i)\right]\right], $$
where, remember,
$$ l_i(\widehat{\theta}_i) = \sum_{k=1}^n \log f_i(X_k\mid\widehat{\theta}_i) \quad\text{and}\quad \widehat{\theta}_i = \widehat{\theta}_i(X_{1:n}). $$
Information Criterion

The information criterion for model $i$ is defined as
$$ \mathrm{IC}(i) = -2\sum_{k=1}^n \log f_i(X_k\mid\widehat{\theta}_i) + 2\,\{\text{estimator of } b_i(g)\}. $$
$\mathrm{IC}(i)$ is $-2$ times a bias-corrected estimate of the expected log-likelihood $n\,\mathrm{E}_g[\log f_i(X\mid\widehat{\theta}_i)]$. So, given a collection of models, we will select the model
$$ j = \arg\min_{i\in\{1,\ldots,m\}} \mathrm{IC}(i). $$
Derivation of the Bias

We want to estimate the bias $b(g)$, which is given by
$$ b(g) = \mathrm{E}_g\left[l(\widehat{\theta}(X_{1:n})) - n\,\mathrm{E}_g\left[\log f(X\mid\widehat{\theta}(X_{1:n}))\right]\right] $$
$$ = \underbrace{\mathrm{E}_g\left[l(\widehat{\theta}(X_{1:n})) - l(\theta^*)\right]}_{D_1} + \underbrace{\mathrm{E}_g\left[l(\theta^*) - n\,\mathrm{E}_g[\log f(X\mid\theta^*)]\right]}_{D_2} + \underbrace{\mathrm{E}_g\left[n\,\mathrm{E}_g[\log f(X\mid\theta^*)] - n\,\mathrm{E}_g\left[\log f(X\mid\widehat{\theta}(X_{1:n}))\right]\right]}_{D_3}. $$
Calculation of the Second Term

We have
$$ D_2 = \mathrm{E}_g\left[l(\theta^*) - n\,\mathrm{E}_g[\log f(X\mid\theta^*)]\right] = \mathrm{E}_g\left[\sum_{k=1}^n \log f(X_k\mid\theta^*)\right] - n\,\mathrm{E}_g[\log f(X\mid\theta^*)] = 0. $$
No bias for this term.
Calculation of the Third Term

Let us denote $\eta(\theta) := \mathrm{E}_g[\log f(X\mid\theta)]$. By performing a Taylor expansion around $\theta^*$, we obtain
$$ \eta(\widehat{\theta}) = \eta(\theta^*) + \left.\frac{\partial\eta(\theta)}{\partial\theta}\right|_{\theta^*}^T\left(\widehat{\theta}-\theta^*\right) + \frac{1}{2}\left(\widehat{\theta}-\theta^*\right)^T\left.\frac{\partial^2\eta(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*}\left(\widehat{\theta}-\theta^*\right) + \cdots $$
As $\left.\partial\eta(\theta)/\partial\theta\right|_{\theta^*} = 0$, then
$$ \eta(\widehat{\theta}) - \eta(\theta^*) \approx -\frac{1}{2}\left(\widehat{\theta}-\theta^*\right)^T J(\theta^*)\left(\widehat{\theta}-\theta^*\right), $$
where
$$ J(\theta^*) = -\left.\frac{\partial^2\eta(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*} = -\int g(x)\left.\frac{\partial^2\log f(x\mid\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta^*}dx. $$
It follows that
$$ D_3 = \mathrm{E}_g\left[n\left(\eta(\theta^*) - \eta(\widehat{\theta})\right)\right] \approx \frac{n}{2}\,\mathrm{E}_g\left[\left(\widehat{\theta}-\theta^*\right)^T J(\theta^*)\left(\widehat{\theta}-\theta^*\right)\right] $$
$$ = \frac{n}{2}\,\mathrm{E}_g\left[\mathrm{tr}\left\{J(\theta^*)\left(\widehat{\theta}-\theta^*\right)\left(\widehat{\theta}-\theta^*\right)^T\right\}\right] = \frac{n}{2}\,\mathrm{tr}\left\{J(\theta^*)\,\mathrm{E}_g\left[\left(\widehat{\theta}-\theta^*\right)\left(\widehat{\theta}-\theta^*\right)^T\right]\right\}. $$
From the CLT established previously, we have
$$ \mathrm{E}_g\left[\left(\widehat{\theta}-\theta^*\right)\left(\widehat{\theta}-\theta^*\right)^T\right] \approx \frac{1}{n}\,J(\theta^*)^{-1} I(\theta^*)\, J(\theta^*)^{-1}, $$
so
$$ D_3 \approx \frac{1}{2}\,\mathrm{tr}\left\{I(\theta^*)\, J(\theta^*)^{-1}\right\}. $$
Calculation of the First Term

We have the Taylor expansion around $\widehat{\theta}$
$$ l(\theta^*) = l(\widehat{\theta}) + \left.\frac{\partial l(\theta)}{\partial\theta^T}\right|_{\widehat{\theta}}\left(\theta^*-\widehat{\theta}\right) + \frac{1}{2}\left(\theta^*-\widehat{\theta}\right)^T\left.\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\widehat{\theta}}\left(\theta^*-\widehat{\theta}\right) + \cdots $$
As the score vanishes at $\widehat{\theta}$,
$$ l(\theta^*) - l(\widehat{\theta}) \approx -\frac{n}{2}\left(\theta^*-\widehat{\theta}\right)^T J(\theta^*)\left(\theta^*-\widehat{\theta}\right), $$
so
$$ D_1 = \mathrm{E}_g\left[l(\widehat{\theta}(X_{1:n})) - l(\theta^*)\right] \approx \frac{n}{2}\,\mathrm{E}_g\left[\left(\theta^*-\widehat{\theta}\right)^T J(\theta^*)\left(\theta^*-\widehat{\theta}\right)\right] \approx \frac{1}{2}\,\mathrm{tr}\left\{I(\theta^*)\, J(\theta^*)^{-1}\right\}. $$
Estimate of the Bias

We have
$$ b(g) = D_1 + D_2 + D_3 \approx \frac{1}{2}\,\mathrm{tr}\left\{I(\theta^*)\, J(\theta^*)^{-1}\right\} + 0 + \frac{1}{2}\,\mathrm{tr}\left\{I(\theta^*)\, J(\theta^*)^{-1}\right\} = \mathrm{tr}\left\{I(\theta^*)\, J(\theta^*)^{-1}\right\}. $$
Let $\widehat{I}$ and $\widehat{J}$ be consistent estimates of $I(\theta^*)$ and $J(\theta^*)$, say
$$ \widehat{I} = \frac{1}{n}\sum_{k=1}^n \left.\frac{\partial\log f(X_k\mid\theta)}{\partial\theta}\,\frac{\partial\log f(X_k\mid\theta)}{\partial\theta^T}\right|_{\widehat{\theta}}, \qquad \widehat{J} = -\frac{1}{n}\sum_{k=1}^n \left.\frac{\partial^2\log f(X_k\mid\theta)}{\partial\theta\,\partial\theta^T}\right|_{\widehat{\theta}}; $$
then we can estimate the bias through
$$ \widehat{b}(g) = \mathrm{tr}\left\{\widehat{I}\,\widehat{J}^{-1}\right\}. $$
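As a sketch of how $\widehat{b}(g)$ can be computed in practice (a toy example of my own): for the one-parameter exponential model $f(x\mid\theta) = \theta^{-1} e^{-x/\theta}$, the score is $(x-\theta)/\theta^2$ and the second derivative is $1/\theta^2 - 2x/\theta^3$, so $\widehat{I}$ and $\widehat{J}$ have closed forms at $\widehat{\theta} = \bar{X}$. When the model is well specified, $\mathrm{tr}\{\widehat{I}\,\widehat{J}^{-1}\}$ should be close to $p = 1$.

```python
import numpy as np

# Toy sketch (assumed example): bias estimate tr(I_hat J_hat^-1) for the
# exponential model f(x|theta) = exp(-x/theta)/theta.
#   d log f / d theta      = (x - theta) / theta^2
#   d^2 log f / d theta^2  = 1/theta^2 - 2x/theta^3
rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=100_000)

theta_hat = x.mean()                                       # MLE
score = (x - theta_hat) / theta_hat**2
I_hat = np.mean(score**2)                                  # outer-product estimate
J_hat = -np.mean(1 / theta_hat**2 - 2 * x / theta_hat**3)  # minus mean Hessian
bias_hat = I_hat / J_hat                                   # tr(I J^-1); p = 1 here
print(bias_hat)  # close to 1 since the model is well specified
```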
Information Criterion

We have, for the information criterion for model $i$,
$$ \mathrm{IC}(i) = -2\sum_{k=1}^n \log f_i(X_k\mid\widehat{\theta}_i) + 2\,\mathrm{tr}\left\{\widehat{I}_i\,\widehat{J}_i^{-1}\right\}. $$
Assuming $g(x) = f_i(x\mid\theta_i^*)$, then $I_i(\theta_i^*) = J_i(\theta_i^*)$, so $b_i(g) = p_i$ (where $p_i$ is the dimension of $\theta_i$) and in this case
$$ \mathrm{AIC}(i) = -2\sum_{k=1}^n \log f_i(X_k\mid\widehat{\theta}_i) + 2p_i. $$
Checking the Equality of Two Discrete Distributions

Assume we have two sets of data, each having $k$ categories:

Category    1     2     ...   k
Data set 1  n_1   n_2   ...   n_k
Data set 2  m_1   m_2   ...   m_k

where we have $n_1+\cdots+n_k = n$ observations for the first dataset and $m_1+\cdots+m_k = m$ for the second. We assume these datasets follow the multinomial distributions
$$ p(n_1,\ldots,n_k\mid p_1,\ldots,p_k) = \frac{n!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}, \qquad p(m_1,\ldots,m_k\mid q_1,\ldots,q_k) = \frac{m!}{m_1!\cdots m_k!}\,q_1^{m_1}\cdots q_k^{m_k}. $$
We want to check whether $(p_1,\ldots,p_k) \neq (q_1,\ldots,q_k)$ or $(p_1,\ldots,p_k) = (q_1,\ldots,q_k)$.
Assume the two distributions are different. Then
$$ l(p_{1:k}, q_{1:k}) = \log n! + \log m! - \sum_{j=1}^k \left(\log n_j! + \log m_j!\right) + \sum_{j=1}^k \left(n_j\log p_j + m_j\log q_j\right). $$
We have the MLEs
$$ \widehat{p}_j = \frac{n_j}{n}, \qquad \widehat{q}_j = \frac{m_j}{m}. $$
So
$$ l(\widehat{p}_{1:k}, \widehat{q}_{1:k}) = C + \sum_{j=1}^k \left( n_j\log\frac{n_j}{n} + m_j\log\frac{m_j}{m} \right), $$
with $C = \log n! + \log m! - \sum_{j=1}^k (\log n_j! + \log m_j!)$, and
$$ \mathrm{AIC} = -2\left\{ C + \sum_{j=1}^k \left( n_j\log\frac{n_j}{n} + m_j\log\frac{m_j}{m} \right) \right\} + 4(k-1). $$
If we assume the two distributions are equal, then
$$ l(r_{1:k}) = C + \sum_{j=1}^k (n_j+m_j)\log r_j, $$
and the MLE is
$$ \widehat{r}_j = \frac{n_j+m_j}{n+m}. $$
We have
$$ l(\widehat{r}_{1:k}) = C + \sum_{j=1}^k (n_j+m_j)\log\frac{n_j+m_j}{n+m}, $$
and
$$ \mathrm{AIC} = -2\left\{ C + \sum_{j=1}^k (n_j+m_j)\log\frac{n_j+m_j}{n+m} \right\} + 2(k-1). $$
Example

Consider the following data:

Category    1    2    3    4   5
Data set 1  304  800  400  57  323
Data set 2  174  509  362  80  214

From this table we can deduce

Category           1     2     3     4     5
$\widehat{p}_j$    0.16  0.42  0.21  0.03  0.17
$\widehat{q}_j$    0.13  0.38  0.27  0.06  0.16
$\widehat{r}_j$    0.15  0.41  0.24  0.04  0.17
Ignoring the constant $C$, the maximum log-likelihoods of the models are
$$ \text{Model 1:}\quad \sum_{j=1}^k \left( n_j\log\frac{n_j}{n} + m_j\log\frac{m_j}{m} \right) = -4567.36, $$
$$ \text{Model 2:}\quad \sum_{j=1}^k (n_j+m_j)\log\frac{n_j+m_j}{n+m} = -4585.61. $$
The number of free parameters is $2(k-1) = 8$ in Model 1 and $k-1 = 4$ in Model 2. It follows that the AIC values for Models 1 and 2 are respectively $9150.73$ and $9179.22$, so AIC selects Model 1.
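The numbers above can be checked directly; a minimal sketch (constants $C$ ignored, as on the slide):

```python
import numpy as np

# Reproduce the two-sample multinomial AIC comparison from the example.
n_counts = np.array([304, 800, 400, 57, 323])
m_counts = np.array([174, 509, 362, 80, 214])
n, m, k = n_counts.sum(), m_counts.sum(), len(n_counts)

# Model 1: the distributions differ; 2(k-1) free parameters
ll1 = (n_counts * np.log(n_counts / n)).sum() + (m_counts * np.log(m_counts / m)).sum()
aic1 = -2 * ll1 + 2 * 2 * (k - 1)

# Model 2: a common distribution; k-1 free parameters
c = n_counts + m_counts
ll2 = (c * np.log(c / (n + m))).sum()
aic2 = -2 * ll2 + 2 * (k - 1)

print(round(ll1, 2), round(ll2, 2))    # -4567.36, -4585.61 as on the slide
print(aic1 < aic2)                     # AIC selects Model 1
```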
Determining the Number of Bins of a Histogram

Histograms are used for representing the properties of a set of observations obtained from either a discrete or continuous distribution. Assume we have a histogram $\{n_1, n_2, \ldots, n_k\}$, where $k$ is referred to as the number of bins. If $k$ is too large or too small, then it is difficult to capture the characteristics of the true distribution. We can think of a histogram as a model specified by a multinomial distribution
$$ p(n_1,\ldots,n_k\mid p_1,\ldots,p_k) = \frac{n!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}, $$
where $n_1+\cdots+n_k = n$ and $p_1+\cdots+p_k = 1$.
In this case, we have seen that the MLE is $\widehat{p}_j = n_j/n$ and
$$ l(\widehat{p}_{1:k}) = C + \sum_{j=1}^k n_j\log\frac{n_j}{n}, \quad\text{with}\quad C = \log n! - \sum_{j=1}^k \log n_j!. $$
We have $k-1$ free parameters, so
$$ \mathrm{AIC} = -2\left\{ C + \sum_{j=1}^k n_j\log\frac{n_j}{n} \right\} + 2(k-1). $$
We want to compare this model to models with lower resolution.
To compare the histogram model with a simpler one, we may assume $p_{2j-1} = p_{2j}$ for $j = 1,\ldots,m$, where for the sake of simplicity we consider $k = 2m$. In this case, the new MLE is
$$ \widehat{p}_{2j-1} = \widehat{p}_{2j} = \frac{n_{2j-1}+n_{2j}}{2n}, $$
and the AIC is given by
$$ \mathrm{AIC} = -2\left\{ C + \sum_{j=1}^m (n_{2j-1}+n_{2j})\log\frac{n_{2j-1}+n_{2j}}{2n} \right\} + 2(m-1). $$
Similarly, we can consider the histogram with $m$ bins where $k = 4m$. Applying this to the galaxy data gives

Bins       28       14       7
log-lik.  -189.19  -197.71  -209.52
AIC        432.38   421.43   431.03

so AIC selects 14 bins.
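A minimal sketch of this bin-merging comparison on synthetic data (not the galaxy data from the slide). All models are multinomials on the $k$ finest cells, with the coarser models tying consecutive cell probabilities together, so the common constant $C$ cancels and can be ignored:

```python
import numpy as np

# Compare histogram resolutions by AIC (constant C dropped, since it is
# shared by all nested models on the same fine cells).
def hist_aic(counts, merge):
    n = counts.sum()
    grouped = counts.reshape(-1, merge).sum(axis=1)    # merged cell totals
    # each fine cell in a group has constrained MLE grouped/(merge*n),
    # so sum_i n_i log p_hat_i collapses to the grouped form below
    nz = grouped > 0                                   # guard against log(0)
    ll = np.sum(grouped[nz] * np.log(grouped[nz] / (merge * n)))
    return -2 * ll + 2 * (len(grouped) - 1)            # len(grouped)-1 free params

rng = np.random.default_rng(2)
data = rng.uniform(0, 1, size=1000)
counts, _ = np.histogram(data, bins=16, range=(0, 1))

aics = {merge: hist_aic(counts, merge) for merge in (1, 2, 4)}
best = min(aics, key=aics.get)  # coarser merges tend to win here, since
print(aics, best)               # the data are truly uniform
```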
Application to Polynomial Regression

We are given $n$ observations $\{x_i, y_i\}$. We want to use the following model:
$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), $$
where $k$ is unknown. We have $\theta = (\beta_{0:k}, \sigma^2)$ and
$$ l(\theta) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{j=1}^n \left( y_j - \sum_{m=0}^k \beta_m x_j^m \right)^2. $$
We can easily establish that
$$ l(\widehat{\theta}) = -\frac{n}{2}\log 2\pi\widehat{\sigma}^2 - \frac{n}{2}. $$
Experimental Results

In this case, for the polynomial model of order $k$, we have $k+2$ unknown parameters, so
$$ \mathrm{AIC}(k) = n(\log 2\pi + 1) + n\log\widehat{\sigma}^2 + 2(k+2). $$
This yields

k                         0       1       2       3       4       5       6       7
$l(\widehat{\theta}_k)$   22.41   31.19   41.51   42.52   43.75   44.44   45.00   45.45
AIC                      -40.81  -56.38  -75.03  -75.04  -75.50  -74.89  -74.00  -72.89

AIC selects the model of order 4.
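A minimal sketch of this order-selection procedure on synthetic data (an assumed example, not the dataset from the slide), with $\widehat{\sigma}^2 = \mathrm{RSS}/n$ and $k+2$ parameters ($k+1$ coefficients plus the noise variance):

```python
import numpy as np

# Polynomial order selection by AIC on synthetic data from a cubic.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.1, size=n)  # true order 3

def poly_aic(x, y, k):
    coeffs = np.polyfit(x, y, deg=k)            # least squares = ML fit
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    sigma2_hat = rss / len(y)
    return len(y) * (np.log(2 * np.pi) + 1) + len(y) * np.log(sigma2_hat) + 2 * (k + 2)

aics = [poly_aic(x, y, k) for k in range(8)]
best = int(np.argmin(aics))
print(best)  # an order of at least 3 is needed to capture the cubic term
```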
Application to Daily Temperature Data

Daily minimum temperatures $y_i$ in January, averaged from 1971 through 2000, for city $i$, with $x_{i1}$ latitude, $x_{i2}$ longitude and $x_{i3}$ altitude. A standard model is
$$ y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + a_3 x_{i3} + \varepsilon_i, $$
where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. But we would also like to compare models where some of the explanatory variables are omitted.
Results

We have
$$ \mathrm{AIC}(\text{model}) = n(\log 2\pi + 1) + n\log\widehat{\sigma}^2 + 2\,(\text{number of explanatory variables} + 2). $$
This yields

model                  x1,x3  x1,x2,x3  x1,x2   x1      x2,x3   x2      x3
$\widehat{\sigma}^2$   1.49   1.48      5.11    5.54    5.69    7.81    19.96
AIC                    88.92  90.81     119.71  119.73  122.43  128.35  151.88

We select the model
$$ y_i = 40.49 - 1.11\,x_{i1} - 0.010\,x_{i3} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1.49). $$
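The same subset comparison can be sketched on synthetic data (an assumed example, not the temperature dataset), enumerating all non-empty subsets of regressors and keeping the one with smallest AIC:

```python
import itertools
import numpy as np

# All-subsets regression selection by AIC; only columns 0 and 2 matter.
rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 3))                        # columns play the roles of x1, x2, x3
y = 40.0 - 1.1 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(0, 1.2, size=n)

def subset_aic(cols):
    A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + chosen variables
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma2_hat = np.mean((y - A @ beta) ** 2)
    return n * (np.log(2 * np.pi) + 1) + n * np.log(sigma2_hat) + 2 * (len(cols) + 2)

subsets = [c for r in (1, 2, 3) for c in itertools.combinations(range(3), r)]
best = min(subsets, key=subset_aic)
print(best)  # contains the truly relevant variables 0 and 2
```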
Selection of the Order of an Autoregressive Model

We have the following model:
$$ y_k = \sum_{i=1}^m a_i y_{k-i} + \varepsilon_k, \qquad \varepsilon_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2). $$
Assuming we have data $y_1, y_2, \ldots, y_n$ and that $y_0, y_{-1}, \ldots, y_{1-m}$ are deterministic, we can easily obtain the MLE and check that
$$ l(\widehat{\theta}) = -\frac{n}{2}\log 2\pi\widehat{\sigma}^2 - \frac{n}{2}. $$
Since the model of order $m$ has $m+1$ unknown parameters,
$$ \mathrm{AIC}(m) = n(\log 2\pi + 1) + n\log\widehat{\sigma}^2 + 2(m+1). $$
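A minimal sketch on synthetic data (an assumed example), fitting each order by conditional least squares, which coincides with ML when the initial values are treated as deterministic, as on the slide:

```python
import numpy as np

# AR order selection by AIC on a simulated stable AR(2) series.
rng = np.random.default_rng(5)
n = 500
y = np.zeros(n + 2)
for k in range(2, n + 2):
    y[k] = 0.6 * y[k - 1] - 0.3 * y[k - 2] + rng.normal(0, 1.0)
y = y[2:]

MAX_M = 6  # condition on the same initial values for every order

def ar_aic(y, m):
    t = y[MAX_M:]                                          # common target sample
    A = np.column_stack([y[MAX_M - j : len(y) - j] for j in range(1, m + 1)])
    a, *_ = np.linalg.lstsq(A, t, rcond=None)
    sigma2_hat = np.mean((t - A @ a) ** 2)
    n_eff = len(t)
    return n_eff * (np.log(2 * np.pi) + 1) + n_eff * np.log(sigma2_hat) + 2 * (m + 1)

aics = {m: ar_aic(y, m) for m in range(1, MAX_M + 1)}
best = min(aics, key=aics.get)
print(best)  # at least order 2 is needed for this AR(2) series
```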
Application to Canadian Lynx Data

This corresponds to the logarithms of the annual numbers of lynx trapped from 1821 to 1934; $n = 114$ observations. We try 20 candidate models and $m = 11$ is selected using AIC.
Detection of Structural Changes

We sometimes encounter situations in which the stochastic structure of the data changes at a certain time or location. Let us start with a simple model where
$$ Y_n \sim \mathcal{N}(\mu_n, \sigma^2), \quad\text{where}\quad \mu_n = \begin{cases} \theta_1 & \text{if } n < k, \\ \theta_2 & \text{if } n \geq k, \end{cases} $$
and $k$ is the unknown so-called changepoint. We have $N$ data and, for the model with changepoint $k$, the likelihood is
$$ L(\theta_1, \theta_2, \sigma^2) = \prod_{n=1}^{k-1} \mathcal{N}(y_n; \theta_1, \sigma^2) \prod_{n=k}^{N} \mathcal{N}(y_n; \theta_2, \sigma^2). $$
It can easily be established that
$$ \widehat{\theta}_1 = \frac{1}{k-1}\sum_{n=1}^{k-1} y_n, \qquad \widehat{\theta}_2 = \frac{1}{N-k+1}\sum_{n=k}^{N} y_n, $$
$$ \widehat{\sigma}^2 = \frac{1}{N}\left\{ \sum_{n=1}^{k-1}\left(y_n - \widehat{\theta}_1\right)^2 + \sum_{n=k}^{N}\left(y_n - \widehat{\theta}_2\right)^2 \right\}. $$
So we have the maximum log-likelihood given by
$$ l(\widehat{\theta}_1, \widehat{\theta}_2, \widehat{\sigma}^2) = -\frac{N}{2}\log 2\pi\widehat{\sigma}^2 - \frac{N}{2}, $$
where we emphasize that $\widehat{\sigma}^2$ is a function of $k$. The AIC criterion is given by
$$ \mathrm{AIC}(k) = N\log 2\pi\widehat{\sigma}^2 + N + 2\times 3, $$
and the changepoint $k$ can be automatically determined by finding the value of $k$ that gives the smallest AIC.
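A minimal sketch of this changepoint scan on synthetic data (an assumed example). Since every candidate $k$ gives a model with the same three parameters $(\theta_1, \theta_2, \sigma^2)$, minimizing AIC over $k$ amounts to minimizing $\widehat{\sigma}^2(k)$:

```python
import numpy as np

# Scan all candidate changepoints k and keep the AIC-minimizing one.
rng = np.random.default_rng(6)
N, k_true = 200, 120
y = np.concatenate([rng.normal(0.0, 1.0, k_true - 1),   # mean theta1 before k
                    rng.normal(1.5, 1.0, N - k_true + 1)])  # mean theta2 from k on

def aic_changepoint(y, k):
    seg1, seg2 = y[: k - 1], y[k - 1 :]                 # n = 1..k-1 and n = k..N
    rss = np.sum((seg1 - seg1.mean()) ** 2) + np.sum((seg2 - seg2.mean()) ** 2)
    sigma2_hat = rss / len(y)
    return len(y) * np.log(2 * np.pi * sigma2_hat) + len(y) + 2 * 3

ks = range(2, len(y))                                   # both segments non-empty
k_hat = min(ks, key=lambda k: aic_changepoint(y, k))
print(k_hat)  # close to the true changepoint 120
```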
Application to Factor Analysis

Suppose we have $x = (x_1,\ldots,x_p)^T$, a vector with mean $\mu$ and variance-covariance matrix $\Sigma$. The factor analysis model is
$$ x = \mu + Lf + \varepsilon, $$
where $L$ is a $p\times m$ matrix of factor loadings, whereas $f = (f_1,\ldots,f_m)^T$ and $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_p)^T$ are unobserved random vectors assumed to satisfy
$$ \mathrm{E}[f] = 0,\quad \mathrm{Cov}[f] = I_m,\quad \mathrm{E}[\varepsilon] = 0,\quad \mathrm{Cov}[\varepsilon] = \Psi = \mathrm{diag}(\Psi_1,\ldots,\Psi_p),\quad \mathrm{Cov}[f,\varepsilon] = 0. $$
It then follows that $\Sigma = LL^T + \Psi$.
Let $S$ denote the sample covariance. Under a normality assumption for $f$ and $\varepsilon$, it can be shown that the ML estimates of $L$ and $\Psi$ can be obtained by minimizing
$$ \log|\Sigma| - \log|S| + \mathrm{tr}\left(\Sigma^{-1} S\right) - p $$
subject to $L^T\Psi^{-1}L$ being a diagonal matrix. In this case, we have
$$ \mathrm{AIC}(m) = n\left\{ p\log(2\pi) + \log|\widehat{\Sigma}| + \mathrm{tr}\left(\widehat{\Sigma}^{-1} S\right) \right\} + 2\left\{ p(m+1) - \frac{m(m-1)}{2} \right\}. $$