
Lecture Stat 461-561: Information Criterion

Arnaud Doucet

February 2008

Review of Maximum Likelihood Approach

We have data $X_i \overset{\text{i.i.d.}}{\sim} g(x)$. We model the distribution of these data by a parametric model $\{f(x \mid \theta);\ \theta \in \Theta \subseteq \mathbb{R}^p\}$. We estimate $\theta$ by maximum likelihood; that is,
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} l(\theta), \quad \text{where } l(\theta) := \sum_{k=1}^{n} \log f(X_k \mid \theta).$$
We know that as $n \to \infty$, $\widehat{\theta}$ converges towards
$$\theta^{*} = \arg\min_{\theta \in \Theta} \mathrm{KL}\left(g(x); f(x \mid \theta)\right),$$
where
$$\mathrm{KL}\left(g(x); f(x \mid \theta)\right) := \int g(x) \log \frac{g(x)}{f(x \mid \theta)}\, dx.$$
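As a quick illustration of this setup (not from the slides), here is a minimal Python sketch: it fits an exponential model by maximum likelihood to data drawn from a gamma "truth" $g$ and checks numerically that $\widehat{\theta}$ lands near the KL minimizer, which for this family is $\theta^{*} = 1/E_g[X]$. The gamma/exponential pairing is an assumption chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Fit an Exponential(theta) model, f(x|theta) = theta * exp(-theta * x),
# by ML to data drawn from a Gamma "truth" g (an assumed example).
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=5_000)   # data from g, E_g[X] = 3

def neg_loglik(theta):
    # -l(theta) = -(n log theta - theta * sum_k X_k)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
# The KL minimizer for this family is theta* = 1 / E_g[X] = 1/3.
print(res.x, 1.0 / x.mean())   # numerical MLE vs closed form, both ~ 1/3
```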

A Central Limit Theorem

Assuming the MLE is consistent, a Taylor expansion of the score around $\theta^{*}$ gives
$$0 = \left.\frac{\partial l(\theta)}{\partial \theta}\right|_{\widehat{\theta}} = \left.\frac{\partial l(\theta)}{\partial \theta}\right|_{\theta^{*}} + \left.\frac{\partial^{2} l(\theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\theta^{*}} \left(\widehat{\theta} - \theta^{*}\right) + \dots$$
The law of large numbers yields
$$-\frac{1}{n} \left.\frac{\partial^{2} l(\theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\theta^{*}} \to J(\theta^{*}) := -\int g(x) \left.\frac{\partial^{2} \log f(x \mid \theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\theta^{*}} dx.$$
The CLT yields
$$\frac{1}{\sqrt{n}} \left.\frac{\partial l(\theta)}{\partial \theta}\right|_{\theta^{*}} \overset{D}{\to} \mathcal{N}\left(0, I(\theta^{*})\right),$$
where
$$I(\theta^{*}) = \int g(x) \left.\frac{\partial \log f(x \mid \theta)}{\partial \theta} \frac{\partial \log f(x \mid \theta)}{\partial \theta^{T}}\right|_{\theta^{*}} dx.$$

Using Slutsky's theorem we have
$$\sqrt{n}\left(\widehat{\theta} - \theta^{*}\right) \overset{D}{\to} \mathcal{N}\left(0,\ J^{-1}(\theta^{*})\, I(\theta^{*})\, J^{-1}(\theta^{*})\right).$$
This is sometimes known as the sandwich variance. When $g(x) = f(x \mid \theta^{*})$, we have $J(\theta^{*}) = I(\theta^{*})$ and the standard CLT follows:
$$\sqrt{n}\left(\widehat{\theta} - \theta^{*}\right) \overset{D}{\to} \mathcal{N}\left(0,\ I^{-1}(\theta^{*})\right).$$
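The sandwich result can be checked by simulation. The following sketch (my own, with an assumed misspecified setup) fits the location model $f(x \mid \theta) = \mathcal{N}(\theta, 1)$ to exponential data with mean 2, so that $\theta^{*} = 2$, $J(\theta^{*}) = 1$ and $I(\theta^{*}) = \mathrm{Var}_g(X) = 4$: the variance of $\sqrt{n}(\widehat{\theta} - \theta^{*})$ matches the sandwich value $J^{-1} I J^{-1} = 4$, not the model-based $I^{-1}(\theta^{*}) = 1$.

```python
import numpy as np

# Model f(x|theta) = N(theta, 1) fit to g = Exponential(mean 2), so the
# MLE is the sample mean, theta* = 2, J(theta*) = 1, I(theta*) = 4.
rng = np.random.default_rng(1)
n, reps = 500, 20_000
theta_hat = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)
# Empirical variance of sqrt(n) (theta_hat - theta*): close to the
# sandwich value J^-1 I J^-1 = 4, not the model-based I^-1 = 1.
print(np.var(np.sqrt(n) * (theta_hat - 2.0)))
```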

Evaluating the Statistical Model

We want to compare our model $f(x \mid \widehat{\theta})$ to $g(x)$. One way consists of evaluating
$$\mathrm{KL}\left(g(x); f(x \mid \widehat{\theta})\right) = \int g(x) \log g(x)\, dx - \int g(x) \log f(x \mid \widehat{\theta})\, dx,$$
and the larger $\int g(x) \log f(x \mid \widehat{\theta})\, dx$, the closer the model is to the true one. The crucial issue is to obtain an estimate of
$$E_{g}\left[\log f(X \mid \widehat{\theta})\right] = \int g(x) \log f(x \mid \widehat{\theta})\, dx.$$
Clearly an estimate of $E_{g}[\log f(X \mid \widehat{\theta})]$ is $n^{-1} l(\widehat{\theta})$, and an estimate of $n E_{g}[\log f(X \mid \widehat{\theta})]$ is $l(\widehat{\theta})$.

Selecting a Model

In practice, it is difficult to precisely capture the true structure of given phenomena from a limited number of data. Often several candidate statistical models are considered, and we want to select the one that most closely approximates the true distribution of the data. Given $m$ candidate models $\{f_i(x \mid \theta_i);\ i = 1, \dots, m\}$ and the associated ML estimates $\widehat{\theta}_i$, a simple solution would consist of selecting the model $j$ where
$$j = \arg\max_{i \in \{1, \dots, m\}} l_i\left(\widehat{\theta}_i\right),$$
i.e. picking the model for which $\mathrm{KL}\left(g(x); f_i(x \mid \widehat{\theta}_i)\right)$ appears smallest.

Be Careful!

This approach does not provide a fair comparison of models. The quantity $l_i(\widehat{\theta}_i)$ contains a bias as an estimator of $n E_{g}\left[\log f_i(X \mid \widehat{\theta}_i)\right]$. The reason is that the data $X_1, \dots, X_n$ are used twice: to estimate $\widehat{\theta}_i$ and to approximate the expectation with respect to $g$. We will show that the size of the bias depends on the dimension $p$ of the parameter $\theta$.

Relationship between log-likelihood and expected log-likelihood

Let us denote by $\theta_i^{*} = \arg\min_{\theta_i} \mathrm{KL}\left(g(x); f_i(x \mid \theta_i)\right)$ the (pseudo-)true parameter. We necessarily have
$$E_{g}\left[\log f_i(X \mid \widehat{\theta}_i)\right] \leq E_{g}\left[\log f_i(X \mid \theta_i^{*})\right], \quad \text{whereas} \quad l_i\left(\widehat{\theta}_i\right) \geq l_i\left(\theta_i^{*}\right).$$
To correct for this bias, we want to estimate
$$b_i(g) = E_{g}\left[l_i\left(\widehat{\theta}_i\right) - n E_{g}\left[\log f_i(X \mid \widehat{\theta}_i)\right]\right],$$
where, remember, $\widehat{\theta}_i = \widehat{\theta}_i(X_{1:n})$ and
$$l_i\left(\widehat{\theta}_i\right) = \sum_{k=1}^{n} \log f_i\left(X_k \mid \widehat{\theta}_i\right).$$

Information Criterion

The information criterion for model $i$ is defined as
$$\mathrm{IC}(i) = -2 \sum_{k=1}^{n} \log f_i\left(X_k \mid \widehat{\theta}_i\right) + 2\, \{\text{estimator of } b_i(g)\}.$$
$\mathrm{IC}(i)$ is a bias-corrected estimate of minus twice the expected log-likelihood $n E_{g}[\log f_i(X \mid \widehat{\theta}_i)]$. So, given a collection of models, we will select the model
$$j = \arg\min_{i \in \{1, \dots, m\}} \mathrm{IC}(i).$$

Derivation of Bias

We want to estimate the bias $b(g)$, which is given by
$$E_{g}\left[l\left(\widehat{\theta}(X_{1:n})\right) - n E_{g}\left[\log f\left(X \mid \widehat{\theta}(X_{1:n})\right)\right]\right] = D_1 + D_2 + D_3,$$
where
$$D_1 = E_{g}\left[l\left(\widehat{\theta}(X_{1:n})\right) - l(\theta^{*})\right],$$
$$D_2 = E_{g}\left[l(\theta^{*}) - n E_{g}\left[\log f(X \mid \theta^{*})\right]\right],$$
$$D_3 = E_{g}\left[n E_{g}\left[\log f(X \mid \theta^{*})\right] - n E_{g}\left[\log f\left(X \mid \widehat{\theta}(X_{1:n})\right)\right]\right].$$

Calculation of second term

We have
$$D_2 = E_{g}\left[l(\theta^{*}) - n E_{g}\left[\log f(X \mid \theta^{*})\right]\right] = E_{g}\left[\sum_{k=1}^{n} \log f(X_k \mid \theta^{*})\right] - n E_{g}\left[\log f(X \mid \theta^{*})\right] = 0.$$
No bias for this term.

Calculation of third term

Let us denote $\eta(\theta) := E_{g}\left[\log f(X \mid \theta)\right]$, so that $\eta(\widehat{\theta})$ is this expectation evaluated at the MLE. By performing a Taylor expansion around $\theta^{*}$, we obtain
$$\eta\left(\widehat{\theta}\right) = \eta(\theta^{*}) + \frac{\partial \eta(\theta^{*})}{\partial \theta}^{T} \left(\widehat{\theta} - \theta^{*}\right) + \frac{1}{2} \left(\widehat{\theta} - \theta^{*}\right)^{T} \left.\frac{\partial^{2} \eta(\theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\theta^{*}} \left(\widehat{\theta} - \theta^{*}\right) + \dots$$
As $\left.\partial \eta(\theta)/\partial \theta\right|_{\theta^{*}} = 0$, then
$$\eta\left(\widehat{\theta}\right) - \eta(\theta^{*}) \approx -\frac{1}{2} \left(\widehat{\theta} - \theta^{*}\right)^{T} J(\theta^{*}) \left(\widehat{\theta} - \theta^{*}\right),$$
where
$$J(\theta^{*}) = -\int g(x) \left.\frac{\partial^{2} \log f(x \mid \theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\theta^{*}} dx.$$

It follows that
$$D_3 = E_{g}\left[n\left(\eta(\theta^{*}) - \eta(\widehat{\theta})\right)\right] \approx \frac{n}{2} E_{g}\left[\left(\widehat{\theta} - \theta^{*}\right)^{T} J(\theta^{*}) \left(\widehat{\theta} - \theta^{*}\right)\right] = \frac{n}{2} \operatorname{tr}\left(J(\theta^{*})\, E_{g}\left[\left(\widehat{\theta} - \theta^{*}\right)\left(\widehat{\theta} - \theta^{*}\right)^{T}\right]\right).$$
From the CLT established previously, we have
$$E_{g}\left[\left(\widehat{\theta} - \theta^{*}\right)\left(\widehat{\theta} - \theta^{*}\right)^{T}\right] \approx \frac{1}{n} J(\theta^{*})^{-1} I(\theta^{*})\, J(\theta^{*})^{-1},$$
so
$$D_3 \approx \frac{1}{2} \operatorname{tr}\left\{I(\theta^{*})\, J(\theta^{*})^{-1}\right\}.$$

Calculation of first term

A Taylor expansion of $l$ around $\widehat{\theta}$ gives
$$l(\theta^{*}) = l\left(\widehat{\theta}\right) + \left.\frac{\partial l(\theta)}{\partial \theta^{T}}\right|_{\widehat{\theta}} \left(\theta^{*} - \widehat{\theta}\right) + \frac{1}{2} \left(\theta^{*} - \widehat{\theta}\right)^{T} \left.\frac{\partial^{2} l(\theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\widehat{\theta}} \left(\theta^{*} - \widehat{\theta}\right) + \dots,$$
so, using $\left.\partial l(\theta)/\partial \theta\right|_{\widehat{\theta}} = 0$ and $-n^{-1} \left.\partial^{2} l(\theta)/\partial \theta\, \partial \theta^{T}\right|_{\widehat{\theta}} \approx J(\theta^{*})$,
$$l(\theta^{*}) - l\left(\widehat{\theta}\right) \approx -\frac{n}{2} \left(\theta^{*} - \widehat{\theta}\right)^{T} J(\theta^{*}) \left(\theta^{*} - \widehat{\theta}\right).$$
Hence
$$D_1 = E_{g}\left[l\left(\widehat{\theta}(X_{1:n})\right) - l(\theta^{*})\right] \approx \frac{n}{2} E_{g}\left[\left(\widehat{\theta} - \theta^{*}\right)^{T} J(\theta^{*}) \left(\widehat{\theta} - \theta^{*}\right)\right] = \frac{1}{2} \operatorname{tr}\left\{I(\theta^{*})\, J(\theta^{*})^{-1}\right\}.$$

Estimate of the Bias

We have
$$b(g) = D_1 + D_2 + D_3 \approx \frac{1}{2} \operatorname{tr}\left\{I(\theta^{*})\, J(\theta^{*})^{-1}\right\} + 0 + \frac{1}{2} \operatorname{tr}\left\{I(\theta^{*})\, J(\theta^{*})^{-1}\right\} = \operatorname{tr}\left\{I(\theta^{*})\, J(\theta^{*})^{-1}\right\}.$$
Let $\widehat{I}$ and $\widehat{J}$ be consistent estimates of $I(\theta^{*})$ and $J(\theta^{*})$, say
$$\widehat{I} = \frac{1}{n} \sum_{k=1}^{n} \left.\frac{\partial \log f(X_k \mid \theta)}{\partial \theta} \frac{\partial \log f(X_k \mid \theta)}{\partial \theta^{T}}\right|_{\widehat{\theta}}, \qquad \widehat{J} = -\frac{1}{n} \sum_{k=1}^{n} \left.\frac{\partial^{2} \log f(X_k \mid \theta)}{\partial \theta\, \partial \theta^{T}}\right|_{\widehat{\theta}};$$
then we can estimate the bias through
$$\widehat{b}(g) = \operatorname{tr}\left(\widehat{I}\, \widehat{J}^{-1}\right).$$
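A hedged sketch of this estimator: the code below computes $\widehat{b}(g) = \operatorname{tr}(\widehat{I}\, \widehat{J}^{-1})$ for the $\mathcal{N}(\mu, \sigma^2)$ model using its analytic per-observation score and Hessian at the MLE; the choice of model and of the test distributions is mine, not the lecture's.

```python
import numpy as np

# TIC bias estimate tr(I_hat J_hat^{-1}) for the N(mu, s2) model, using
# the analytic per-observation score and Hessian evaluated at the MLE.
def tic_bias_normal(x):
    n = len(x)
    mu, s2 = x.mean(), x.var()                    # MLEs of (mu, sigma^2)
    d = x - mu
    # Score rows: (d / s2, -1/(2 s2) + d^2 / (2 s2^2))
    score = np.column_stack([d / s2, -0.5 / s2 + d**2 / (2 * s2**2)])
    I_hat = score.T @ score / n
    # J_hat = minus the average Hessian of log f at the MLE
    md, md2 = d.mean(), (d**2).mean()             # md == 0 at the MLE
    J_hat = np.array([[1 / s2, md / s2**2],
                      [md / s2**2, -0.5 / s2**2 + md2 / s2**3]])
    return np.trace(I_hat @ np.linalg.inv(J_hat))

rng = np.random.default_rng(2)
print(tic_bias_normal(rng.normal(size=10_000)))       # ~ p = 2: model holds
print(tic_bias_normal(rng.exponential(size=10_000)))  # > 2: misspecified
```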

Information Criterion

We have, for the information criterion for model $i$,
$$\mathrm{IC}(i) = -2 \sum_{k=1}^{n} \log f_i\left(X_k \mid \widehat{\theta}_i\right) + 2 \operatorname{tr}\left(\widehat{I}_i\, \widehat{J}_i^{-1}\right).$$
Assuming $g(x) = f_i(x \mid \theta_i^{*})$, then $I_i(\theta_i^{*}) = J_i(\theta_i^{*})$, so $b_i(g) = \operatorname{tr}\left(I_i(\theta_i^{*})\, J_i(\theta_i^{*})^{-1}\right) = p_i$ (where $p_i$ is the dimension of $\theta_i$). So in this case
$$\mathrm{AIC}(i) = -2 \sum_{k=1}^{n} \log f_i\left(X_k \mid \widehat{\theta}_i\right) + 2 p_i.$$
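A minimal helper for the AIC selection rule (the log-likelihood and parameter-count values below are hypothetical placeholders):

```python
import numpy as np

# Generic AIC selection: the smallest -2 * loglik + 2 * p wins.
def aic_select(logliks, n_params):
    aic = [-2.0 * ll + 2 * p for ll, p in zip(logliks, n_params)]
    return int(np.argmin(aic)), aic

# Hypothetical maximized log-likelihoods and parameter counts:
best, scores = aic_select([-120.4, -118.9, -118.7], [2, 3, 5])
print(best, scores)   # index of the AIC-minimizing model and all scores
```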

Checking the Equality of Two Discrete Distributions

Assume we have two sets of data, each having $k$ categories:

Category: $1, 2, \dots, k$
Data set 1: $n_1, n_2, \dots, n_k$
Data set 2: $m_1, m_2, \dots, m_k$

where we have $n_1 + \dots + n_k = n$ observations for the first dataset and $m_1 + \dots + m_k = m$ for the second. We assume these datasets follow the multinomial distributions
$$p(n_1, \dots, n_k \mid p_1, \dots, p_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}, \qquad p(m_1, \dots, m_k \mid q_1, \dots, q_k) = \frac{m!}{m_1! \cdots m_k!}\, q_1^{m_1} \cdots q_k^{m_k}.$$
We want to check whether $(p_1, \dots, p_k) \neq (q_1, \dots, q_k)$ or $(p_1, \dots, p_k) = (q_1, \dots, q_k)$.

Assume the two distributions are different; then
$$l(p_{1:k}, q_{1:k}) = \log n! - \sum_{j=1}^{k} \log n_j! + \sum_{j=1}^{k} n_j \log p_j + \log m! - \sum_{j=1}^{k} \log m_j! + \sum_{j=1}^{k} m_j \log q_j.$$
We have the MLEs
$$\widehat{p}_j = \frac{n_j}{n}, \qquad \widehat{q}_j = \frac{m_j}{m}.$$
So
$$l(\widehat{p}_{1:k}, \widehat{q}_{1:k}) = C + \sum_{j=1}^{k} \left(n_j \log \frac{n_j}{n} + m_j \log \frac{m_j}{m}\right), \quad \text{with } C = \log n! + \log m! - \sum_{j=1}^{k} \left(\log n_j! + \log m_j!\right),$$
and, since this model has $2(k-1)$ free parameters,
$$\mathrm{AIC} = -2 \left\{C + \sum_{j=1}^{k} \left(n_j \log \frac{n_j}{n} + m_j \log \frac{m_j}{m}\right)\right\} + 4(k - 1).$$

If we assume the two distributions are equal ($p_j = q_j = r_j$), then
$$l(r_{1:k}) = C + \sum_{j=1}^{k} (n_j + m_j) \log r_j,$$
and the MLE is
$$\widehat{r}_j = \frac{n_j + m_j}{n + m}.$$
We have
$$l(\widehat{r}_{1:k}) = C + \sum_{j=1}^{k} (n_j + m_j) \log \frac{n_j + m_j}{n + m},$$
and, with $k - 1$ free parameters,
$$\mathrm{AIC} = -2 \left\{C + \sum_{j=1}^{k} (n_j + m_j) \log \frac{n_j + m_j}{n + m}\right\} + 2(k - 1).$$

Example

Consider the following data:

Category: 1, 2, 3, 4, 5
Data set 1: 304, 800, 400, 57, 323
Data set 2: 174, 509, 362, 80, 214

From this table we can deduce:

$\widehat{p}_j$: 0.16, 0.42, 0.21, 0.03, 0.17
$\widehat{q}_j$: 0.13, 0.38, 0.27, 0.06, 0.16
$\widehat{r}_j$: 0.15, 0.41, 0.24, 0.04, 0.17

Ignoring the constant $C$, the maximum log-likelihoods of the models are
$$\text{Model 1:} \quad \sum_{j=1}^{k} \left(n_j \log \frac{n_j}{n} + m_j \log \frac{m_j}{m}\right) = -4567.36,$$
$$\text{Model 2:} \quad \sum_{j=1}^{k} (n_j + m_j) \log \frac{n_j + m_j}{n + m} = -4585.61.$$
The number of free parameters is $2(k - 1) = 8$ in model 1 and $k - 1 = 4$ in model 2. It follows that the AIC for model 1 and model 2 is respectively 9150.73 and 9179.22, so AIC selects Model 1.
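The slide's comparison is easy to reproduce; a sketch (dropping the common constant $C$, exactly as above):

```python
import numpy as np

n_j = np.array([304, 800, 400, 57, 323])
m_j = np.array([174, 509, 362, 80, 214])
n, m, k = n_j.sum(), m_j.sum(), len(n_j)

# Model 1: different distributions, 2(k-1) free parameters.
ll1 = (n_j * np.log(n_j / n)).sum() + (m_j * np.log(m_j / m)).sum()
# Model 2: equal distributions, k-1 free parameters.
ll2 = ((n_j + m_j) * np.log((n_j + m_j) / (n + m))).sum()
aic1, aic2 = -2 * ll1 + 4 * (k - 1), -2 * ll2 + 2 * (k - 1)
print(round(aic1, 2), round(aic2, 2))   # 9150.73 vs 9179.22: model 1 wins
```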

Determining the number of bins of a histogram

Histograms are used for representing the properties of a set of observations obtained from either a discrete or continuous distribution. Assume we have a histogram $\{n_1, n_2, \dots, n_k\}$, where $k$ is referred to as the number of bins. If $k$ is too large or too small, then it is difficult to capture the characteristics of the true distribution. We can think of a histogram as a model specified by a multinomial distribution
$$p(n_1, \dots, n_k \mid p_1, \dots, p_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k},$$
where $n_1 + \dots + n_k = n$ and $p_1 + \dots + p_k = 1$.

In this case, we have seen that the MLE is $\widehat{p}_j = n_j / n$ and
$$l(\widehat{p}_{1:k}) = C + \sum_{j=1}^{k} n_j \log \frac{n_j}{n}, \quad \text{with } C = \log n! - \sum_{j=1}^{k} \log n_j!.$$
We have $k - 1$ free parameters, so
$$\mathrm{AIC} = -2 \left\{C + \sum_{j=1}^{k} n_j \log \frac{n_j}{n}\right\} + 2(k - 1).$$
We want to compare this model to models with lower resolution.

To compare the histogram model with a simpler one, we may assume $p_{2j-1} = p_{2j}$ for $j = 1, \dots, m$, where for the sake of simplicity we consider $k = 2m$. In this case, the new MLE is $\widehat{p}_{2j-1} = \widehat{p}_{2j} = \frac{n_{2j-1} + n_{2j}}{2n}$, and the AIC is given by
$$\mathrm{AIC} = -2 \left\{C + \sum_{j=1}^{m} (n_{2j-1} + n_{2j}) \log \frac{n_{2j-1} + n_{2j}}{2n}\right\} + 2(m - 1).$$
Similarly, we can consider the histogram with $m$ bins where $k = 4m$. We apply this to the galaxy data, for which AIC selects 14 bins:

bins: 28, 14, 7
log-likelihood: -189.19, -197.71, -209.52
AIC: 432.38, 421.43, 431.03
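A sketch of the half-resolution comparison for an arbitrary histogram (the simulated counts are placeholders, and $k$ is assumed even so bins can be merged in pairs):

```python
import numpy as np

# AIC of a k-bin histogram vs. the half-resolution model p_{2j-1} = p_{2j};
# the constant C is common to both and omitted. Assumes k is even.
def histogram_aics(counts):
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    nz = counts[counts > 0]
    full = -2 * (nz * np.log(nz / n)).sum() + 2 * (k - 1)
    merged = counts.reshape(-1, 2).sum(axis=1)      # n_{2j-1} + n_{2j}
    mz = merged[merged > 0]
    half = -2 * (mz * np.log(mz / (2 * n))).sum() + 2 * (k // 2 - 1)
    return full, half

rng = np.random.default_rng(3)
counts, _ = np.histogram(rng.normal(size=500), bins=28)
print(histogram_aics(counts))   # the smaller AIC indicates the resolution
```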

Application to Polynomial Regression

We are given $n$ observations $\{x_i, y_i\}$. We want to use the following model:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$
where $k$ is unknown. We have $\theta = (\beta_{0:k}, \sigma^2)$ and
$$l(\theta) = -\frac{n}{2} \log 2\pi\sigma^{2} - \frac{1}{2\sigma^{2}} \sum_{j=1}^{n} \left(y_j - \sum_{m=0}^{k} \beta_m x_j^{m}\right)^{2}.$$
We can easily establish that
$$l\left(\widehat{\theta}\right) = -\frac{n}{2} \log 2\pi\widehat{\sigma}^{2} - \frac{n}{2}.$$

Experimental Results

In this case, for the polynomial model of order $k$ we have $k + 2$ unknown parameters, so
$$\mathrm{AIC}(k) = n(\log 2\pi + 1) + n \log \widehat{\sigma}^{2} + 2(k + 2).$$
This yields:

$k$: 0, 1, 2, 3, 4, 5, 6, 7
$l(\widehat{\theta}_k)$: 22.41, 31.19, 41.51, 42.52, 43.75, 44.44, 45.00, 45.45
AIC: -40.81, -56.38, -75.03, -75.04, -75.50, -74.89, -74.00, -72.89

AIC selects the model of order $k = 4$.
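A sketch of the AIC scan over polynomial orders, using least squares (which coincides with the Gaussian MLE of $\beta$) together with the MLE $\widehat{\sigma}^2$ of the residual variance; the simulated data, with a true order of 2, are an assumption for illustration:

```python
import numpy as np

# AIC(k) = n(log 2 pi + 1) + n log s2_hat + 2(k + 2) for order-k fits.
def poly_aic(x, y, k):
    n = len(y)
    X = np.vander(x, k + 1)                       # columns x^k, ..., x, 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares = MLE
    s2 = np.mean((y - X @ beta) ** 2)             # MLE of sigma^2
    return n * (np.log(2 * np.pi) + 1) + n * np.log(s2) + 2 * (k + 2)

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 60)
y = 1 + 2 * x - 3 * x**2 + 0.1 * rng.normal(size=60)   # true order 2
print(min(range(8), key=lambda k: poly_aic(x, y, k)))  # typically 2
```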

Application to Daily Temperature Data

We have daily minimum temperatures $y_i$ in January, averaged from 1971 through 2000, for city $i$, together with $x_{i1}$ (latitude), $x_{i2}$ (longitude) and $x_{i3}$ (altitude). A standard model is
$$y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + a_3 x_{i3} + \varepsilon_i,$$
where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. But we would also like to compare models where some of the explanatory variables are omitted.

Results

We have
$$\mathrm{AIC}(\text{model}) = n(\log 2\pi + 1) + n \log \widehat{\sigma}^{2} + 2\,(\text{nb. of explanatory variables} + 2).$$
This yields:

model: $(x_1, x_3)$, $(x_1, x_2, x_3)$, $(x_1, x_2)$, $(x_1)$, $(x_2, x_3)$, $(x_2)$, $(x_3)$
$\widehat{\sigma}^2$: 1.49, 1.48, 5.11, 5.54, 5.69, 7.81, 19.96
AIC: 88.92, 90.81, 119.71, 119.73, 122.43, 128.35, 151.88

We select the model
$$y_i = 40.49 - 1.11 x_{i1} - 0.010 x_{i3} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1.49).$$
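The all-subsets version of this comparison is a short loop; the synthetic $X$ and $y$ below are stand-ins for the (unavailable) temperature data:

```python
import numpy as np
from itertools import combinations

# AIC over every non-empty subset of explanatory variables.
def best_subset_aic(X, y):
    n, p = X.shape
    scores = {}
    for r in range(1, p + 1):
        for cols in combinations(range(p), r):
            Z = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            s2 = np.mean((y - Z @ beta) ** 2)
            scores[cols] = (n * (np.log(2 * np.pi) + 1)
                            + n * np.log(s2) + 2 * (len(cols) + 2))
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))                   # stand-ins for x1, x2, x3
y = 40.5 - 1.1 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
print(best_subset_aic(X, y)[0])                 # typically (0, 2): x1, x3
```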

Selection of Order of Autoregressive Model

We have the following model:
$$y_k = \sum_{i=1}^{m} a_i y_{k-i} + \varepsilon_k, \qquad \varepsilon_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$
Assuming we have data $y_1, y_2, \dots, y_n$ and that $y_0, y_{-1}, \dots, y_{1-m}$ are deterministic, then we can easily obtain the MLE and check that
$$l\left(\widehat{\theta}\right) = -\frac{n}{2} \log 2\pi\widehat{\sigma}^{2} - \frac{n}{2}.$$
Since the model of order $m$ has $m + 1$ unknown parameters,
$$\mathrm{AIC}(m) = n(\log 2\pi + 1) + n \log \widehat{\sigma}^{2} + 2(m + 1).$$
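A sketch of the AIC scan over AR orders using a conditional least-squares fit (equivalent to the MLE above except that here the first $m$ observations are simply conditioned on rather than treating pre-sample values as fixed); the simulated AR(2) series is an assumption:

```python
import numpy as np

# Conditional least-squares fit of AR(m) and the corresponding AIC(m).
def ar_aic(y, m):
    Y = y[m:]                                     # responses y_{m+1}, ...
    X = np.column_stack([y[m - i:len(y) - i] for i in range(1, m + 1)])
    a, *_ = np.linalg.lstsq(X, Y, rcond=None)     # lag coefficients a_i
    s2 = np.mean((Y - X @ a) ** 2)
    n = len(Y)
    return n * (np.log(2 * np.pi) + 1) + n * np.log(s2) + 2 * (m + 1)

rng = np.random.default_rng(6)
y = np.zeros(400)
for t in range(2, 400):                           # simulate an AR(2) truth
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
print(min(range(1, 11), key=lambda m: ar_aic(y, m)))   # typically 2
```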

Application to Canadian Lynx Data

This corresponds to the logarithms of the annual numbers of lynx trapped from 1821 to 1934; $n = 114$ observations. We try 20 candidate models, and $m = 11$ is selected using AIC.

Detection of Structural Changes

We sometimes encounter the situation in which the stochastic structure of the data changes at a certain time or location. Let us start with a simple model where
$$Y_n \sim \mathcal{N}\left(\mu_n, \sigma^2\right), \qquad \mu_n = \begin{cases} \theta_1 & \text{if } n < k, \\ \theta_2 & \text{if } n \geq k, \end{cases}$$
where $k$ is the unknown so-called changepoint. We have $N$ data and, for the model with changepoint $k$, the likelihood is
$$L\left(\theta_1, \theta_2, \sigma^2\right) = \prod_{n=1}^{k-1} \mathcal{N}\left(y_n; \theta_1, \sigma^2\right) \prod_{n=k}^{N} \mathcal{N}\left(y_n; \theta_2, \sigma^2\right).$$

It can be easily established that
$$\widehat{\theta}_1 = \frac{1}{k - 1} \sum_{n=1}^{k-1} y_n, \qquad \widehat{\theta}_2 = \frac{1}{N - k + 1} \sum_{n=k}^{N} y_n, \qquad \widehat{\sigma}^{2} = \frac{1}{N} \left\{\sum_{n=1}^{k-1} \left(y_n - \widehat{\theta}_1\right)^{2} + \sum_{n=k}^{N} \left(y_n - \widehat{\theta}_2\right)^{2}\right\}.$$
So we have the maximum log-likelihood given by
$$l\left(\widehat{\theta}_1, \widehat{\theta}_2, \widehat{\sigma}^{2}\right) = -\frac{N}{2} \log 2\pi\widehat{\sigma}^{2} - \frac{N}{2},$$
where we emphasize that $\widehat{\sigma}^{2}$ is a function of $k$. The AIC criterion is given by
$$\mathrm{AIC} = N \log 2\pi\widehat{\sigma}^{2} + N + 2 \times 3,$$
and the changepoint $k$ can be automatically determined by finding the value of $k$ that gives the smallest AIC.
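A sketch of the changepoint scan (the simulated data, with a mean shift after 60 observations, are an assumption for illustration; indices follow the slide's 1-based convention with regime 1 covering $n < k$):

```python
import numpy as np

# Scan the changepoint k (1-indexed, regime 1 is n < k) and minimize
# AIC(k) = N log(2 pi s2_hat(k)) + N + 2 * 3.
def changepoint_aic(y):
    N = len(y)
    best = None
    for k in range(2, N):                 # at least one point per regime
        left, right = y[:k - 1], y[k - 1:]
        s2 = (np.sum((left - left.mean()) ** 2)
              + np.sum((right - right.mean()) ** 2)) / N
        aic = N * np.log(2 * np.pi * s2) + N + 6
        if best is None or aic < best[1]:
            best = (k, aic)
    return best

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])
print(changepoint_aic(y))   # k near 61, i.e. the true regime change
```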

Application to Factor Analysis

Suppose we have $x = (x_1, \dots, x_p)^{T}$, a vector with mean $\mu$ and variance-covariance matrix $\Sigma$. The factor analysis model is
$$x = \mu + L f + \varepsilon,$$
where $L$ is a $p \times m$ matrix of factor loadings, whereas $f = (f_1, \dots, f_m)^{T}$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_p)^{T}$ are unobserved random vectors assumed to satisfy
$$E[f] = 0, \quad \mathrm{Cov}[f] = I_m, \quad E[\varepsilon] = 0, \quad \mathrm{Cov}[\varepsilon] = \Psi = \mathrm{diag}(\Psi_1, \dots, \Psi_p), \quad \mathrm{Cov}[f, \varepsilon] = 0.$$
It then follows that $\Sigma = L L^{T} + \Psi$.

Let $S$ denote the sample covariance. Under a normality assumption for $f$ and $\varepsilon$, it can be shown that the ML estimates of $L$ and $\Psi$ can be obtained by minimizing
$$\log |\Sigma| - \log |S| + \operatorname{tr}\left(\Sigma^{-1} S\right) - p$$
subject to $L^{T} \Psi^{-1} L$ being a diagonal matrix. In this case, we have
$$\mathrm{AIC}(m) = n \left\{p \log(2\pi) + \log \left|\widehat{\Sigma}\right| + \operatorname{tr}\left(\widehat{\Sigma}^{-1} S\right)\right\} + 2 \left\{p(m + 1) - \frac{1}{2} m(m - 1)\right\}.$$
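A hedged sketch of $\mathrm{AIC}(m)$ using scikit-learn's EM-based maximum-likelihood factor analysis as a stand-in for the constrained ML fit above (scikit-learn does not impose the $L^{T} \Psi^{-1} L$ diagonal constraint, but that constraint only fixes a rotation and leaves $\widehat{\Sigma}$, and hence the likelihood, unchanged); the simulated two-factor data are an assumption:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# AIC(m) for an m-factor model, with Sigma_hat = L L^T + Psi taken from
# scikit-learn's EM-based ML fit (rotation differs, likelihood does not).
def fa_aic(X, m):
    n, p = X.shape
    S = np.cov(X, rowvar=False, bias=True)        # ML sample covariance
    fa = FactorAnalysis(n_components=m).fit(X)
    Sigma = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
    fit = (p * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
           + np.trace(np.linalg.solve(Sigma, S)))
    n_free = p * (m + 1) - m * (m - 1) // 2       # free parameters
    return n * fit + 2 * n_free

rng = np.random.default_rng(8)
L = rng.normal(size=(6, 2))                       # true 2-factor loadings
X = rng.normal(size=(300, 2)) @ L.T + rng.normal(scale=0.5, size=(300, 6))
print(min(range(1, 4), key=lambda m: fa_aic(X, m)))   # typically 2
```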