Model Fitting and Model Selection


Model Fitting and Model Selection
G. Jogesh Babu
Penn State University
http://www.stat.psu.edu/babu
http://astrostatistics.psu.edu

Model Fitting
- Non-linear regression
- Density (shape) estimation
- Parameter estimation of the assumed model
- Goodness of fit

Model Selection
- Nested models (in a quasar spectrum, should one add a broad absorption line (BAL) component to a power-law continuum? Are there 4 planets or 6 orbiting a star?)
- Non-nested models (is the quasar emission process a mixture of blackbodies or a power law?)
- Model misspecification

Model Fitting in Astronomy
- Is the underlying nature of an X-ray stellar spectrum a non-thermal power law or a thermal gas with absorption?
- Are the fluctuations in the cosmic microwave background best fit by Big Bang models with dark energy or with quintessence?
- Are there interesting correlations among the properties of objects in any given class (e.g. the Fundamental Plane of elliptical galaxies), and what are the optimal analytical expressions of such correlations?

Model Selection in Astronomy
- Interpreting the spectrum of an accreting black hole such as a quasar. Is it a non-thermal power law, a sum of featureless blackbodies, and/or a thermal gas with atomic emission and absorption lines?
- Interpreting the radial velocity variations of a large sample of solar-like stars. This can lead to the discovery of orbiting systems such as binary stars and exoplanets, giving insights into star and planet formation.
- Interpreting the spatial fluctuations in the cosmic microwave background radiation. What are the best-fit combinations of baryonic, Dark Matter and Dark Energy components? Are Big Bang models with quintessence or cosmic strings excluded?

A good model should be
- Parsimonious (model simplicity)
- In conformity with the data (goodness of fit)
- Easily generalizable
- Not under-fit, excluding key variables or effects
- Not over-fit, i.e. not unnecessarily complex through extraneous explanatory variables or effects

Under-fitting induces bias, and over-fitting induces high variability. A good model should balance the competing objectives of conformity to the data and parsimony.

Chandra Orion Ultradeep Project (COUP)
- $4Bn Chandra X-ray Observatory, NASA, 1999
- 1616 bright sources; two weeks of observations in 2003
- What is the underlying nature of a stellar spectrum?
- A successful model for a high signal-to-noise X-ray spectrum: a complicated thermal model with several temperatures and element abundances (17 parameters)
- COUP source #410 in the Orion Nebula, with 468 photons: a thermal model with absorption, AV ≈ 1 mag
- Fitting binned data using χ²

Best-fit model: a plausible emission mechanism
The model assumes a single-temperature thermal plasma with solar abundances of elements. It has three free parameters, denoted by a vector θ:
- plasma temperature
- line-of-sight absorption
- normalization
The astrophysical model has been convolved with complicated functions representing the sensitivity of the telescope and detector. The model is fitted by minimizing chi-square with an iterative procedure:
θ̂ = arg min_θ χ²(θ) = arg min_θ Σ_{i=1}^N ( (y_i − M_i(θ)) / σ_i )².
"Chi-square minimization" is a misnomer: it is parameter estimation by weighted least squares.

Limitations of χ² minimization
- Fails when bins have too few data points.
- Binning is arbitrary.
- Binning involves loss of information.
- Data should be independent and identically distributed. Failure of the i.i.d. assumption is common in astronomical data due to effects of the instrumental setup; e.g. it is typical to have 3 pixels for each telescope point spread function (in an image) or spectrograph resolution element (in a spectrum), so adjacent pixels are not i.i.d.
- Does not provide clear procedures for adjudicating between models with different numbers of parameters (e.g. one- vs. two-temperature models) or between different acceptable models (e.g. local minima in χ²(θ) space).
- Unsuitable for obtaining confidence intervals on parameters when complex correlations between the estimators of parameters are present (e.g. a non-parabolic shape near the minimum in χ²(θ) space).

Alternative approach: model fitting based on the EDF, i.e. fitting to the unbinned empirical distribution function.

Correct model family, incorrect parameter value: thermal model with absorption set at AV ≈ 10 mag.
Misspecified model family: power-law model with absorption set at AV ≈ 1 mag. Can the power-law model be excluded with 99% confidence?
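The weighted least-squares fit described above can be sketched in a few lines. This is a minimal illustration, assuming SciPy is available; the power-law toy model and its parameter values are invented for the example and are not the thermal plasma model discussed here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy model: a power law with parameters theta = (norm, index).
def model(x, theta):
    norm, index = theta
    return norm * x ** (-index)

x = np.linspace(1.0, 10.0, 50)        # bin centers
sigma = 0.1 * np.ones_like(x)         # per-bin uncertainties sigma_i
y = model(x, (2.0, 1.5)) + rng.normal(0.0, sigma)  # simulated data y_i

# "Chi-square minimization" = weighted least squares:
#   chi2(theta) = sum_i ((y_i - M_i(theta)) / sigma_i)^2
def chi2(theta):
    return np.sum(((y - model(x, theta)) / sigma) ** 2)

fit = minimize(chi2, x0=[1.0, 1.0], method="Nelder-Mead")
norm_hat, index_hat = fit.x           # theta_hat = arg min chi2(theta)
```

As the text notes, this yields point estimates only; it says nothing by itself about comparing models or about confidence regions.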

EDF-Based Goodness of Fit
1 Statistics based on the EDF
2 Kolmogorov-Smirnov statistic
3 Processes with estimated parameters
4 Bootstrap: parametric and nonparametric
5 Confidence limits under model misspecification

Statistics based on the EDF
Let F_n denote the empirical distribution function and F the hypothesized distribution.
- Kolmogorov-Smirnov: D_n = sup_x |F_n(x) − F(x)|, with one-sided versions sup_x (F_n(x) − F(x))⁺ and sup_x (F(x) − F_n(x))⁺.
- Cramér-von Mises: ∫ (F_n(x) − F(x))² dF(x).
- Anderson-Darling: ∫ (F_n(x) − F(x))² / (F(x)(1 − F(x))) dF(x), which is more sensitive at the tails.
These statistics are distribution-free if F is continuous and univariate. They are no longer distribution-free if either F is not univariate or parameters of F are estimated.

K-S confidence bands: F = F_n ± D_n(α).
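D_n can be computed directly from the sorted sample and cross-checked against a library implementation. A minimal sketch, assuming SciPy; the fully specified null N(0, 1) and the sample size are illustrative choices.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200)    # sample to be tested

# D_n = sup_x |F_n(x) - F(x)| for a fully specified F (here N(0, 1)).
xs = np.sort(x)
n = len(xs)
cdf = norm.cdf(xs)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # sup (F_n - F)+
d_minus = np.max(cdf - np.arange(n) / n)         # sup (F - F_n)+
d_n = max(d_plus, d_minus)

# Same statistic via SciPy. Note the p-value is valid only because F is
# fully specified, with no parameters estimated from the data.
stat, pvalue = kstest(x, norm.cdf)
```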

Kolmogorov-Smirnov Table
K-S probabilities are invalid when the model parameters are estimated from the data; some astronomers use them incorrectly (Lilliefors 1967).

Multivariate Case
Example (Paul B. Simpson, 1951):
F(x, y) = a x² y + (1 − a) x y², 0 < x, y < 1.
Let (X1, Y1) ~ F and let F1 denote the EDF of (X1, Y1). Then
P( |F1(x, y) − F(x, y)| < .72 for all x, y ) > .065 if a = 0 (F(x, y) = x y²),
and < .058 if a = .5 (F(x, y) = ½ x y (x + y)).
Thus in the multivariate case the distribution of the statistic depends on F: the statistic is not distribution-free. Numerical Recipes' treatment of a 2-dimensional K-S test is mathematically invalid.

Processes with estimated parameters
Let {F(.; θ): θ ∈ Θ} be a family of continuous distributions, with Θ an open region in a p-dimensional space, and let X1,...,Xn be a sample from F. We wish to test F = F(.; θ0) for some θ0. The Kolmogorov-Smirnov, Cramér-von Mises and related statistics, when θ is estimated from the data, are continuous functionals of the empirical process
Y_n(x; θ̂_n) = √n ( F_n(x) − F(x; θ̂_n) ),
where θ̂_n = θ_n(X1,...,Xn) is an estimator of θ.

Bootstrap
G_n is an estimator of F based on X1,...,Xn. Draw X*1,...,X*n i.i.d. from G_n, and set θ̂*_n = θ_n(X*1,...,X*n).
Example: F(.; θ) Gaussian with θ = (µ, σ²); if θ̂_n = (X̄_n, s²_n), then θ̂*_n = (X̄*_n, s*²_n).
- Parametric bootstrap: G_n = F(.; θ̂_n), i.e. X*1,...,X*n are i.i.d. from F(.; θ̂_n).
- Nonparametric bootstrap: G_n = F_n, the EDF of X1,...,Xn.
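The nonparametric bootstrap (G_n = F_n) amounts to resampling with replacement from the observed sample. A minimal sketch; the exponential data and the standard error of the sample mean are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=80)   # observed sample X_1,...,X_n

# Nonparametric bootstrap: each X*_1,...,X*_n is a simple random sample
# drawn with replacement from {X_1,...,X_n}, i.e. i.i.d. from the EDF F_n.
B = 1000
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean() for _ in range(B)
])

se_boot = boot_means.std(ddof=1)               # bootstrap SE of the mean
se_plugin = x.std(ddof=1) / np.sqrt(len(x))    # classical plug-in SE
```

The two standard errors agree closely, as expected for a statistic as simple as the mean; the bootstrap's value lies in statistics with no easy analytic standard error.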

Parametric bootstrap
X*1,...,X*n is a sample generated from F(.; θ̂_n); in the Gaussian case, θ̂*_n = (X̄*_n, s*²_n). Both
√n sup_x |F_n(x) − F(x; θ̂_n)|  and  √n sup_x |F*_n(x) − F(x; θ̂*_n)|
have the same limiting distribution. In the XSPEC package, the parametric bootstrap is the command FAKEIT, which makes Monte Carlo simulations of a specified spectral model. Numerical Recipes describes a parametric bootstrap (random sampling of a specified pdf) as the "transformation method" of generating random deviates.

Nonparametric bootstrap
X*1,...,X*n is a sample from F_n, i.e. a simple random sample drawn with replacement from X1,...,Xn. Here a bias correction is needed: with
B_n(x) = √n ( F_n(x) − F(x; θ̂_n) ),
both
√n sup_x |F_n(x) − F(x; θ̂_n)|  and  sup_x | √n ( F*_n(x) − F(x; θ̂*_n) ) − B_n(x) |
have the same limiting distribution. XSPEC does not provide a nonparametric bootstrap capability.

The need for such bias corrections in special situations is well documented in the bootstrap literature:
- χ²-type statistics (Babu, 1984. Statistics with linear combinations of chi-squares as weak limit. Sankhyā, Series A, 46, 85-93.)
- U-statistics (Arcones and Giné, 1992. On the bootstrap of U and V statistics. The Annals of Statistics, 20, 655-674.)

Model misspecification
X1,...,Xn are data from an unknown distribution H. H may or may not belong to the family {F(.; θ): θ ∈ Θ}. H is closest to F(.; θ0) in the sense of Kullback-Leibler information:
∫ h(x) log( h(x)/f(x; θ) ) dν(x) ≥ 0,
∫ |log h(x)| h(x) dν(x) < ∞,
∫ h(x) log f(x; θ0) dν(x) = max_{θ∈Θ} ∫ h(x) log f(x; θ) dν(x).
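The parametric bootstrap recipe for the K-S statistic with estimated parameters can be sketched as follows, assuming SciPy; the Gaussian family, sample size, and number of bootstrap replicates are illustrative choices.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=100)    # observed sample

def ks_estimated(sample):
    # K-S statistic against F(.; theta_hat), with theta_hat = (mean, sd)
    # estimated from the SAME sample -- the tabulated K-S probabilities
    # do not apply to this statistic.
    mu, sd = sample.mean(), sample.std(ddof=1)
    return kstest(sample, lambda t: norm.cdf(t, mu, sd)).statistic

d_obs = ks_estimated(x)

# Parametric bootstrap: draw X*_1,...,X*_n from F(.; theta_hat_n) and
# re-estimate theta on every bootstrap sample.
mu_hat, sd_hat = x.mean(), x.std(ddof=1)
B = 500
d_star = np.array([
    ks_estimated(rng.normal(mu_hat, sd_hat, size=len(x)))
    for _ in range(B)
])
p_boot = np.mean(d_star >= d_obs)     # bootstrap p-value
```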

Confidence limits under model misspecification
For any 0 < α < 1,
P( √n sup_x | F_n(x) − F(x; θ̂_n) − (H(x) − F(x; θ0)) | ≤ C*_α ) − α → 0,
where C*_α is the α-th quantile of
sup_x | √n ( F*_n(x) − F(x; θ̂*_n) ) − √n ( F_n(x) − F(x; θ̂_n) ) |.
This provides an estimate of the distance between the true distribution and the family of distributions under consideration.

Similar conclusions can be drawn for von Mises-type distances:
∫ ( F_n(x) − F(x; θ̂_n) − (H(x) − F(x; θ0)) )² dF(x; θ0),
∫ ( F_n(x) − F(x; θ̂_n) − (H(x) − F(x; θ0)) )² dF(x; θ̂_n).

EDF-based fitting requires few or no probability distributional assumptions such as Gaussianity or Poisson structure.

Discussion so far
- The K-S goodness-of-fit test is often better than the chi-square test.
- K-S cannot handle heteroscedastic errors.
- Anderson-Darling is better at handling the tails of the distributions.
- K-S probabilities are incorrect if the model parameters are estimated from the same data.
- K-S does not work in more than one dimension.
- The bootstrap helps in the last two cases.

So far we have considered model fitting; we now discuss model selection.

MLE and Model Selection
1 Model selection framework
2 Hypothesis testing for model selection: nested models
3 MLE-based hypothesis tests
4 Limitations
5 Penalized likelihood
6 Information criteria based model selection: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC)

Model Selection Framework (based on likelihoods)
Observed data: D. M1,...,Mk are the models for D under consideration. For model Mj, the likelihood is f(D | θj; Mj) and the loglikelihood is l(θj) = log f(D | θj; Mj), where f(D | θj; Mj) is the probability density function (continuous case) or probability mass function (discrete case) evaluated at the data D, and θj is a pj-dimensional parameter vector.

Example: D = (X1,...,Xn), with Xi i.i.d. N(µ, σ²) random variables. The likelihood is
f(D | µ, σ²) = (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σ_{i=1}^n (Xi − µ)² ).

Most of the methodology can be framed as a comparison between two models M1 and M2.

Hypothesis testing for model selection: nested models
The model M1 is said to be nested in M2 if some coordinates of θ1 are fixed: the parameter vector is partitioned as θ2 = (α, γ) and θ1 = (α, γ0), where γ0 is a known fixed constant vector. Comparison of M1 and M2 can then be viewed as the classical hypothesis testing problem H0: γ = γ0.

Example: M2 is Gaussian with mean µ and variance σ²; M1 is Gaussian with mean 0 and variance σ². The model selection problem can be framed as testing H0: µ = 0, with σ a free parameter.

Hypothesis testing is one criterion for comparing two models; classical testing methods are generally used for nested models.

MLE-based hypothesis tests
H0: θ = θ0; θ̂ is the MLE; l(θ) is the loglikelihood at θ.
- Wald test: based on the (standardized) distance between θ0 and θ̂.
- Likelihood ratio test: based on the distance from l(θ0) to l(θ̂).
- Rao score test: based on the gradient of the loglikelihood (the score function) at θ0.
These three MLE-based tests are equivalent to first order of asymptotics, but differ in their second-order properties. No single test among them is uniformly better than the others.

Wald test statistic: Wn = (θ̂n − θ0)² / Var(θ̂n), asymptotically χ²; the standardized distance between θ0 and the MLE θ̂n.
In general Var(θ̂n) is unknown; Var(θ̂n) ≈ 1/I(θ̂n), where I(θ) is the Fisher information. The Wald test rejects H0: θ = θ0 when I(θ̂n)(θ̂n − θ0)² is large.

Likelihood ratio test statistic: l(θ̂n) − l(θ0).

Rao's score (Lagrange multiplier) test statistic:
S(θ0) = (1 / (n I(θ0))) ( Σ_{i=1}^n f′(Xi; θ0) / f(Xi; θ0) )²,
where X1,...,Xn are independent random variables with a common probability density function f(.; θ).
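In the Gaussian case with known variance σ², the three statistics can be computed directly and in fact coincide exactly. A minimal sketch; the sample size, seed, and parameter values are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
sigma = 1.0                 # known standard deviation
theta0 = 0.0                # H0: theta = theta0
x = rng.normal(theta0, sigma, size=50)   # data generated under H0
n, xbar = len(x), x.mean()

def loglik(theta):
    return (-0.5 * n * np.log(2 * np.pi * sigma ** 2)
            - np.sum((x - theta) ** 2) / (2 * sigma ** 2))

# Wald: total Fisher information n/sigma^2 times (theta_hat - theta0)^2
wald = n * (xbar - theta0) ** 2 / sigma ** 2
# Likelihood ratio: twice the loglikelihood difference 2 (l(theta_hat) - l(theta0))
lrt = 2 * (loglik(xbar) - loglik(theta0))
# Rao score: S(theta0) = (n / sigma^2) (xbar - theta0)^2
score = n * (xbar - theta0) ** 2 / sigma ** 2

p_value = chi2.sf(wald, df=1)    # asymptotic chi-square(1) reference
```

The exact agreement of the three statistics here is special to this example; in general they agree only to first order of asymptotics.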

Example
For data from a normal (Gaussian) distribution with known variance σ²,
f(y; θ) = (1/(√(2π) σ)) exp( −(y − θ)²/(2σ²) ),  I(θ) = 1/σ²,
so that
S(θ0) = (1 / (n I(θ0))) ( Σ_{i=1}^n f′(Xi; θ0) / f(Xi; θ0) )² = (n/σ²) (X̄n − θ0)².

Regression context
For data y1,...,yn with Gaussian residuals, the loglikelihood is
l(β) = Σ_{i=1}^n log( (1/(√(2π) σ)) exp( −(yi − xiβ)²/(2σ²) ) ).

Limitations
Caution/objections:
- M1 and M2 are not treated symmetrically, since the null hypothesis is M1.
- One cannot accept H0; one can only reject or fail to reject H0.
- Larger samples can detect smaller discrepancies, and are therefore more likely to lead to rejection of the null hypothesis.

Penalized likelihood
If M1 is nested in M2, then the largest likelihood achievable by M2 will always be larger than that achievable by M1. Adding a penalty on larger models achieves a balance between over-fitting and under-fitting, leading to the so-called penalized likelihood approach. Information criteria based model selection procedures are penalized likelihood procedures.

Information criteria based model selection
The traditional maximum likelihood paradigm provides a mechanism for estimating the unknown parameters of a model having a specified dimension and structure. Hirotugu Akaike (1927-2009) extended this paradigm in 1973 to the case where the model dimension is also unknown.
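With the conventions used in these slides (AIC = 2 l(θ̂) − 2k and BIC = 2 l(θ̂) − k log n, larger is better), information-criterion selection between the nested Gaussian models M1 (mean fixed at 0) and M2 (free mean) can be sketched as follows; the data and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(0.0, 1.0, size=n)   # data generated under the smaller model M1

def gauss_loglik(data, mu, var):
    return (-0.5 * len(data) * np.log(2 * np.pi * var)
            - np.sum((data - mu) ** 2) / (2 * var))

# M1: mean fixed at 0, free variance (k1 = 1 parameter)
var1 = np.mean(x ** 2)                 # MLE of sigma^2 given mu = 0
l1 = gauss_loglik(x, 0.0, var1)

# M2: free mean and variance (k2 = 2 parameters)
mu2 = x.mean()
var2 = np.mean((x - mu2) ** 2)
l2 = gauss_loglik(x, mu2, var2)

aic = {"M1": 2 * l1 - 2 * 1, "M2": 2 * l2 - 2 * 2}
bic = {"M1": 2 * l1 - 1 * np.log(n), "M2": 2 * l2 - 2 * np.log(n)}
best_aic = max(aic, key=aic.get)   # larger criterion value wins here
best_bic = max(bic, key=bic.get)
```

Note that l2 ≥ l1 always (M1 is nested in M2); the penalty terms are what allow the smaller model to win.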

Akaike Information Criterion (AIC)
Grounded in the concept of entropy, Akaike proposed an information criterion, now popularly known as the Akaike Information Criterion, with which both model estimation and selection can be accomplished simultaneously. The AIC for model Mj is
2 l(θ̂j) − 2 kj.
The term 2 l(θ̂j) is known as the goodness-of-fit term, and 2 kj as the penalty; the penalty increases as the complexity of the model grows. AIC is generally regarded as the first model selection criterion, and it continues to be the most widely known and used model selection tool among practitioners.

Bayesian Information Criterion (BIC)
BIC, also known as the Schwarz Bayesian criterion, is
2 l(θ̂j) − kj log n.
- BIC is consistent, unlike AIC.
- Like AIC, the models need not be nested to use BIC.
- AIC penalizes free parameters less strongly than BIC does.
- The conditions under which these two criteria are mathematically justified are often ignored in practice; some practitioners apply them even in situations where they should not be applied.

Caution: sometimes these criteria are given a minus sign, so the goal changes to finding the minimizer.

Advantages of AIC
- Does not require the assumption that one of the candidate models is the true or correct model.
- All models are treated symmetrically, unlike in hypothesis testing.
- Can be used to compare nested as well as non-nested models.
- Can be used to compare models based on different families of probability distributions.

Disadvantages of AIC
- Large samples are required, especially in complex modeling frameworks.
- It leads to inconsistent model selection if there exists a true model of finite order: if k0 is the correct number of parameters and k̂ = k_i with i = arg max_j (2 l(θ̂j) − 2 kj), then lim_{n→∞} P(k̂ > k0) > 0. That is, even with a very large number of observations, k̂ does not approach the true value.

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory (B. N. Petrov and F. Csaki, eds.), Akademiai Kiado, Budapest, 267-281.

Babu, G. J., and Bose, A. (1988). Bootstrap confidence intervals. Statistics & Probability Letters, 7, 151-160.

Babu, G. J., and Rao, C. R. (1993). Bootstrap methodology. In Computational Statistics, Handbook of Statistics 9 (C. R. Rao, ed.), North-Holland, Amsterdam, 627-659.

Babu, G. J., and Rao, C. R. (2003). Confidence limits to the distance of the true distribution from a misspecified family by bootstrap. J. Statistical Planning and Inference, 115, no. 2, 471-478.

Babu, G. J., and Rao, C. R. (2004). Goodness-of-fit tests when parameters are estimated. Sankhyā, 66, no. 1, 63-74.

Getman, K. V., and 23 others (2005). Chandra Orion Ultradeep Project: Observations and source lists. Astrophys. J. Suppl., 160, 319-352.