Frequentist-Bayesian Model Comparisons: A Simple Example



Consider data that consist of a signal $y$ with additive noise.

Data vector ($N$ elements):
$$D = y + n$$

The additive noise $n$ has zero mean and a diagonal covariance matrix:
$$\langle n \rangle = 0, \qquad C = \mathrm{diag}(\sigma_j^2).$$

Linear model:
$$y = X\theta, \qquad X = \text{design matrix}, \qquad \theta = \text{parameter vector with } M \text{ elements}.$$

For any given choice of $\theta$, we can estimate the noise vector as $\hat n = D - \hat y$, where $\hat y = X\theta$.

Cost function for least squares [sometimes written as $C(\theta)$ or $Q(\theta)$]:
$$\chi^2(\theta) = \hat n^{\dagger} C^{-1} \hat n = (D - \hat y)^{\dagger} C^{-1} (D - \hat y)$$

Minimizing the cost function, $\nabla_\theta \chi^2(\theta) = 0$, gives a system of equations with solution
$$\hat\theta = \left( X^{\dagger} C^{-1} X \right)^{-1} X^{\dagger} C^{-1} D.$$

The existence of a solution depends on whether $X^{\dagger} C^{-1} X$ is invertible. A bad choice of basis functions in $X$ may lead to an ill-conditioned matrix. Singular value decomposition or a better choice of basis functions may be needed.

For a noise PDF that is Gaussian, the likelihood function is
$$L(\theta) \propto e^{-\chi^2(\theta)/2}.$$

The covariance matrix of the parameters is
$$P_\theta = \langle \delta\theta\, \delta\theta^{\dagger} \rangle = \left( X^{\dagger} C^{-1} X \right)^{-1},$$
where $\delta\theta$ is the deviation of the estimated parameter vector from the true one.
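The least-squares solution above can be sketched numerically. This is a minimal illustration, assuming real-valued data (so the dagger is a plain transpose), a diagonal $C$ supplied as a vector of per-point standard deviations, and NumPy; the function name `weighted_least_squares` is hypothetical and not part of the original notes.

```python
import numpy as np

def weighted_least_squares(X, D, sigma):
    """Solve theta_hat = (X^T C^-1 X)^-1 X^T C^-1 D for C = diag(sigma**2).

    X     : (N, M) design matrix
    D     : (N,) data vector
    sigma : (N,) noise standard deviations (assumed known)
    """
    w = 1.0 / sigma**2                     # diagonal of C^-1
    A = X.T @ (w[:, None] * X)             # X^T C^-1 X
    b = X.T @ (w * D)                      # X^T C^-1 D
    theta_hat = np.linalg.solve(A, b)      # raises LinAlgError if A is singular
    P_theta = np.linalg.inv(A)             # parameter covariance matrix
    resid = D - X @ theta_hat              # estimated noise vector
    chi2_min = float(resid @ (w * resid))  # cost function at its minimum
    return theta_hat, P_theta, chi2_min
```

If `A` is ill-conditioned, `np.linalg.lstsq` applied to the whitened system `X / sigma[:, None]`, `D / sigma` is an SVD-based alternative.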

Comments

For a linear model, the solution is unique if there is no noise.

The least-squares solution with non-zero noise is also unique: there is only one minimum in the cost function. The cost function is quadratic in $\delta\theta = \theta - \theta_{\rm true}$.

However, different realizations of the data will yield different solutions, whose range is quantified by $P_\theta$.
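The last comment can be checked with a small Monte Carlo experiment: draw many noise realizations, solve the least-squares problem for each, and compare the scatter of the estimates with $P_\theta$. This is only a sketch under assumed conditions (Gaussian noise, diagonal $C$, real-valued data); the function name `simulate_estimates`, the number of trials, and the seed are illustrative choices.

```python
import numpy as np

def simulate_estimates(X, theta_true, sigma, n_trials=2000, seed=0):
    """Return least-squares estimates for many noise realizations, plus P_theta."""
    rng = np.random.default_rng(seed)
    w = 1.0 / sigma**2                          # diagonal of C^-1
    A = X.T @ (w[:, None] * X)                  # X^T C^-1 X
    P_theta = np.linalg.inv(A)                  # predicted parameter covariance
    # Each row of `data` is one realization D = X theta_true + n.
    noise = rng.normal(0.0, sigma, size=(n_trials, sigma.size))
    data = X @ theta_true + noise
    # Row-wise version of theta_hat = P_theta X^T C^-1 D.
    estimates = (data * w) @ X @ P_theta
    return estimates, P_theta

x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones(x.size), x])
estimates, P_theta = simulate_estimates(X, np.array([1.0, 2.0]), np.full(x.size, 0.5))
empirical_cov = np.cov(estimates, rowvar=False)  # compare with P_theta
```

With enough trials, `empirical_cov` should approach `P_theta` to within sampling error.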

Consider data $D$ where each element is
$$d_i = y_i + n_i = a + b x_i + n_i,$$
i.e. a straight line. The design matrix is then a 2-column, $N$-row array and $\theta$ is a 2-element vector:
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix}.$$

Now suppose we do not know the form of the data but wish to find the best model. As the universe of models, consider just the pair:

M1: $y_i = \theta_1$ = constant
M2: $y_i = \theta_1 + \theta_2 x_i$ = line

For a given realization of a data set, we can estimate the parameters of each model. We then want to test how good each model is and test the models against each other. These are the questions of statistical inference:

1. How do we decide whether each model is a good fit or not?
2. Given that M1 is a subset of M2 when $\theta_2 = 0$, how do we gauge whether the extra parameter in M2 is warranted (demanded) by the data?
3. What are acceptable values for estimates of the parameters of the better model?

The answers to these questions are different for frequentist vs. Bayesian approaches.
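The two design matrices and the corresponding fits can be sketched as follows. This is a minimal example assuming NumPy and known per-point noise standard deviations; the helper names `design_matrix_constant`, `design_matrix_line`, and `fit` are hypothetical.

```python
import numpy as np

def design_matrix_constant(x):
    """M1: y_i = theta_1. One column of ones."""
    return np.ones((x.size, 1))

def design_matrix_line(x):
    """M2: y_i = theta_1 + theta_2 x_i. Columns [1, x_i]."""
    return np.column_stack([np.ones(x.size), x])

def fit(X, D, sigma):
    """Weighted least squares via the whitened system; returns (theta_hat, chi2_min)."""
    Xw = X / sigma[:, None]                 # divide each row by its sigma
    Dw = D / sigma
    theta_hat, *_ = np.linalg.lstsq(Xw, Dw, rcond=None)
    resid = Dw - Xw @ theta_hat
    return theta_hat, float(resid @ resid)

x = np.arange(5.0)
D = 1.0 + 2.0 * x                           # noise-free line, for illustration only
sigma = np.ones(x.size)
theta_1, chi2_1 = fit(design_matrix_constant(x), D, sigma)
theta_2, chi2_2 = fit(design_matrix_line(x), D, sigma)
```

Because M1 is nested in M2, `chi2_2` can never exceed `chi2_1`; the question posed above is whether the decrease is large enough to justify the extra parameter.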

Gaussian Noise Model

Let the noise be $N(0, C)$:
$$f_n(n) = \frac{1}{(2\pi)^{N/2} (\det C)^{1/2}}\, e^{-\frac{1}{2} n^{\dagger} C^{-1} n}$$

Note that the argument of the exponential is a quadratic form.

The likelihood function for the parameters is obtained by using the estimate $\hat n = D - y(\theta)$, which depends on a particular choice of the parameter vector $\theta$:
$$L(\theta) = \frac{1}{(2\pi)^{N/2} (\det C)^{1/2}}\, e^{-\frac{1}{2} \hat n^{\dagger} C^{-1} \hat n}$$

For this case and this case only, minimizing $\chi^2$ yields identical results to maximizing $L$:

Least-squares estimate = maximum likelihood estimate

Different situations are encountered:

1. The noise covariance matrix $C$ is known (in shape and element-by-element values).
2. The form of $C$ is known but the values are not.
3. Nothing is known about $C$ a priori.

In the example here we assume case 1, that $C$ is known.
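A short sketch of the Gaussian log-likelihood for case 1 (known, diagonal $C$) makes the link to $\chi^2$ explicit. The function name `log_likelihood` is hypothetical, and real-valued data are assumed.

```python
import numpy as np

def log_likelihood(theta, X, D, sigma):
    """ln L(theta) for Gaussian noise with known C = diag(sigma**2)."""
    resid = D - X @ theta                        # estimated noise vector
    chi2 = np.sum((resid / sigma) ** 2)          # n^T C^-1 n
    N = D.size
    # ln of 1 / [(2 pi)^(N/2) (det C)^(1/2)]; independent of theta
    log_norm = -0.5 * N * np.log(2.0 * np.pi) - np.sum(np.log(sigma))
    return log_norm - 0.5 * chi2
```

Since `log_norm` does not depend on `theta` when $C$ is known, the `theta` that maximizes `log_likelihood` is the one that minimizes `chi2`.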

Frequentist Approach

Testing a model: Calculate the minimum of the cost function, $\chi^2_{\min} = \chi^2(\hat\theta)$. For shorthand, we will call the minimum just $\chi^2$.

If the model is a good (if not perfect) match to the true underlying model, we expect that over an ensemble of realizations
$$\langle \chi^2 \rangle = N - M = \text{number of degrees of freedom}$$
$$\sigma_{\chi^2} = \sqrt{2(N - M)} \quad\Longrightarrow\quad \frac{\sigma_{\chi^2}}{\langle \chi^2 \rangle} = \left( \frac{2}{N - M} \right)^{1/2}.$$

It is the number of degrees of freedom, $N - M$, that matters ($N$ and $M$ could both be large).

Note that $\chi^2(\theta)$ varies quadratically in $\delta\theta = \theta - \hat\theta$. Why?

Reduced chi-square: $\chi^2_r = \chi^2/(N - M)$. A good model has $\langle \chi^2_r \rangle = 1$ and $\sigma_{\chi^2_r} = [2/(N - M)]^{1/2}$.

A model can be assessed by calculating the probability that the estimated $\chi^2_r$ is statistically consistent with 1 given $N - M$. See §7.2 of Gregory.
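One common way to carry out this assessment is to compute the tail probability of the $\chi^2$ distribution with $N - M$ degrees of freedom. The sketch below assumes SciPy is available and uses `scipy.stats.chi2.sf`; the function name `chi2_goodness_of_fit` is hypothetical, and the one-sided tail probability is one choice of test statistic, not necessarily the one used in Gregory.

```python
import numpy as np
from scipy import stats

def chi2_goodness_of_fit(chi2_min, N, M):
    """Return (reduced chi2, expected sigma of reduced chi2, tail probability)."""
    dof = N - M                                  # degrees of freedom
    chi2_reduced = chi2_min / dof
    sigma_reduced = np.sqrt(2.0 / dof)           # expected scatter about 1
    # Probability of a chi2 at least this large if the model is correct.
    p_value = stats.chi2.sf(chi2_min, dof)
    return chi2_reduced, sigma_reduced, p_value
```

A very small `p_value` suggests the model underfits (or the noise is underestimated); a `chi2_reduced` far below 1 suggests overfitting or overestimated noise.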

Frequentist Approach (continued)

Parameter estimation errors: Use $P_\theta$ to obtain the variance of each parameter (diagonal elements) and the correlations between parameters (off-diagonal elements). The interpretation of these errors is that they are the variations in the parameter estimates expected from different realizations of the data chosen from an ensemble.

Model comparison with the F test: While models can be individually assessed as above, they can also be compared by using the quantity
$$F_{12} = \frac{\chi^2_{r1}}{\chi^2_{r2}} = \frac{\chi^2_1/(N - M_1)}{\chi^2_2/(N - M_2)}.$$

The PDF of $F_{12}$ has an analytical form (see §6.5, Equation 6.36 of Gregory). One can assess whether one model is better than another by calculating the probability that the value of $F_{12}$ would be obtained by chance if the reduced $\chi^2$ were the same in both models. Generally one would say that model M2 is preferred over M1 if the probability of obtaining $F_{12}$ is smaller than some selected amount, like 5% or 1%.
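The F-test comparison can be sketched with `scipy.stats.f`. This follows the ratio of reduced chi-squares defined above and assumes it is referred to an F distribution with $(N - M_1, N - M_2)$ degrees of freedom; the function names `f_test` and `f_critical` are hypothetical.

```python
from scipy import stats

def f_test(chi2_1, M1, chi2_2, M2, N):
    """Return (F12, probability of an F at least this large by chance)."""
    dof1, dof2 = N - M1, N - M2
    F12 = (chi2_1 / dof1) / (chi2_2 / dof2)      # ratio of reduced chi-squares
    p_value = stats.f.sf(F12, dof1, dof2)        # upper-tail probability
    return F12, p_value

def f_critical(level, dof1, dof2):
    """F value below which a fraction `level` of the distribution lies."""
    return stats.f.ppf(level, dof1, dof2)

# Example: N = 10 points, constant (M1 = 1) vs. line (M2 = 2).
F12, p = f_test(chi2_1=45.0, M1=1, chi2_2=8.0, M2=2, N=10)
F95 = f_critical(0.95, 9, 8)
F99 = f_critical(0.99, 9, 8)
```

M2 would be preferred at the 5% level if `F12 > F95`, or equivalently if `p < 0.05`. The chi-square values in the example are made up for illustration.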

[Figure: Two examples of F distributions and the locations of the 95% and 99% probability areas (F95 and F99). Panels show the PDF and CDF of F versus F for N_dof1 = 9, N_dof2 = 8 and for N_dof1 = 99, N_dof2 = 98.]

Bayesian Approach

For the Bayesian approach we calculate the likelihood function $L(\theta)$ and multiply by a prior for the parameters to get the posterior PDF:
$$P(\theta \mid D I) = \frac{P(\theta \mid I)\, L(\theta)}{\int d\theta\, P(\theta \mid I)\, L(\theta)}$$

From the posterior PDF we obtain the $M$-dimensional PDF for the parameters. This has a different interpretation than the frequentist approach: the posterior PDF expresses our uncertainty about the parameters for a specific data set, given background and prior information.

We compare models using the odds ratio (§3.5 of Gregory and notes on website, Bayesian Model Comparison). Here we assume that the two models have equal priors (we have no a priori preference) and thus concentrate on the Bayes factor
$$B_{21} = \frac{L(M_2)}{L(M_1)} = \text{ratio of global likelihoods of } M_2 \text{ and } M_1.$$

The global likelihoods are just the denominators in the posterior PDFs, so
$$B_{21} = \frac{L(M_2)}{L(M_1)} = \frac{\int d\theta\, P(\theta \mid M_2)\, L(\theta \mid D, M_2)}{\int d\theta\, P(\theta \mid M_1)\, L(\theta \mid D, M_1)}.$$

Use flat priors with width $\Delta a_1$ for M1 and $\Delta a_2\, \Delta b_2$ for M2. Also assume the likelihood functions are narrower than the priors and have widths $\delta a_1$ for M1 and $\delta a_2\, \delta b_2$ for M2. Then
$$B_{21} \approx \frac{L(\hat\theta_2 \mid D, M_2)}{L(\hat\theta_1 \mid D, M_1)} \left( \frac{\delta a_2\, \delta b_2}{\Delta a_2\, \Delta b_2} \right) \left( \frac{\Delta a_1}{\delta a_1} \right).$$

For M2 to be superior to M1 (higher odds ratio), its maximum likelihood has to exceed that of M1 by enough to offset the penalty of the extra parameter, which enters through the larger prior volume in parameter space.

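Bayesian Approach (continued)

For the particular M1 and M2 models (constant vs. line fits), the ratio of likelihoods is just
$$\frac{L(\hat\theta_2 \mid D, M_2)}{L(\hat\theta_1 \mid D, M_1)} = \frac{e^{-\chi_2^2/2}}{e^{-\chi_1^2/2}} = e^{(\chi_1^2 - \chi_2^2)/2} = e^{\Delta\chi^2/2}, \qquad \Delta\chi^2 \equiv \chi_1^2 - \chi_2^2,$$
and the Bayes factor becomes
$$B_{21} \approx e^{\Delta\chi^2/2} \left( \frac{\delta a_2\, \delta b_2}{\Delta a_2\, \Delta b_2} \right) \left( \frac{\Delta a_1}{\delta a_1} \right).$$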
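The Bayes factor for the constant vs. line comparison can be sketched numerically. The notes leave the likelihood widths unspecified; the sketch below makes the assumption (not stated in the source) that the likelihood volume for each model is the Gaussian integral $(2\pi)^{M/2} (\det P_\theta)^{1/2}$, and that the flat priors fully contain the likelihood peak. The function name `log_global_likelihood` and the prior widths in the example are illustrative.

```python
import numpy as np

def log_global_likelihood(X, D, sigma, prior_widths):
    """ln of the global likelihood for a linear model with flat priors.

    The likelihood normalization common to both models is dropped,
    since it cancels in the Bayes factor when C is known.
    """
    w = 1.0 / sigma**2
    A = X.T @ (w[:, None] * X)                      # X^T C^-1 X = P_theta^-1
    theta_hat = np.linalg.solve(A, X.T @ (w * D))
    resid = D - X @ theta_hat
    chi2_min = float(resid @ (w * resid))
    M = X.shape[1]
    _, logdet_A = np.linalg.slogdet(A)
    # Assumed likelihood volume: (2 pi)^(M/2) sqrt(det P_theta)
    log_like_volume = 0.5 * M * np.log(2.0 * np.pi) - 0.5 * logdet_A
    log_prior_volume = np.sum(np.log(prior_widths))  # product of flat-prior widths
    return -0.5 * chi2_min + log_like_volume - log_prior_volume

x = np.linspace(0.0, 1.0, 20)
sigma = np.full(x.size, 0.1)
X1 = np.ones((x.size, 1))                            # M1: constant
X2 = np.column_stack([np.ones(x.size), x])           # M2: line
D = 1.0 + 2.0 * x                                    # noise-free line, for illustration
ln_B21 = (log_global_likelihood(X2, D, sigma, np.array([10.0, 10.0]))
          - log_global_likelihood(X1, D, sigma, np.array([10.0])))
```

A positive `ln_B21` favors the line. For data with no slope, the two minimum chi-squares are nearly equal and the volume ratio alone makes `ln_B21` negative, which is the penalty for the extra parameter.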


A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Lecture 12 Applications: Model comparison Some Least-squares lessons