A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University


Lecture 19: Modeling

Topics plan:
  Modeling (linear/non-linear least squares)
  Bayesian inference
  Bayesian approaches to spectral estimation; also prewhitening methods
  Optimization methods (needed for posterior PDFs, Bayes factors)

Reading: Ch 10, 11 and 3 in Gregory
For next week: Assignment 3a = group Bayesian project
References: Webpage:

ASTRONOMY 6523, Spring 2017, Problem Set 3a

This is a class project in its initial setup stages, but each of you should do the numerical part individually. Here is the assignment in general terms:

1. Assemble a data set.
2. Pose two (or more!) hypotheses (models) for the data.
3. Set up the Bayesian inference problem for each model, i.e. set up the posterior PDF using reasonable prior PDFs for each model.
4. Compare models using the Bayesian odds ratio.

The method you should use is described in Chapter 3 of Gregory, in particular Section 3.5 on model comparison.

The particulars: Construct a data set comprising birthdays of your peers. Express these as a day number of the year. If you like, you could express this as a phase in the interval [0, 1]. (Don't get hung up on leap days.) I recommend that each of you get the birthdays of 10 of your family and friends, etc. Then pool them together into an aggregate data set.

Given the data set, compare two hypotheses about the occurrence of birthdays. As just an example, you might have these two hypotheses:

i) H1: birthdays occur uniformly through the year.
ii) H2: birthdays occur according to a PDF that is a constant + sinusoidal part; this is a three-parameter model (constant, amplitude and phase of the sinusoid), which can be challenging. You may want to come up with some two-parameter distribution.

You should begin your analysis by calculating, plotting and interpreting (in words) a histogram of the birthdays. You will have to choose a suitable binning interval.
You should also calculate, plot and analyze the CDF of the data. For H2, calculate the marginalized PDFs and the 95% confidence interval for each parameter. Compare the two hypotheses by calculating the odds ratio

$$O_{12} = \frac{P(H_1 \mid D, I)}{P(H_2 \mid D, I)}.$$

Which hypothesis is favored by the data?
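The odds-ratio step can be sketched numerically. The block below is only an illustration under stated assumptions, not the assignment's prescribed solution: it assumes H2 has the normalized form p(phi) = 1 + a cos(2 pi (phi - phi0)) on [0, 1) (so the constant is fixed by normalization and two parameters remain), flat priors on a in [0, 1] and phi0 in [0, 1), equal prior odds for H1 and H2, and a simple midpoint-rule grid for the marginalization. The function names and the day numbers are made up for the example.

```python
import math

def log_like_h2(phases, a, phi0):
    """Log-likelihood under the assumed PDF p(phi) = 1 + a*cos(2*pi*(phi - phi0))."""
    return sum(math.log1p(a * math.cos(2.0 * math.pi * (p - phi0))) for p in phases)

def evidence_h2(phases, n_a=50, n_phi=100):
    """Marginal likelihood P(D|H2) with flat priors on a and phi0 (midpoint rule).

    Midpoints keep a strictly below 1, so the PDF never reaches zero.
    """
    total = 0.0
    for i in range(n_a):
        a = (i + 0.5) / n_a
        for j in range(n_phi):
            phi0 = (j + 0.5) / n_phi
            total += math.exp(log_like_h2(phases, a, phi0))
    return total / (n_a * n_phi)

def odds_ratio(phases):
    """O12 = P(D|H1) / P(D|H2), assuming equal prior odds.

    For H1 (uniform on [0, 1)) the likelihood of any data set is exactly 1.
    """
    return 1.0 / evidence_h2(phases)

# Made-up day numbers, for illustration only (not real birthdays).
example_days = [12, 45, 78, 101, 150, 163, 200, 244, 290, 331]
example_phases = [d / 365.0 for d in example_days]
print("O12 =", odds_ratio(example_phases))
```

Values of O12 above 1 favor the uniform hypothesis; values below 1 favor the sinusoidal one. A finer grid or a different prior on the amplitude would change the number, which is part of what the assignment asks you to think about.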

Linear Least Squares

Least Squares Fitting

Consider a general model for an observable quantity:

  data (n data points) = model for observable quantity (theory, k parameters) + additive errors (PDF possibly known)

For least squares, we need to know just the 1st and 2nd moments of the error PDF, but for maximum likelihood analysis we need to know the PDF itself. Symbolically,

$$y_i = \sum_{j=1}^{k} \theta_j X_{ij} + \epsilon_i, \quad i = 1, \ldots, n.$$

In vector notation:

$$y = X\theta + \epsilon,$$

where

  $y$ = n × 1 vector
  $X$ = n × k matrix
  $\theta$ = k × 1 vector
  $\epsilon$ = n × 1 vector

Excluding the noise part, $y = X\theta$ is a set of n equations in k unknowns.

Nomenclature: $X$ is the design matrix; $\theta$ is the parameter vector.

Example: A parabolic model for a time series $y_i$ with errors $\epsilon_i$:

$$y_i = \theta_1 + \theta_2 t_i + \theta_3 t_i^2 + \epsilon_i \quad \text{(parabola)}.$$

The model is linear in the parameters even though it is nonlinear in the independent variable, $t_i$. The design matrix and parameter vector are:

$$X = \begin{pmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_n & t_n^2 \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix}.$$

Solution 1: No Errors

Suppose there are no errors. Then we must solve $y = X\theta$, a general class of matrix problems. The circumstances in which there are solutions to the problem depend on the rank of the matrix $X$ and on the rank of the augmented matrix $[X \; y]$.

If $X$ is square with non-zero determinant, then the solution is simply $\theta = X^{-1} y$.

Usually, however, we (better) have more data points than parameters ($n > k$ or $n \gg k$). Then $X$ is rectangular. For rectangular matrices we cannot find the simple inverse of $X$. But if $\det(X^\dagger X) \neq 0$, then $X^\dagger X$ has an inverse (here $\dagger$ denotes the transpose).
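A short numerical sketch of the parabolic design matrix may help make the rank conditions concrete. The sample times and parameter values below are hypothetical, chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical sample times and parameter values, chosen only for illustration.
t = np.linspace(0.0, 4.0, 9)
theta_true = np.array([1.0, -2.0, 0.5])

# Columns are 1, t, t^2: linear in the parameters, quadratic in t.
X = np.vander(t, N=3, increasing=True)
y = X @ theta_true  # noiseless data

rank_X = np.linalg.matrix_rank(X)
rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))
det_XtX = np.linalg.det(X.T @ X)

print("shape of X:", X.shape)
print("rank X, rank [X y]:", rank_X, rank_aug)
print("det(X^T X):", det_XtX)
```

With distinct sample times the three columns are independent, so X has full column rank and X^T X is invertible even though X itself is not square.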

Therefore the solution to $y = X\theta$ is found by premultiplying by $X^\dagger$,

$$X^\dagger y = X^\dagger X \theta,$$

and then multiplying by the inverse of $X^\dagger X$,

$$(X^\dagger X)^{-1} X^\dagger y = (X^\dagger X)^{-1} (X^\dagger X)\, \theta = \theta,$$

which yields the unique solution

$$\theta = (X^\dagger X)^{-1} X^\dagger y.$$

Solution 2: The case with measurement errors

The actual problem is

$$y = X\theta + \epsilon \equiv \hat{y} + \epsilon.$$

The errors break down the uniqueness of the solution. There is possibly an infinite number of solutions for $\theta$. In some sense we want the best estimate, according to criteria we have already discussed.

Typical situation: $n \gg k$. There are many more data points than parameters, so the problem is solvable.

We now obtain an estimate $\hat\theta$ for $\theta$ based on least squares. By estimating $\theta$ we are also, in effect, estimating the errors, $\epsilon = y - X\theta$. We want to minimize the errors in a statistical sense.

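Therefore, we minimize the inner product (a scalar)

$$Q(\theta) = \sum_{j=1}^{n} \epsilon_j^2 \equiv \sum_{j=1}^{n} (y - \hat{y})_j^2.$$

We can write this as a quadratic form using the identity matrix $I$:

$$Q(\theta) = \epsilon^\dagger I \epsilon.$$

We have put the identity matrix into the equation. We can put any Toeplitz matrix here to get a general quadratic form, and we will see later (for weighted least squares) that the covariance matrix appears in this form through its inverse.

Solution:

$$\begin{aligned}
Q(\theta) &= (y - X\theta)^\dagger (y - X\theta) \\
&= [\,y^\dagger - (X\theta)^\dagger\,](y - X\theta) \\
&= y^\dagger y - (X\theta)^\dagger y - y^\dagger X\theta + (X\theta)^\dagger X\theta \\
&= y^\dagger y - \theta^\dagger X^\dagger y - y^\dagger X\theta + \theta^\dagger X^\dagger X\theta.
\end{aligned}$$

Note that $(y^\dagger X\theta)^\dagger = \theta^\dagger X^\dagger y$ and that the transpose of a scalar is the same scalar. Thus

$$Q(\theta) = y^\dagger y - 2\,\theta^\dagger X^\dagger y + \theta^\dagger (X^\dagger X)\,\theta.$$

Now minimize with respect to $\theta$ to get the estimator $\hat\theta$:

$$\frac{dQ}{d\theta} = \left(\text{vector of } \frac{dQ}{d\theta_i},\; i = 1, \ldots, k\right) = \text{gradient}.$$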

Thus

$$\left.\frac{dQ}{d\theta}\right|_{\theta = \hat\theta} = 0 \;\Rightarrow\; -2 X^\dagger y + 2 (X^\dagger X)\,\hat\theta = 0 \;\Rightarrow\; (X^\dagger X)\,\hat\theta = X^\dagger y$$

and, if the inverse exists,

$$\hat\theta = (X^\dagger X)^{-1} X^\dagger y.$$

This solution is the same answer as for the error-less case. But here, we have an estimate that is not unique; it is one among many models that may be consistent with the data; it just happens to be the one with the minimum least-squares error over an ensemble. For a given data set (a specific realization), the best parameter set may differ from the one that gives the least-squares error.

Notes: when the errors have Gaussian statistics (independent, with equal variances), the least-squares solution is identical to the maximum-likelihood solution.

We can also write the equation to be solved as the normal equations, where normal here means orthogonal:

$$X^\dagger (y - X\hat\theta) = X^\dagger (y - \hat{y}) \equiv X^\dagger R = 0, \quad \text{equivalently} \quad R^\dagger X = 0 \;\; \text{(inner product)},$$

where $R = y - \hat{y}$ are the residuals. Thus the residuals $R$, which are the errors in estimating the data, are orthogonal to the columns of $X$.
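The normal equations and the orthogonality of the residuals can be checked numerically. This is a minimal sketch with simulated data: the parabola coefficients, noise level and random seed are assumptions made for the example, and the comparison against `np.linalg.lstsq` is included only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the sketch is repeatable
t = np.linspace(0.0, 4.0, 50)
X = np.vander(t, N=3, increasing=True)       # columns 1, t, t^2
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical parameter values
sigma = 0.3                                  # hypothetical noise rms
y = X @ theta_true + rng.normal(0.0, sigma, size=t.size)

# Solve the normal equations (X^T X) theta_hat = X^T y without forming the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals should be orthogonal to every column of X.
residuals = y - X @ theta_hat
orthogonality = X.T @ residuals

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("theta_hat:", theta_hat)
print("X^T R:", orthogonality)
```

Solving the linear system directly is generally better conditioned than computing the explicit inverse of X^T X, although both express the same estimator.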

Matrices:

Matrix rank = dimension of the largest square submatrix with determinant $\neq 0$. (An $n \times n$ matrix with det = 0 has rank $< n$ and is said to be singular.)

Augmented matrix: $[X \; y]$

Let $r = \text{rank}\, X$ and $r_{\rm aug} = \text{rank}\, [X \; y]$. Then if

i. $r_{\rm aug} > r$: no solution (we will assume we never have this case)
ii. $r_{\rm aug} = r = k$ = number of unknowns ($\theta$): one solution
iii. $r_{\rm aug} = r < k$: can solve for $r$ unknowns after assigning arbitrary values to $k - r$ of the unknowns

More specifically, we can have

1. $n < k$: rank of $X = r \le n < k$, so there is an infinite number of solutions (not enough equations for the number of unknowns).
2. $n = k$ (square matrix $X$): now $r = k$ is a possibility. If $\det X \neq 0$, then $X^{-1}$ exists and there is a unique solution $\theta = X^{-1} y$.
3. $n > k$ (more data points than parameters): $r = \text{rank of } X \le k$. Here $r = k$ is again possible, but now $X$ does not have an inverse because it is not square. However, the matrix $X^\dagger X$ = $(k \times n)(n \times k)$ = $k \times k$ matrix (square), where $\dagger$ simply means transpose, has an inverse if $\det(X^\dagger X) \neq 0$.

Derivatives: We use the results

$$\frac{d}{db}\,(b^\dagger c) = c \quad \text{and} \quad \frac{d}{db}\,(b^\dagger A b) = 2Ab \;\; \text{if } A = A^\dagger.$$
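The three rank cases listed above can be turned into a small classifier. This is a sketch only; the function name `classify_system` and the example matrices are made up for illustration, and the rank is computed with NumPy's default tolerance.

```python
import numpy as np

def classify_system(X, y):
    """Classify y = X theta using the ranks of X and the augmented matrix [X y]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]                                   # number of unknowns
    r = np.linalg.matrix_rank(X)
    r_aug = np.linalg.matrix_rank(np.column_stack([X, y]))
    if r_aug > r:
        return "no solution"                         # case i
    if r == k:
        return "one solution"                        # case ii
    return "infinitely many solutions"               # case iii

# Three equations, two unknowns, consistent right-hand side.
print(classify_system([[1, 0], [0, 1], [1, 1]], [1, 2, 3]))
# Same matrix, inconsistent right-hand side.
print(classify_system([[1, 0], [0, 1], [1, 1]], [1, 2, 4]))
# Rank-deficient square matrix with a consistent right-hand side.
print(classify_system([[1, 1], [2, 2]], [1, 2]))
```

With noisy data the augmented matrix almost always has the larger rank, which is why the exact-solution classification gives way to least squares in that case.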

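Linear Least Squares: Parameter Errors

First consider the special case where the errors in the data, $\epsilon$, have a diagonal covariance matrix with all diagonal entries equal:

$$\langle \epsilon \epsilon^\dagger \rangle = \sigma^2 I,$$

where $I$ is the identity matrix. [Note that $\epsilon$ is an $n \times 1$ matrix, so the covariance matrix $\langle \epsilon \epsilon^\dagger \rangle$ is an $n \times n$ matrix.]

We want to find the estimation errors in the parameters $\hat\theta$. Let

$$P \equiv \langle (\hat\theta - \theta)(\hat\theta - \theta)^\dagger \rangle,$$

which is a $k \times k$ matrix of correlation values between the different parameters; i.e. this is the covariance matrix of the parameters.

We have $\hat\theta = (X^\dagger X)^{-1} X^\dagger y$, as before. Substituting $y = X\theta + \epsilon$ we find that

$$\hat\theta = \theta + (X^\dagger X)^{-1} X^\dagger \epsilon.$$

Defining $B \equiv X^\dagger X$, we find that

$$\hat\theta - \theta = B^{-1} X^\dagger \epsilon.$$

Therefore the covariance matrix of the parameters is

$$\begin{aligned}
P &= \langle (B^{-1} X^\dagger \epsilon)(B^{-1} X^\dagger \epsilon)^\dagger \rangle \\
&= \langle (B^{-1} X^\dagger \epsilon)(\epsilon^\dagger X B^{-1}) \rangle \\
&= B^{-1} X^\dagger \langle \epsilon \epsilon^\dagger \rangle X B^{-1} \\
&= B^{-1} X^\dagger \, \sigma^2 I \, X B^{-1} \\
&= \sigma^2 B^{-1} = \sigma^2 (X^\dagger X)^{-1},
\end{aligned}$$

where we have used $B^\dagger \equiv B$, which implies $(B^{-1})^\dagger \equiv B^{-1}$. Thus,

$$P = \sigma^2 (X^\dagger X)^{-1}.$$

The error in each parameter with respect to the true parameter value is

$$\sigma_{\theta_j} \equiv \sqrt{P_{jj}} = \sigma \left[ (X^\dagger X)^{-1} \right]_{jj}^{1/2}$$

and the correlation coefficient between two parameters is

$$\rho_{\theta_j \theta_k} \equiv \frac{P_{jk}}{\sigma_{\theta_j} \sigma_{\theta_k}} = \frac{P_{jk}}{\sqrt{P_{jj} P_{kk}}}.$$

In the ideal case, the parameters would be uncorrelated, so $\rho_{\theta_j \theta_k} = 0$ for $j \neq k$.

Modeling Examples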
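As a numerical illustration of the parameter covariance matrix derived above, the sketch below evaluates P = sigma^2 (X^T X)^{-1}, the parameter errors and the correlation coefficients for a parabolic design matrix. The helper names, sample times and noise level are assumptions for the example, and the errors are taken to be independent with equal variance, as in the special case treated in the text.

```python
import numpy as np

def parameter_covariance(X, sigma):
    """P = sigma^2 (X^T X)^{-1} for independent errors of equal variance sigma^2."""
    return sigma**2 * np.linalg.inv(X.T @ X)

def errors_and_correlation(P):
    """Return the 1-sigma parameter errors and the matrix of correlation coefficients."""
    errs = np.sqrt(np.diag(P))
    rho = P / np.outer(errs, errs)
    return errs, rho

# Hypothetical sampling and noise level.
t = np.linspace(0.0, 4.0, 9)
X = np.vander(t, N=3, increasing=True)   # columns 1, t, t^2
P = parameter_covariance(X, sigma=0.5)
errs, rho = errors_and_correlation(P)

print("parameter errors:", errs)
print("correlation matrix:\n", rho)
```

The off-diagonal entries of `rho` are generally non-zero for a plain polynomial basis, which is the motivation for reparameterizing with orthogonal polynomials.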

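Least Squares Examples

I. A polynomial model for a time series $y_i$ with errors $\epsilon_i$:

$$y_i = \sum_{j=1}^{k} X_{ij}\theta_j + \epsilon_i = \sum_{j=1}^{k} t_i^{\,j-1} \theta_j + \epsilon_i.$$

The model is linear in the parameters even though it is nonlinear in the independent variable, $t_i$. If the polynomial order is $p$, then $k = p + 1$ and the design matrix and parameter vector are:

$$X = \begin{pmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^p \\ 1 & t_2 & t_2^2 & \cdots & t_2^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^p \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{p+1} \end{pmatrix}.$$

Define

$$T_k = \sum_{j=1}^{n} t_j^k \quad \text{and} \quad \overline{t^k y} = \frac{1}{n} \sum_{i=1}^{n} t_i^k y_i.$$

Then the product of the design matrix with itself is the $k \times k = (p+1) \times (p+1)$ matrix

$$X^\dagger X = \begin{pmatrix} T_0 & T_1 & T_2 & \cdots & T_p \\ T_1 & T_2 & T_3 & \cdots & T_{p+1} \\ T_2 & T_3 & T_4 & \cdots & T_{p+2} \\ \vdots & \vdots & \vdots & & \vdots \\ T_p & T_{p+1} & T_{p+2} & \cdots & T_{2p} \end{pmatrix}$$

and

$$X^\dagger y = X^\dagger \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} t_i y_i \\ \vdots \\ \sum_{i=1}^{n} t_i^p y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{t y} \\ \vdots \\ \overline{t^p y} \end{pmatrix}.$$

The least-squares solution $\hat\theta = (X^\dagger X)^{-1} X^\dagger y$ requires the inverse of $X^\dagger X$, which exists if the determinant is nonzero.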

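First-order polynomial: something we can easily solve.

$$y_i = \theta_1 + \theta_2 t_i + \epsilon_i$$

$$X^\dagger X = \begin{pmatrix} T_0 & T_1 \\ T_1 & T_2 \end{pmatrix}$$

$$(X^\dagger X)^{-1} = \frac{1}{\det(X^\dagger X)} \times (\text{matrix of cofactors}) = \frac{1}{T_0 T_2 - T_1^2} \begin{pmatrix} T_2 & -T_1 \\ -T_1 & T_0 \end{pmatrix}$$

Then, using $T_0 = n$,

$$\begin{aligned}
\hat\theta = (X^\dagger X)^{-1} X^\dagger y
&= \frac{n}{T_0 T_2 - T_1^2} \begin{pmatrix} T_2 & -T_1 \\ -T_1 & T_0 \end{pmatrix} \begin{pmatrix} \overline{y} \\ \overline{ty} \end{pmatrix} \\
&= \frac{n}{T_0 T_2 - T_1^2} \begin{pmatrix} \overline{y}\, T_2 - \overline{ty}\, T_1 \\ -\overline{y}\, T_1 + \overline{ty}\, T_0 \end{pmatrix} \\
&= \frac{n}{n T_2 - T_1^2} \begin{pmatrix} \overline{y}\, T_2 - \overline{ty}\, T_1 \\ -\overline{y}\, T_1 + n\, \overline{ty} \end{pmatrix}.
\end{aligned}$$

So the individual parameters are

$$\hat\theta_1 = \frac{n\,(\overline{y}\, T_2 - \overline{ty}\, T_1)}{n T_2 - T_1^2} \quad \text{and} \quad \hat\theta_2 = \frac{n\,(n\, \overline{ty} - \overline{y}\, T_1)}{n T_2 - T_1^2}.$$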
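The closed-form expressions for the straight-line fit translate directly into code. The sketch below uses only the Python standard library; the function name `fit_line` and the test line y = 2 + 3t are hypothetical, and the averages carry the factor 1/n as in the definition of the barred quantities.

```python
def fit_line(t, y):
    """Least-squares intercept and slope from the sums T_k and the averages y_bar, ty_bar."""
    n = len(t)
    T1 = sum(t)
    T2 = sum(ti * ti for ti in t)
    y_bar = sum(y) / n
    ty_bar = sum(ti * yi for ti, yi in zip(t, y)) / n
    denom = n * T2 - T1 * T1            # determinant of X^T X, with T0 = n
    theta1 = n * (y_bar * T2 - ty_bar * T1) / denom
    theta2 = n * (n * ty_bar - y_bar * T1) / denom
    return theta1, theta2

# Noiseless, hypothetical line y = 2 + 3 t.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * ti for ti in t]
intercept, slope = fit_line(t, y)
print(intercept, slope)
```

With noiseless input the fit returns the generating intercept and slope up to rounding, which is a quick way to check the signs in the formulas.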

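Assuming the errors $\epsilon_i$ are stationary and statistically independent with variance $\langle \epsilon_i^2 \rangle = \sigma^2$, the covariance matrix of the parameters is

$$P = \sigma^2 (X^\dagger X)^{-1} = \frac{\sigma^2}{n T_2 - T_1^2} \begin{pmatrix} T_2 & -T_1 \\ -T_1 & n \end{pmatrix},$$

where

$$\sigma_{\theta_1} = \sigma \left( \frac{T_2}{n T_2 - T_1^2} \right)^{1/2}, \qquad \sigma_{\theta_2} = \sigma \left( \frac{n}{n T_2 - T_1^2} \right)^{1/2}, \qquad \rho_{\theta_1 \theta_2} = -\frac{T_1}{\sqrt{n T_2}} \;\; \text{(negatively correlated)}.$$

For $n \gg 1$ and uniform sampling $t_i = i$, $i = 1, \ldots, n$, we have $T_1 \approx n^2/2$ and $T_2 \approx n^3/3$, so

$$\rho_{\theta_1 \theta_2} \approx -\frac{n^2/2}{\sqrt{n \cdot n^3/3}} = -\frac{\sqrt{3}}{2} \approx -0.87 \quad \text{(highly anticorrelated)}.$$

The anticorrelation means that any error in one parameter is compensated by the error in the other.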
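The large-n limit of the correlation coefficient can be checked with the exact sums for t_i = 1, ..., n. This is a small standard-library sketch; the function name is made up for the example.

```python
import math

def line_fit_correlation(n):
    """rho = -T1 / sqrt(n * T2) for uniform sampling t_i = 1, ..., n."""
    T1 = n * (n + 1) / 2                 # sum of i
    T2 = n * (n + 1) * (2 * n + 1) / 6   # sum of i^2
    return -T1 / math.sqrt(n * T2)

limit = -math.sqrt(3.0) / 2.0            # asymptotic value for n >> 1
for n in (10, 100, 10_000):
    print(n, line_fit_correlation(n), limit)
```

The correlation depends only on where the samples sit relative to the origin of t, not on the noise level, which is why shifting the time origin can remove it.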

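Better parameterization for the first-order polynomial: orthogonal polynomials. E.g.

$$y_i = \theta_1 + \theta_2 (t_i - \bar{t}) + \epsilon_i, \quad \text{where} \quad \bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i.$$

Now the design matrix and the various products are (with $T_k$ computed from $t_i - \bar{t}$, so that $T_1 = 0$)

$$X = \begin{pmatrix} 1 & t_1 - \bar{t} \\ 1 & t_2 - \bar{t} \\ \vdots & \vdots \\ 1 & t_n - \bar{t} \end{pmatrix}, \qquad X^\dagger X = \begin{pmatrix} T_0 & T_1 \\ T_1 & T_2 \end{pmatrix} = \begin{pmatrix} T_0 & 0 \\ 0 & T_2 \end{pmatrix}, \qquad (X^\dagger X)^{-1} = \frac{1}{T_0 T_2} \begin{pmatrix} T_2 & 0 \\ 0 & T_0 \end{pmatrix},$$

and the solution is now

$$\hat\theta_1 = \overline{y}, \qquad \hat\theta_2 = \frac{n\, \overline{(t - \bar{t})\, y}}{T_2}.$$

The errors on $\theta_{1,2}$ follow from the same expression $P = \sigma^2 (X^\dagger X)^{-1}$, which is now diagonal, so the parameters are uncorrelated: $\rho_{\theta_1 \theta_2} = 0$.

II. Sinusoids

Consider the linear model $y = X\theta + \epsilon$, where $X$ comprises complex exponentials and the parameter vector comprises Fourier amplitudes:

$$X_{nm} = e^{2\pi i n m / N} \equiv W_N^{nm}, \quad n = 0, \ldots, N-1, \quad m = 0, \ldots, k-1,$$

where $W_N \equiv e^{2\pi i / N}$ is the $N$th root of 1 on the unit circle. For $k = N$ we have

$$X = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & W_N & W_N^2 & \cdots & W_N^{N-1} \\ 1 & W_N^2 & W_N^4 & \cdots & W_N^{2(N-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & W_N^{N-1} & W_N^{2(N-1)} & \cdots & W_N^{(N-1)^2} \end{pmatrix}.$$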

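The product matrix is (with $\dagger$ denoting the conjugate transpose, since $X$ is complex)

$$X^\dagger X = \begin{pmatrix} s_0 & s_1 & \cdots & s_{N-1} \\ s_{-1} & s_0 & \cdots & s_{N-2} \\ s_{-2} & s_{-1} & \cdots & s_{N-3} \\ \vdots & \vdots & & \vdots \\ s_{-(N-1)} & s_{-(N-2)} & \cdots & s_0 \end{pmatrix}, \quad \text{where} \quad s_p \equiv \sum_{j=0}^{N-1} W_N^{pj}.$$

The off-diagonal terms all sum to zero because the sums are over integer multiples of the periods of the complex sinusoids. Therefore

$$X^\dagger X = N I \quad \text{and} \quad (X^\dagger X)^{-1} = N^{-1} I.$$

We also have

$$X^\dagger y = \begin{pmatrix} \sum_{j=0}^{N-1} y_j \\ \sum_{j=0}^{N-1} W_N^{-j} y_j \\ \sum_{j=0}^{N-1} W_N^{-2j} y_j \\ \vdots \\ \sum_{j=0}^{N-1} W_N^{-(N-1)j} y_j \end{pmatrix}.$$

The least-squares coefficients are then

$$\hat\theta = (X^\dagger X)^{-1} X^\dagger y = N^{-1} X^\dagger y,$$

which is just the DFT of $y$ expressed in vector form. The parameter covariance matrix is

$$P = \sigma^2 (X^\dagger X)^{-1} = N^{-1} \sigma^2 I.$$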
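The identification of the least-squares coefficients with the DFT can be verified numerically. The sketch below assumes the sign convention X_nm = exp(+2 pi i n m / N), so that the conjugate transpose carries the negative exponent used by `np.fft.fft`; N = 8 and the random data are arbitrary choices for the example.

```python
import numpy as np

N = 8
n = np.arange(N)
X = np.exp(2j * np.pi * np.outer(n, n) / N)   # X_nm = W_N^{nm}, with k = N columns

# X^dagger X should equal N times the identity matrix.
gram = X.conj().T @ X

rng = np.random.default_rng(1)
y = rng.normal(size=N)                        # arbitrary synthetic data

# Least-squares Fourier amplitudes: theta_hat = N^{-1} X^dagger y.
theta_hat = X.conj().T @ y / N

# np.fft.fft uses exp(-2 pi i m j / N) with no 1/N factor on the forward transform.
theta_fft = np.fft.fft(y) / N

print(np.allclose(gram, N * np.eye(N)), np.allclose(theta_hat, theta_fft))
```

Because k = N here, the model has as many parameters as data points and reproduces the data exactly; with k < N the same formula gives the least-squares amplitudes of the retained harmonics.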

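Example of a bad model: Consider

$$y_i = \theta_1 + \theta_2 x_i + a \sin(\omega_3 x_i) \;\xrightarrow{\text{linearize}}\; \theta_1 + \theta_2 x_i + a\, x_i \cos(\omega_3 x_i)\, \delta\omega_3,$$

where the parameters of the linearized function are $\theta_1$, $\theta_2$, $\theta_3 = a\, \delta\omega_3$. The design matrix and product matrix are

$$X = \begin{pmatrix} 1 & x_1 & x_1 \cos \omega_3 x_1 \\ 1 & x_2 & x_2 \cos \omega_3 x_2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N \cos \omega_3 x_N \end{pmatrix}$$

$$X^\dagger X = \begin{pmatrix} N & \sum_i x_i & \sum_i x_i \cos \omega_3 x_i \\ \sum_i x_i & \sum_i x_i^2 & \sum_i x_i^2 \cos \omega_3 x_i \\ \sum_i x_i \cos \omega_3 x_i & \sum_i x_i^2 \cos \omega_3 x_i & \sum_i x_i^2 (\cos \omega_3 x_i)^2 \end{pmatrix}$$

For the case where $\omega_3 x_i \ll 1$ for all $x_i$, the elements involving $\cos \omega_3 x_i \approx 1 - (\omega_3 x_i)^2/2$, so for very small $\omega_3 x_i$ the cosine factors will be very close to unity. The elements in the matrix are then degenerate with neighboring elements because the sine term in the model is degenerate with the linear term. In this case the design matrix is ill-conditioned and the determinant of $X^\dagger X \to 0$. For cases like this the fitting function should be redefined or singular value decomposition may be used.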
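The degeneracy of the bad model shows up directly in the condition number of the design matrix. The sketch below is illustrative only: the symbol `omega3` for the frequency-like parameter, the sample points and the truncation threshold are assumptions, and the truncated-SVD solution is one possible way of applying singular value decomposition, not necessarily the one intended in the lecture.

```python
import numpy as np

def bad_model_design(x, omega3):
    """Design matrix with columns 1, x, x*cos(omega3*x) for the linearized model."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x * np.cos(omega3 * x)])

x = np.linspace(0.0, 1.0, 20)
cond_small = np.linalg.cond(bad_model_design(x, 1e-3))   # omega3 * x << 1: nearly degenerate
cond_large = np.linalg.cond(bad_model_design(x, 5.0))    # columns clearly distinct

# Truncated-SVD least squares: drop singular values below a relative threshold.
X = bad_model_design(x, 1e-3)
y = 1.0 + 2.0 * x                                        # hypothetical noiseless data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
keep = s > 1e-6 * s[0]
theta_svd = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])

print("condition numbers:", cond_small, cond_large)
print("singular values:", s)
print("truncated-SVD parameters:", theta_svd)
```

The truncated solution still reproduces the data, but it shares the slope between the two nearly identical columns, so the individual values of the second and third parameters are not separately constrained.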

Frequentist-Bayesian Model Comparisons: A Simple Example

Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal