
EFFICIENT CALCULATION OF FISHER INFORMATION MATRIX: MONTE CARLO APPROACH USING PRIOR INFORMATION

by Sonjoy Das

A thesis submitted to The Johns Hopkins University in conformity with the requirements for the degree of Master of Science in Engineering.

Baltimore, Maryland
May, 2007

© Sonjoy Das 2007. All rights reserved.

Abstract

The Fisher information matrix (FIM) is a critical quantity in several aspects of mathematical modeling, including input selection and confidence region calculation. Analytical determination of the FIM in a general setting, especially for nonlinear models, may be difficult or almost impossible due to intractable modeling requirements and/or intractable high-dimensional integration. To circumvent these difficulties, a Monte Carlo (MC) simulation-based technique, the resampling algorithm, is usually recommended; it is based on values of the log-likelihood function, or of its exact stochastic gradient, computed from a set of pseudo data vectors. The current work proposes an extension of the resampling algorithm in order to enhance the statistical qualities of the estimator of the FIM. This modified resampling algorithm is useful in those cases where the FIM has a structure in which some elements are analytically known from prior information while the others are unknown. The estimator of the FIM obtained by the proposed algorithm simultaneously preserves the analytically known elements and reduces the variances of the estimators of the unknown elements by capitalizing on the information contained in the

known elements. Numerical illustrations considered in this work show considerable improvement of the new FIM estimator (in the sense of mean-squared error reduction as well as variance reduction), based on the modified MC-based resampling algorithm, over the FIM estimator based on the current MC-based resampling algorithm.

Advisor: Prof. James C. Spall, The Johns Hopkins University.

Acknowledgements

I would like to express my sincere gratitude and appreciation to my advisor, Professor James C. Spall, for his insightful suggestions, technical guidance, patience, support and encouragement throughout the course of this work. I would also like to formally express my sincere thanks and appreciation to the advisor of my doctoral program, Professor Roger Ghanem, at the Department of Civil and Environmental Engineering, University of Southern California, for his constant support and for allowing me to continue this work within the already busy schedule of my doctoral work, without which this work would definitely not have been completed. In addition, I thank Professor Daniel Q. Naiman for his support of my endeavor toward the successful completion of this work. I am thankful to Ms. Kristin Bechtel, Academic Program Coordinator of the department, for her constant help with the numerous department and university formalities essential for the completion of this work.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 A Glimpse of Fisher Information Matrix
  1.1 Motivating Factors
  1.2 Fisher Information Matrix: Definition and Notation
  1.3 Estimation of Fisher Information Matrix
    1.3.1 Current Re-sampling Algorithm: No Use of Prior Information

2 Improved Re-sampling Algorithm
  2.1 Additional Notation
  2.2 Modified Re-sampling Algorithm using Prior Information
  2.3 Theoretical Basis for the Modified Re-sampling Algorithm
    2.3.1 Case 1: only the measurements of L are available
    2.3.2 Case 2: measurements of g are available

3 Numerical Illustrations and Discussions
  3.1 Example 1
  3.2 Example 2
  3.3 Discussions on results

4 Conclusions and Future Research
  4.1 Summary of Contributions Made
  4.2 Suggestions for Future Research

Bibliography

Vita

List of Tables

3.1 Example 1: MSE and MSE reduction of FIM estimates based on N = 2000 pseudo data (results on variance are reported within parentheses).

3.2 Example 1: MSE comparison for $\hat F_n$ and $\tilde F_n$ for only the unknown elements of $F_n(\theta)$ according to $F_n^{(\mathrm{given})}$, based on N = 100000 pseudo data (similar results on variance are reported within parentheses; $A = \sum_{i=1}^p \sum_{j\in I_i^c} \mathrm{var}[\hat F_{ij}\,|\,\theta]$ and $B = \sum_{i=1}^p \sum_{j\in I_i^c} \mathrm{var}[\tilde F_{ij}\,|\,\theta]$).

3.3 Example 2: MSE and MSE reduction of FIM estimates based on N = 40000 pseudo data (results on variance are reported within parentheses).

3.4 Example 2: MSE comparison for $\hat F_n$ and $\tilde F_n$ for only the unknown elements of $F_n(\theta)$ according to $F_n^{(\mathrm{given})}$, based on N = 40000 pseudo data (similar results on variance are reported within parentheses; $A = \sum_{i=1}^p \sum_{j\in I_i^c} \mathrm{var}[\hat F_{ij}\,|\,\theta]$ and $B = \sum_{i=1}^p \sum_{j\in I_i^c} \mathrm{var}[\tilde F_{ij}\,|\,\theta]$).

3.5 Example 1: Variance comparison between theoretical results (correct up to a bias term as discussed in Section 2.3) and simulation results (with N = 2000).

List of Figures

1.1 Fisher information matrix with known elements as marked; void part consists of unknown elements.

1.2 Schematic model for forming FIM estimate, $\bar F_{M,N}$; adapted from [13].

2.1 Schematic of algorithm for computing the new FIM estimate, $\tilde F_n$.

3.1 $F_{30}^{(\mathrm{given})}$ for Example 2 with known elements as marked; void part consists of unknown elements to be estimated by using MC based resampling algorithms.

Chapter 1

A Glimpse of Fisher Information Matrix

The Fisher information matrix (FIM) plays a key role in estimation and identification [12, Section 13.3] and information theory [3, Section 17.7]. A standard problem in the practical application and theory of statistical estimation and identification is to estimate the unobservable parameter vector, $\theta$, of a probability distribution function from a set of observed data, $\{Z_1, \ldots, Z_n\}$, drawn from that distribution [4]. The FIM is an indicator of the amount of information contained in this observed data set about the unobservable parameter, $\theta$. If $\hat\theta$ is an unbiased estimator of the true value of $\theta$, then the inverse of the FIM computed at $\theta$ provides the Cramér-Rao lower bound on the covariance matrix of $\hat\theta$, implying that the difference between $\mathrm{cov}(\hat\theta)$ and the inverse of the FIM computed at $\theta$ is a positive semi-definite

matrix. Therefore, the name information matrix is used to indicate that a larger FIM (in the matrix sense of positive semi-definiteness as just mentioned) is associated with a smaller covariance matrix (i.e., more information), while a smaller FIM is associated with a larger covariance matrix (i.e., less information). Some important areas of application of the FIM include, to name a few, confidence interval computation for model parameters [2, 7, 16], configuration of experimental designs involving linear [9] and nonlinear [12, Section 17.4] models, and determination of noninformative prior distributions (Jeffreys prior) for Bayesian analysis [8].

1.1 Motivating Factors

However, analytical determination of the FIM in a general setting, especially in nonlinear models [16], may be a formidable undertaking due to intractable modeling requirements and/or intractable high-dimensional integration. To avoid this analytical problem, a computational technique based on Monte Carlo simulation, called the resampling approach [12, Section 13.3.5], [13], may be employed to estimate the FIM. In practice, there may be instances when some elements of the FIM are analytically known from prior information while the other elements are unknown (and need to be estimated). In a recent work [4, 5], a FIM of size $22 \times 22$ was observed to have the structure shown in Figure 1.1. In such cases, the above resampling approach, however, still yields the full FIM without taking any advantage of the information contained in the analytically known

elements.

[Figure 1.1: Fisher information matrix with known elements as marked (54 known elements); void part consists of unknown elements.]

The prior information related to the known elements of the FIM is not incorporated while employing this algorithm for estimation of the unknown elements. The resampling-based estimates of the known elements are also wasted, because these estimates are simply replaced by the analytically known elements. The issue yet to be examined is whether there is a way of focusing the averaging process (required in the resampling algorithm) on the elements of interest (the unknown elements that need to be estimated) that is more effective than simply extracting the estimates of those elements from the full FIM estimated by employing the existing resampling algorithm. The current work presents a modified and improved (in the sense of variance reduction) version of the resampling approach for estimating the unknown elements of the FIM by borrowing the information contained in the analytically known elements. The final outcome exactly preserves the known elements of the FIM and

produces variance reduction for the estimators of the unknown elements that need to be estimated. The main idea of the current work is presented in Section 2.2 and the relevant theoretical basis is highlighted in Section 2.3. The proposed resampling algorithm, which is similar in some sense to the one for Jacobian/Hessian estimates presented earlier [14], modifies and improves the current resampling algorithm by simultaneously preserving the known elements of the FIM and yielding better (in the sense of variance reduction) estimates of the unknown elements. The effectiveness and efficiency of the new resampling algorithm are illustrated by numerical examples in Chapter 3. Chapter 4 draws the conclusions from the current work.

1.2 Fisher Information Matrix: Definition and Notation

Consider a set of $n$ random data vectors $\{Z_1, \ldots, Z_n\}$ and stack them in $Z_n$, i.e., $Z_n = [Z_1^T, \ldots, Z_n^T]^T$. Here, the superscript $T$ is the transpose operator. Let the multivariate joint probability density or mass (or hybrid density/mass) function (pdf) of $Z_n$ be denoted by $p_{Z_n}(\cdot\,|\,\theta)$, parameterized by a vector of parameters, $\theta = [\theta_1, \ldots, \theta_p]^T \in \Theta \subseteq \mathbb{R}^p$, in which $\Theta$ is the set of allowable values for $\theta$. The likelihood function of $\theta$ is then given by $l(\theta\,|\,Z_n) = p_{Z_n}(Z_n\,|\,\theta)$ and the associated log-likelihood function $L$ by $L(\theta\,|\,Z_n) \equiv \ln l(\theta\,|\,Z_n)$.
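As a simple illustration (not part of the original text), if $Z_1, \ldots, Z_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\theta = [\mu, \sigma^2]^T$, the log-likelihood takes the familiar form

$$L(\theta\,|\,Z_n) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Z_i - \mu)^2,$$

and the definitions of the gradient, Hessian and FIM given next apply to this $L$ directly.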

Let us define the gradient vector $g$ of $L$ by $g(\theta\,|\,Z_n) = \partial L(\theta\,|\,Z_n)/\partial\theta$ and the Hessian matrix $H$ by $H(\theta\,|\,Z_n) = \partial^2 L(\theta\,|\,Z_n)/\partial\theta\,\partial\theta^T$. Here, $\partial(\cdot)/\partial\theta$ is the $p \times 1$ column vector of elements $\partial(\cdot)/\partial\theta_i$, $i = 1, \ldots, p$, $\partial(\cdot)/\partial\theta^T = [\partial(\cdot)/\partial\theta]^T$, and $\partial^2(\cdot)/\partial\theta\,\partial\theta^T$ is the $p \times p$ matrix of elements $\partial^2(\cdot)/\partial\theta_i\,\partial\theta_j$, $i, j = 1, \ldots, p$. Then, the $p \times p$ FIM, $F_n(\theta)$, is defined [12, Section 13.3.2] as follows,

$$F_n(\theta) \equiv E\left[\, g(\theta\,|\,Z_n)\, g^T(\theta\,|\,Z_n) \,\big|\, \theta \right] = -E\left[ H(\theta\,|\,Z_n) \,\big|\, \theta \right], \qquad (1.1)$$

provided that the derivatives and expectation (the expectation operator $E$ is with respect to the probability measure of $Z_n$) exist. The second equality in (1.1) follows [1, Proposition 3.4.4], [12, pp. 352-353] by assuming that $L$ is twice differentiable with respect to $\theta$ and that the regularity conditions [1, Section 3.4.2] hold for the likelihood function, $l$. The Hessian-based form above is more amenable to practical computation of the FIM than the gradient-based form used to define the FIM. It must be remarked here that there might be cases when the observed FIM, $-H(\theta\,|\,Z_n) \equiv -\partial^2 L(\theta\,|\,Z_n)/\partial\theta\,\partial\theta^T$, at a given data set, $Z_n$, is preferable to the actual FIM, $F_n(\theta) = -E[H(\theta\,|\,Z_n)\,|\,\theta]$, from an inference point of view and/or for computational simplicity (see, e.g., [6] for the scalar case, $\theta = \theta_1$).

1.3 Estimation of Fisher Information Matrix

The analytical computation of $F_n(\theta)$, in a general case, is a daunting task since it involves evaluation of the first or second order derivatives of $L$ and expectation of

the resulting multivariate, often non-linear, function, which becomes difficult or next to impossible. This analytical problem can be circumvented by employing a recently introduced simulation-based technique, the resampling approach [12, Section 13.3.5], [13]. This algorithm is essentially based on producing a large number (say $N$) of Hessian estimates from either the values of the log-likelihood function or (if available) its exact stochastic gradient; these, in turn, are computed from a set of pseudo data vectors, $\{Z_{\mathrm{pseudo}}(1), \ldots, Z_{\mathrm{pseudo}}(N)\}$, with each $Z_{\mathrm{pseudo}}(i)$, $i = 1, \ldots, N$, being digitally simulated from $p_{Z_n}(\cdot\,|\,\theta)$ and statistically independent of the others. The set of pseudo data vectors acts as a proxy for the observed data set in the resampling algorithm. The average of the negative of these Hessian estimates is reported as an estimate of $F_n(\theta)$. A brief outline of the existing resampling algorithm is provided in the following subsection.

1.3.1 Current Re-sampling Algorithm: No Use of Prior Information

For the $i$-th pseudo data vector, let the $k$-th estimate of the Hessian matrix, $H(\theta\,|\,Z_{\mathrm{pseudo}}(i))$, in the resampling algorithm be denoted by $\hat H_k^{(i)}$. Then, $\hat H_k^{(i)}$, as per the resampling scheme, is computed as [13],

$$\hat H_k^{(i)} = \frac{1}{2}\left\{ \frac{\delta G_k^{(i)}}{2c}\left[\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}\right] + \left( \frac{\delta G_k^{(i)}}{2c}\left[\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}\right] \right)^T \right\}, \qquad (1.2)$$

in which $c > 0$ is a small number, $\delta G_k^{(i)} \equiv G(\theta + c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i)) - G(\theta - c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i))$, and the perturbation vector $\Delta_k = [\Delta_{k1}, \ldots, \Delta_{kp}]^T$ is a user-generated random vector statistically independent of $Z_{\mathrm{pseudo}}(i)$. The random variables $\Delta_{k1}, \ldots, \Delta_{kp}$ are mean-zero and statistically independent and, in addition, the inverse moments $E[|1/\Delta_{km}|]$, $m = 1, \ldots, p$, are finite. The symmetrizing operation (the multiplier $1/2$ and the indicated sum) shown in (1.2) is useful in optimization problems to compute a symmetric estimate of the Hessian matrix with finite samples [11]. It also maintains a symmetric estimate of $F_n(\theta)$, which is itself a symmetric matrix.

Depending on the setting, $G(\cdot\,|\,Z_{\mathrm{pseudo}}(i))$, as required in $\delta G_k^{(i)}$, represents the $k$-th direct measurement or approximation of the gradient vector $g(\cdot\,|\,Z_{\mathrm{pseudo}}(i))$. If the direct measurement or computation of the exact gradient vector $g$ is feasible, $G(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i))$ represent the direct $k$-th measurements of $g(\cdot\,|\,Z_{\mathrm{pseudo}}(i))$ at $\theta \pm c\Delta_k$. If the direct measurement or computation of $g$ is not feasible, $G(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i))$ represents the $k$-th approximation of $g(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i))$ based on the values of $L(\cdot\,|\,Z_{\mathrm{pseudo}}(i))$; in this case, $G$ in (1.2) can be computed by using the classical finite-difference (FD) technique [12, Section 6.3] or the simultaneous perturbation (SP) gradient approximation technique [10], [12, Section 7.2] from the values of $L(\cdot\,|\,Z_{\mathrm{pseudo}}(i))$. For the computation of the gradient approximation based on the values of $L$, there are advantages to using the one-sided [12,

p. 199] SP gradient approximation (relative to the standard two-sided SP gradient approximation) in order to reduce the total number of function measurements or evaluations of $L$. The SP technique for gradient approximation is quite useful when $p$ is large and is usually superior to the FD technique for estimation of $F_n(\theta)$ when the resampling approach is employed to estimate the FIM. The formula for the one-sided gradient approximation using the SP technique is given by,

$$G^{(1)}(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i)) = \frac{1}{\tilde c}\left[ L(\theta + \tilde c\tilde\Delta_k \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i)) - L(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(i)) \right] \begin{bmatrix} \tilde\Delta_{k1}^{-1} \\ \vdots \\ \tilde\Delta_{kp}^{-1} \end{bmatrix}, \qquad (1.3)$$

in which the superscript $(1)$ in $G^{(1)}$ indicates that it is a one-sided gradient approximation ($G = G^{(1)}$), $\tilde c > 0$ is a small number, and $\tilde\Delta_k = [\tilde\Delta_{k1}, \ldots, \tilde\Delta_{kp}]^T$ is generated in the same statistical manner as $\Delta_k$, but is otherwise statistically independent of $\Delta_k$ and $Z_{\mathrm{pseudo}}(i)$. It is usually recommended that $\tilde c > c$. Note that $G^{(1)}(\cdot)$ in (1.3) depends not only on $\theta$, $c$, $\Delta_k$ and $Z_{\mathrm{pseudo}}(i)$ but also on $\tilde c$ and $\tilde\Delta_k$; these dependencies, however, have been suppressed for notational clarity. At this stage, let us also formally state that the perturbation vectors, $\Delta_k$ and $\tilde\Delta_k$, satisfy the following condition [10, 11], [12, Chapter 7].

C.1: (Statistical properties of the perturbation vectors) The random variables $\Delta_{km}$ (and $\tilde\Delta_{km}$), $k = 1, \ldots, N$, $m = 1, \ldots, p$, are statistically independent and almost surely (a.s.) uniformly bounded for all $k$, $m$, and are also

mean-zero and symmetrically distributed, satisfying $E[|1/\Delta_{km}|] < \infty$ (and $E[|1/\tilde\Delta_{km}|] < \infty$).

If the pseudo data vectors are expensive to simulate relative to the Hessian estimate, then it is preferable to generate several Hessian estimates per pseudo data vector by generating more than one (say, $M > 1$) perturbation vectors, $\Delta_1, \ldots, \Delta_M$, for each pseudo data vector, $Z_{\mathrm{pseudo}}(i)$, and thus computing $M$ Hessian estimates, $\hat H_1^{(i)}, \ldots, \hat H_M^{(i)}$, by employing (1.2) with these $M$ perturbation vectors and the pseudo data vector $Z_{\mathrm{pseudo}}(i)$. Let the average of, say, $k$ Hessian estimates, $\hat H_1^{(i)}, \ldots, \hat H_k^{(i)}$, for the $i$-th pseudo data vector, $Z_{\mathrm{pseudo}}(i)$, be denoted by $\bar H_k^{(i)}$. The computation of $\bar H_k^{(i)}$ can be conveniently carried out by using the standard recursive representation of the sample mean, in contrast to storing the matrices and averaging later. The recursive representation of $\bar H_k^{(i)}$ is given by,

$$\bar H_k^{(i)} = \frac{k-1}{k}\,\bar H_{k-1}^{(i)} + \frac{1}{k}\,\hat H_k^{(i)}, \qquad k = 1, \ldots, M, \qquad (1.4)$$

in which, at the first step ($k = 1$), $\bar H_0^{(i)}$ on the right-hand side can be set to a matrix of zeros or any arbitrary matrix of finite-valued elements to obtain the correct result. If the sample mean of the total $M$ Hessian estimates is now denoted by $\bar H^{(i)} \equiv \bar H_M^{(i)}$ (for the sake of simplification of notation), then an estimate, $\bar F_{M,N}$, of $F_n(\theta)$ is computed by averaging $\bar H^{(1)}, \ldots, \bar H^{(N)}$ and taking the negative value of the resulting average. A recursive representation identical to (1.4) for computing the sample mean of $\bar H^{(1)}, \ldots, \bar H^{(N)}$ is also recommended, to avoid storage of the matrices. The current resampling algorithm is schematically shown in Figure 1.2.
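Before turning to the schematic, the following is a minimal sketch (not part of the thesis) of the running-mean update (1.4), written as a NumPy helper that accepts any iterable of Hessian estimates so that the matrices never need to be stored simultaneously:

```python
import numpy as np

def recursive_mean(hessian_estimates, p):
    """Running mean of p-by-p Hessian estimates, cf. (1.4)."""
    H_bar = np.zeros((p, p))                 # H_bar_0 may be any finite-valued matrix
    for k, H_hat in enumerate(hessian_estimates, start=1):
        H_bar = ((k - 1) / k) * H_bar + (1.0 / k) * H_hat
    return H_bar
```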

[Figure 1.2: Schematic model for forming the FIM estimate, $\bar F_{M,N}$; adapted from [13]. Pseudo data vectors $Z_{\mathrm{pseudo}}(1), \ldots, Z_{\mathrm{pseudo}}(N)$ enter the Hessian estimation step; the per-datum Hessian estimates $\hat H_1^{(i)}, \ldots, \hat H_M^{(i)}$ are averaged to $\bar H^{(i)}$, and the negative average of $\bar H^{(1)}, \ldots, \bar H^{(N)}$ gives $\bar F_{M,N}$.]

However, it should be noted that $M = 1$ has certain optimality properties [13], and it is assumed throughout the work presented in this thesis that, for each pseudo data vector, only one perturbation vector is generated and thus only one Hessian estimate is computed. The current work can, however, be readily extended to the case when $M > 1$. Therefore, from now on, the index of the pseudo data vector will be changed from $i$ to $k$. Consequently, the pseudo data vector will be denoted by $Z_{\mathrm{pseudo}}(k)$, $k = 1, \ldots, N$, the gradient difference and the Hessian estimate in (1.2) will simply be denoted by $\delta G_k$ and $\hat H_k$, $k = 1, \ldots, N$, respectively, and the notation of the one-sided gradient approximation in (1.3) will take the form $G^{(1)}(\theta \pm c\Delta_k\,|\,Z_{\mathrm{pseudo}}(k))$. Finally, the following simplified notation for the estimate of the FIM will also be used from now on,

$$\hat F_n \equiv \bar F_{1,N}. \qquad (1.5)$$

The set of simplified notations just mentioned essentially conforms to the work in [12, Section 13.3.5], which is a simplified version of [13], in which the resampling approach was rigorously presented.

Let us also assume that the moments of $\Delta_{km}$ and $1/\Delta_{km}$ (and of $\tilde\Delta_{km}$ and $1/\tilde\Delta_{km}$) up to fifth order exist (this condition will be used later in Section 2.3). Since $\Delta_{km}$ (and $\tilde\Delta_{km}$) is symmetrically distributed, $1/\Delta_{km}$ (and $1/\tilde\Delta_{km}$) is also symmetrically distributed, implying that

I: (Statistical properties implied by C.1) All the odd moments of $\Delta_{km}$ and $1/\Delta_{km}$ (and of $\tilde\Delta_{km}$ and $1/\tilde\Delta_{km}$) up to fifth order are zero, i.e., $E[(\Delta_{km})^q] = 0$ and $E[(1/\Delta_{km})^q] = 0$ ($E[(\tilde\Delta_{km})^q] = 0$ and $E[(1/\tilde\Delta_{km})^q] = 0$), $q = 1, 3, 5$.

The random vectors $\Delta_k$ (and $\tilde\Delta_k$) are also independent across $k$. The random variables $\Delta_{k1}, \ldots, \Delta_{kp}$ (and $\tilde\Delta_{k1}, \ldots, \tilde\Delta_{kp}$) can also be chosen identically distributed. In fact, an independent and identically distributed (i.i.d.) (across both $k$ and $m$) mean-zero random variable satisfying C.1 is a perfectly valid choice for $\Delta_{km}$ (and $\tilde\Delta_{km}$). In particular, a Bernoulli $\pm 1$ random variable for $\Delta_{km}$ (and $\tilde\Delta_{km}$) is a valid, but not the only, choice among probability distributions satisfying C.1.

It must be noted from (1.2) and (1.3) that the current resampling algorithm will work satisfactorily provided the following condition is satisfied.

C.2: (Smoothness of L) The log-likelihood function, $L$, is twice continuously differentiable with respect to $\theta$, and all the possible second order derivatives are a.s. (a.s. with respect to the probability measure of $Z_{\mathrm{pseudo}}(k)$) continuous

and uniformly (in $k$) bounded $\forall\, \theta \in S_\theta \subseteq \Theta \subseteq \mathbb{R}^p$, in which $S_\theta$ is a small region containing $\hat\theta$, the value of $\theta$ at which the FIM needs to be estimated. (For example, if $|\Delta_{kl}| \le \rho$ and $|\tilde\Delta_{kl}| \le \tilde\rho$, $\forall\, k, l$, for some small numbers $\rho, \tilde\rho > 0$, then it is sufficient to consider $S_\theta$ as a hypercube in $\mathbb{R}^p$ with side length $2(c\rho + \tilde c\tilde\rho)$ and center $\hat\theta$.)

In all the gradient approximation techniques mentioned earlier (the FD technique, the two-sided SP technique and the one-sided SP technique), the gradient approximation, $G(\cdot\,|\,Z_{\mathrm{pseudo}}(k))$, has an $O_Z(\tilde c^2)$ bias provided the following is satisfied.

C.2': (Smoothness of L) The log-likelihood function, $L$, is thrice continuously differentiable with respect to $\theta$, and all the possible third order derivatives are a.s. continuous and bounded $\forall\, \theta \in S_\theta$.

The exact form of the bias term is, however, different for different techniques [12, Sections 6.4.1, 7.2.2]. The subscript in $O_Z(\cdot)$ explicitly indicates that the big-O term depends on $Z_{\mathrm{pseudo}}(k)$. This term is such that $|O_Z(\tilde c^2)/\tilde c^2| < \infty$ a.s. as $\tilde c \to 0$. The explicit expression of the bias, when the one-sided SP technique is employed, is reported in Section 2.3.1. In the case where the direct measurements or computations of $g$ are feasible, $G$ is an unbiased gradient estimate (measurement) of $g$.

It is shown in the next chapter that the FIM estimate in (1.5) is accurate to within an $(O(c^2) + O(\tilde c^2))$ bias in the case where only the measurements of the log-likelihood function $L$ are available, and to within an $O(c^2)$ bias in the case where the measurements

of the unbiased exact stochastic gradient vector $g$ are available, provided the following condition (stronger than C.2 and C.2') is satisfied.

C.3: (Smoothness of L) The log-likelihood function, $L$, is four times continuously differentiable with respect to $\theta$, and all the possible fourth order derivatives are a.s. continuous and uniformly (in $k$) bounded $\forall\, \theta \in S_\theta$.

Here, the big-O terms, $O(c^2)$ and $O(\tilde c^2)$, satisfy the conditions $|O(c^2)/c^2| < \infty$ as $c \to 0$ and $|O(\tilde c^2)/\tilde c^2| < \infty$ as $\tilde c \to 0$. The FIM estimate can be made as accurate as desired by reducing $c$ and $\tilde c$ and increasing the number of $\hat H_k$ values being averaged. In the next chapter, a modification of the existing resampling algorithm is proposed that enhances the statistical characteristics of the estimator of $F_n(\theta)$ when a part of $F_n(\theta)$ is analytically known a priori.
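Before moving to the modified algorithm of the next chapter, the following is a minimal Python sketch (not part of the thesis) of the current resampling estimate $\hat F_n$ just described, assuming $M = 1$, Bernoulli $\pm 1$ perturbations, and hypothetical user-supplied functions `loglik` (returning $L(\theta\,|\,Z_{\mathrm{pseudo}})$) and `simulate_pseudo_data` (drawing one pseudo data vector from $p_{Z_n}(\cdot\,|\,\theta)$).

```python
import numpy as np

def sp_gradient_one_sided(loglik, theta, c, delta, c_tilde, delta_tilde, z):
    """One-sided SP approximation (1.3) of g(theta + c*delta | z); pass -c for theta - c*delta.
    loglik is a hypothetical user-supplied function (theta, z) -> scalar log-likelihood."""
    shifted = theta + c * delta
    diff = loglik(shifted + c_tilde * delta_tilde, z) - loglik(shifted, z)
    return (diff / c_tilde) / delta_tilde              # componentwise division by Delta~_m

def fim_estimate_current(loglik, simulate_pseudo_data, theta, N,
                         c=1e-4, c_tilde=1.1e-4, rng=None):
    """Current resampling estimate F_hat_n = F_bar_{1,N} of the FIM, cf. (1.2)-(1.5), with M = 1."""
    rng = rng if rng is not None else np.random.default_rng()
    p = len(theta)
    F_hat = np.zeros((p, p))
    for k in range(1, N + 1):
        z = simulate_pseudo_data(rng)                  # pseudo data vector Z_pseudo(k)
        delta = rng.choice([-1.0, 1.0], size=p)        # Bernoulli +/-1 perturbations (condition C.1)
        delta_tilde = rng.choice([-1.0, 1.0], size=p)
        g_plus = sp_gradient_one_sided(loglik, theta, c, delta, c_tilde, delta_tilde, z)
        g_minus = sp_gradient_one_sided(loglik, theta, -c, delta, c_tilde, delta_tilde, z)
        outer = np.outer((g_plus - g_minus) / (2.0 * c), 1.0 / delta)
        H_hat = 0.5 * (outer + outer.T)                # symmetrized Hessian estimate, cf. (1.2)
        F_hat = ((k - 1) / k) * F_hat + (1.0 / k) * (-H_hat)  # running negative average, cf. (1.4)
    return F_hat
```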

Chapter 2

Improved Re-sampling Algorithm

Let the $k$-th estimate of the Hessian matrix, $H(\theta\,|\,Z_{\mathrm{pseudo}}(k))$, per the proposed resampling algorithm be denoted by $\tilde H_k$. In this chapter, the estimator $\tilde H_k$ is presented separately for two different cases: Case 1, when only the measurements of the log-likelihood function $L$ are available, and Case 2, when measurements of the exact gradient vector $g$ are available. To contrast the two cases, the superscript $(L)$ is used in $\tilde H_k^{(L)}$ and $\hat H_k^{(L)}$ to represent the dependence of $\tilde H_k$ and $\hat H_k$ on $L$ measurements for Case 1 and, similarly, the superscript $(g)$ in $\tilde H_k^{(g)}$ and $\hat H_k^{(g)}$ for Case 2. (The superscripts $(L)$ and $(g)$ should not be confused with the superscript $(i)$ appearing in (1.2), which referred to the pseudo data vector. The superscript $(i)$ was eventually suppressed since $M = 1$, as indicated in Section 1.3.1.)

2.1 Additional Notation

Denote the $(i, j)$-th element of $F_n(\theta)$ by $F_{ij}(\theta)$ (non-bold character, and suppressing the subscript $n$ in the symbolic notation representing the element of $F_n(\theta)$) for simplification of notation. Let $I_i$, $i = 1, \ldots, p$, be the set of column indices of the known elements of $F_n(\theta)$ for the $i$-th row, and let $I_i^c$ be the complement of $I_i$. Consider a $p \times p$ matrix $F_n^{(\mathrm{given})}$ whose $(i, j)$-th element, $F_{ij}^{(\mathrm{given})}$, is defined as follows,

$$F_{ij}^{(\mathrm{given})} = \begin{cases} F_{ij}(\theta), & \text{if } j \in I_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, p. \qquad (2.1)$$

Consider another $p \times p$ matrix $D_k$ defined by,

$$D_k = \begin{bmatrix} \Delta_{k1}/\Delta_{k1} & \Delta_{k1}/\Delta_{k2} & \cdots & \Delta_{k1}/\Delta_{kp} \\ \Delta_{k2}/\Delta_{k1} & \Delta_{k2}/\Delta_{k2} & \cdots & \Delta_{k2}/\Delta_{kp} \\ \vdots & \vdots & & \vdots \\ \Delta_{kp}/\Delta_{k1} & \Delta_{kp}/\Delta_{k2} & \cdots & \Delta_{kp}/\Delta_{kp} \end{bmatrix} = \Delta_k\left[ \Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1} \right], \qquad (2.2)$$

together with a corresponding matrix $\tilde D_k$ obtained by replacing all $\Delta_{ki}$ in $D_k$ with the corresponding $\tilde\Delta_{ki}$ (note that $D_k$ is symmetric when the perturbations are i.i.d. Bernoulli distributed).
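The following is a minimal NumPy sketch (not from the thesis) of the quantities defined in (2.1) and (2.2): the matrix $F_n^{(\mathrm{given})}$ built from a user-supplied dictionary of known elements (a hypothetical input format), and $D_k$ as the outer product $\Delta_k[\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}]$.

```python
import numpy as np

def build_F_given(p, known):
    """F_n^(given) per (2.1); known[(i, j)] = F_ij(theta) for j in I_i, zeros elsewhere."""
    F_given = np.zeros((p, p))
    for (i, j), value in known.items():
        F_given[i, j] = value
    return F_given

def build_D(delta):
    """D_k per (2.2): (i, j)-th element Delta_ki / Delta_kj."""
    return np.outer(delta, 1.0 / delta)
```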

2.2 Modified Re-sampling Algorithm using Prior Information

The new estimate, $\tilde H_k$, is extracted from $\tilde H_{k0}$, which is defined next separately for Case 1 and Case 2.

Case 1: only the measurements of the log-likelihood function $L$ are available,

$$\tilde H_{k0}^{(L)} = \hat H_k^{(L)} - \frac{1}{2}\left[ \tilde D_k^T \left( -F_n^{(\mathrm{given})} \right) D_k + \left( \tilde D_k^T \left( -F_n^{(\mathrm{given})} \right) D_k \right)^T \right]. \qquad (2.3)$$

Case 2: measurements of the exact gradient vector $g$ are available,

$$\tilde H_{k0}^{(g)} = \hat H_k^{(g)} - \frac{1}{2}\left[ \left( -F_n^{(\mathrm{given})} \right) D_k + \left( \left( -F_n^{(\mathrm{given})} \right) D_k \right)^T \right]. \qquad (2.4)$$

The estimates $\tilde H_k^{(L)}$ and $\tilde H_k^{(g)}$ are readily obtained from, respectively, $\tilde H_{k0}^{(L)}$ in (2.3) and $\tilde H_{k0}^{(g)}$ in (2.4) by replacing the $(i, j)$-th elements of $\tilde H_{k0}^{(L)}$ and $\tilde H_{k0}^{(g)}$ with the known values $-F_{ij}(\theta)$, $j \in I_i$, $i = 1, \ldots, p$. The new estimate, $\tilde F_n$, of $F_n(\theta)$ is then computed by averaging the Hessian estimates, $\tilde H_k$, and taking the negative value of the resulting average. For convenience, and also since the main objective is to estimate the FIM rather than the Hessian matrix, the replacement of the $(i, j)$-th elements of $\tilde H_{k0}$ with the known values at each $k$-th step is not required. The matrix $\tilde F_{n0}$ can first be obtained by computing the (negative) average of the matrices $\tilde H_{k0}$ and, subsequently, the $(i, j)$-th elements of $\tilde F_{n0}$ can be replaced with the analytically known elements, $F_{ij}(\theta)$, $j \in I_i$, $i = 1, \ldots, p$, of $F_n(\theta)$ to yield the new estimate, $\tilde F_n$. Consequently, the replacement needs to be done only once, at the end of averaging. However, if a better estimate of the Hessian matrix is also required (e.g., in optimization [11, 14]),

the replacement mentioned above for the Hessian estimates is recommended at each $k$-th step.

The matrices $\hat H_k^{(L)}$ in (2.3) and $\hat H_k^{(g)}$ in (2.4) need to be computed by using the existing resampling algorithm, implying that $\hat H_k^{(L)} \equiv \hat H_k$ or $\hat H_k^{(g)} \equiv \hat H_k$, as appropriate, with $\hat H_k$ given by (1.2). Note that $F_n^{(\mathrm{given})}$, as it appears on the right-hand sides of (2.3) and (2.4), is known by (2.1). It must be noted as well that the random perturbation vectors, $\Delta_k$ in $D_k$ and $\tilde\Delta_k$ in $\tilde D_k$, as required in (2.3) and (2.4), must be the same simulated values of $\Delta_k$ and $\tilde\Delta_k$ used in the existing resampling algorithm while computing the $k$-th estimate, $\hat H_k^{(L)}$ or $\hat H_k^{(g)}$.

Next, a summary of the salient steps required to produce the estimate $\tilde F_n$ (i.e., with the appropriate superscript, $\tilde F_n^{(L)}$ or $\tilde F_n^{(g)}$) of $F_n(\theta)$ per the modified resampling algorithm proposed here is presented below. Figure 2.1 is a schematic of the following steps.

Step 0. Initialization: Construct the matrix $F_n^{(\mathrm{given})}$ as defined by (2.1), based on the analytically known elements of the FIM. Determine $\theta$, the sample size $n$, and the number ($N$) of pseudo data vectors that will be generated. Determine whether the log-likelihood $L(\cdot)$ or the gradient vector $g(\cdot)$ will be used to compute the Hessian estimates $\hat H_k$. Pick a small number $c$ (perhaps $c = 0.0001$) to be used for Hessian estimation (see (1.2)) and, if required, another small number $\tilde c$ (perhaps $\tilde c = 0.00011$) for gradient approximation (see (1.3)). Set $k = 1$.

Step 1. At the $k$-th step, perform the following tasks.

a. Generation of pseudo data: Based on $\theta$, generate by MC simulation the $k$-th pseudo data vector of $n$ pseudo measurements, $Z_{\mathrm{pseudo}}(k)$.

b. Computation of $\hat H_k$: Generate the $k$-th perturbation vector $\Delta_k$ satisfying C.1 and, if required, the other perturbation vector $\tilde\Delta_k$ satisfying C.1 for gradient approximation. Using the $k$-th pseudo data vector $Z_{\mathrm{pseudo}}(k)$ and the perturbation vector(s) $\Delta_k$ and/or $\tilde\Delta_k$, evaluate $\hat H_k$ (i.e., $\hat H_k^{(L)}$ or $\hat H_k^{(g)}$) by using (1.2).

c. Computation of $D_k$ and $\tilde D_k$: Use $\Delta_k$ and/or $\tilde\Delta_k$, as generated in the above step, to construct the matrices $D_k$ and/or $\tilde D_k$ as defined in Section 2.1.

d. Computation of $\tilde H_{k0}$: Modify $\hat H_k$, as produced in Step 1b, by employing (2.3) or (2.4) as appropriate in order to generate $\tilde H_{k0}$ (i.e., $\tilde H_{k0}^{(L)}$ or $\tilde H_{k0}^{(g)}$).

Step 2. Average of $\tilde H_{k0}$: Repeat Step 1 until $N$ estimates, $\tilde H_{k0}$, are produced. Compute the (negative) mean of these $N$ estimates. (The standard recursive representation of the sample mean can be used here to avoid storage of the $N$ matrices $\tilde H_{k0}$.) The resulting (negative) mean is $\tilde F_{n0}$.

Step 3. Evaluation of $\tilde F_n$: The new estimate, $\tilde F_n$, of $F_n(\theta)$ per the modified re-

sampling algorithm is simply obtained by replacing the $(i, j)$-th elements of $\tilde F_{n0}$ with the analytically known elements, $F_{ij}(\theta)$, $j \in I_i$, $i = 1, \ldots, p$, of $F_n(\theta)$. To avoid the possibility of having a non-positive semi-definite estimate, it may be desirable to take the symmetric square root of the square of the estimate (the sqrtm function in MATLAB may be useful here).

[Figure 2.1: Schematic of the algorithm for computing the new FIM estimate, $\tilde F_n$. Pseudo data vectors $Z_{\mathrm{pseudo}}(1), \ldots, Z_{\mathrm{pseudo}}(N)$ enter the current resampling algorithm to produce $\hat H_1, \ldots, \hat H_N$; using $F_n^{(\mathrm{given})}$, these are converted to $\tilde H_1, \ldots, \tilde H_N$, whose negative average yields the new FIM estimate $\tilde F_n$.]

The new estimator, $\tilde F_n$, is better than $\hat F_n$ in the sense that it preserves exactly the analytically known elements of $F_n(\theta)$ and also reduces the variances of the estimators of the unknown elements of $F_n(\theta)$. The next section sketches the theoretical basis of the modified resampling algorithm proposed in this section.
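Putting Steps 0-3 together, the following is a minimal sketch (not from the thesis) of the Case 1 version of the modified algorithm. It reuses the hypothetical helpers sketched earlier (`sp_gradient_one_sided`, `build_F_given`, `build_D`), keeps the same $\Delta_k$ and $\tilde\Delta_k$ for the correction in (2.3), and performs the element replacement once, at the end (Step 3).

```python
import numpy as np

def fim_estimate_modified(loglik, simulate_pseudo_data, theta, F_given, known_idx,
                          N, c=1e-4, c_tilde=1.1e-4, rng=None):
    """Modified resampling estimate F_tilde_n (Case 1, log-likelihood values only).

    F_given   : matrix from (2.1), e.g. produced by build_F_given.
    known_idx : list of (i, j) pairs with j in I_i (the analytically known elements).
    """
    rng = rng if rng is not None else np.random.default_rng()
    p = len(theta)
    F0 = np.zeros((p, p))
    for k in range(1, N + 1):
        z = simulate_pseudo_data(rng)
        delta = rng.choice([-1.0, 1.0], size=p)
        delta_tilde = rng.choice([-1.0, 1.0], size=p)
        # Step 1b: current Hessian estimate per (1.2) with one-sided SP gradients (1.3)
        g_plus = sp_gradient_one_sided(loglik, theta, c, delta, c_tilde, delta_tilde, z)
        g_minus = sp_gradient_one_sided(loglik, theta, -c, delta, c_tilde, delta_tilde, z)
        outer = np.outer((g_plus - g_minus) / (2.0 * c), 1.0 / delta)
        H_hat = 0.5 * (outer + outer.T)
        # Steps 1c-1d: feedback correction of (2.3), using the SAME delta and delta_tilde
        D, D_tilde = build_D(delta), build_D(delta_tilde)
        known_part = D_tilde.T @ (-F_given) @ D
        H_tilde0 = H_hat - 0.5 * (known_part + known_part.T)
        # Step 2: recursive negative average
        F0 = ((k - 1) / k) * F0 + (1.0 / k) * (-H_tilde0)
    # Step 3: restore the analytically known elements exactly
    F_tilde = F0.copy()
    for (i, j) in known_idx:
        F_tilde[i, j] = F_given[i, j]
    return F_tilde
```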

2.3 Theoretical Basis for the Modified Re-sampling Algorithm

It should be remarked here, for the sake of clarification, that the dependence of $\tilde H_k$ and $\hat H_k$ on $\theta$, $Z_{\mathrm{pseudo}}(k)$, $\Delta_k$, $c$ and/or $\tilde\Delta_k$ and $\tilde c$ is suppressed throughout this work for simplification of notation. Now, for further notational simplification, the subscript "pseudo" in $Z_{\mathrm{pseudo}}(k)$ and the dependence of $Z(k)$ on $k$ will be suppressed (note that $Z_{\mathrm{pseudo}}(k)$ is identically distributed across $k$ because $Z_{\mathrm{pseudo}}(k) \sim p_{Z_n}(\cdot\,|\,\theta)$). Since $\Delta_k$ is usually assumed to be statistically independent across $k$, and an identical condition for $\tilde\Delta_k$ is also assumed, the dependence on $k$ will also be suppressed in the subsequent development. Let the $(i, j)$-th elements of $\hat H_k$ and $\tilde H_k$ be denoted, respectively, by $\hat H_{ij}$ and $\tilde H_{ij}$ with the appropriate superscript. In the following, the mean and variance of both $\hat H_{ij}$ and $\tilde H_{ij}$ are computed for the two cases in two different subsections. It is also shown that the variance of the new estimator, $\tilde H_{ij}$, is less than the variance of the existing estimator, $\hat H_{ij}$, yielding a better estimate of $F_n(\theta)$ in the process.

2.3.1 Case 1: only the measurements of L are available

The mean and variance of $\hat H_{ij}^{(L)}$ are computed first, followed by the mean and variance of $\tilde H_{ij}^{(L)}$. The subsequent comparison of variances will show the superiority of $\tilde H_k^{(L)}$, which leads to the superiority of $\tilde F_n^{(L)}$.

In deducing the expressions for the mean and variance in the following, the conditions C.1 and C.3 are required along with the additional condition stated below. (The condition C.2 only ensures that employing the resampling algorithm will yield some estimate; it cannot guarantee the boundedness of the several big-O terms mentioned earlier or encountered while deducing the expressions for the mean and variance below.)

C.4: (Existence of covariances involving $H_{lm}$, $\partial H_{lm}/\partial\theta_s$ and $\partial^2 H_{lm}/\partial\theta_r\partial\theta_s$) All the combinations of covariance terms involving $H_{lm}(\theta\,|\,Z)$, $\partial H_{lm}(\theta\,|\,Z)/\partial\theta_s$ and $\partial^2 H_{lm}(\theta\,|\,Z)/\partial\theta_r\partial\theta_s$, $l, m, s, r = 1, \ldots, p$, exist $\forall\, \theta \in S_\theta$ (see C.2 for the definition of $S_\theta$).

Mean and Variance of $\hat H_{ij}^{(L)}$

It is assumed here that the gradient estimate is based on the one-sided gradient approximation using the SP technique given by (1.3). Based on a Taylor expansion, the $i$-th component of $G^{(1)}(\theta\,|\,Z)$, $i = 1, \ldots, p$, which is an approximation of the $i$-th component, $g_i(\theta\,|\,Z) \equiv \partial L(\theta\,|\,Z)/\partial\theta_i$, of $g(\theta\,|\,Z)$ based on the values of $L(\cdot\,|\,Z)$, is given by,

$$G_i^{(1)}(\theta\,|\,Z) = \frac{L(\theta + \tilde c\tilde\Delta\,|\,Z) - L(\theta\,|\,Z)}{\tilde c\,\tilde\Delta_i} \qquad (2.5)$$
$$= \frac{1}{\tilde c\,\tilde\Delta_i}\left[\left( L(\theta\,|\,Z) + \tilde c\sum_{l=1}^p g_l(\theta\,|\,Z)\tilde\Delta_l + \frac{\tilde c^2}{2}\sum_{l=1}^p\sum_{m=1}^p H_{lm}(\theta\,|\,Z)\tilde\Delta_m\tilde\Delta_l + \frac{\tilde c^3}{6}\sum_{l=1}^p\sum_{m=1}^p\sum_{s=1}^p \frac{\partial^3 L(\bar\theta\,|\,Z)}{\partial\theta_s\,\partial\theta_m\,\partial\theta_l}\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l \right) - L(\theta\,|\,Z)\right]$$
$$= \sum_l g_l(\theta)\frac{\tilde\Delta_l}{\tilde\Delta_i} + \frac{\tilde c}{2}\sum_{l,m} H_{lm}(\theta)\frac{\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i} + \frac{\tilde c^2}{6}\sum_{l,m,s} \frac{\partial H_{lm}(\bar\theta)}{\partial\theta_s}\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}, \qquad (2.6)$$

in which $H_{lm}(\theta\,|\,Z) \equiv \partial^2 L(\theta\,|\,Z)/\partial\theta_l\,\partial\theta_m$ is the $(l, m)$-th element of $H(\theta\,|\,Z)$, $\bar\theta = \lambda(\theta + \tilde c\tilde\Delta) + (1 - \lambda)\theta = \theta + \tilde c\lambda\tilde\Delta$ (with $\lambda \in [0, 1]$ being some real number) denotes a point on the line segment between $\theta$ and $\theta + \tilde c\tilde\Delta$; in the expression after the last equality, the condition on $Z$ is suppressed for notational clarity and the summations are expressed in an abbreviated format in which the indices span their respective and appropriate ranges. This convention will be followed throughout this work unless the explicit dependence on $Z$ and an expanded summation format need to be invoked for the sake of clarity in the context being discussed. Now note that

$$E[G_i^{(1)}(\theta\,|\,Z)\,|\,\theta, Z] = \sum_l g_l(\theta\,|\,Z)\,E\!\left[\frac{\tilde\Delta_l}{\tilde\Delta_i}\right] + 0 + \frac{\tilde c^2}{6}\sum_{l,m,s} \frac{\partial H_{lm}(\bar\theta\,|\,Z)}{\partial\theta_s}\,E\!\left[\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}\right] = g_i(\theta\,|\,Z) + O_Z(\tilde c^2), \qquad (2.7)$$

in which the point of evaluation, $\bar\theta$, is suppressed in the random big-O term, $O_Z(\tilde c^2) = (1/6)\,\tilde c^2 \sum_{l,m,s} [\partial H_{lm}(\bar\theta\,|\,Z)/\partial\theta_s]\,E[(\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l)/\tilde\Delta_i]$, for notational clarity. The subscript in $O_Z(\cdot)$ explicitly indicates that this term depends on $Z$ and satisfies $|O_Z(\tilde c^2)/\tilde c^2| < \infty$ almost surely (a.s.) (a.s. with respect to the joint probability measure of $Z$) as $\tilde c \to 0$, by C.1 and C.2'. The first equality follows from the fact that the second term in (2.6) vanishes upon expectation, since it involves either $E[\tilde\Delta_r]$ or $E[1/\tilde\Delta_r]$, $r = 1, \ldots, p$, both of which are zero by implication I. The second equality follows by the condition C.1. Now $G_i(\theta \pm c\Delta\,|\,Z) \equiv G_i^{(1)}(\theta \pm c\Delta\,|\,Z)$, which is required for computing $\hat H_{ij}^{(L)}$, needs to be calculated. Suppressing the condition on $Z$ (again for simplification of no-

tation, especially when the dependence is not important in the context of the topic being discussed) and using the abbreviated summation notation indicated earlier, $G_i^{(1)}(\theta \pm c\Delta)$, with subsequent and further Taylor expansion, can be written as follows,

$$G_i^{(1)}(\theta \pm c\Delta) = \underbrace{\sum_l g_l(\theta \pm c\Delta)\frac{\tilde\Delta_l}{\tilde\Delta_i}}_{\text{1st part}} + \underbrace{\frac{\tilde c}{2}\sum_{l,m} H_{lm}(\theta \pm c\Delta)\frac{\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i} + \frac{\tilde c^2}{6}\sum_{l,m,s}\frac{\partial H_{lm}(\bar\theta \pm c\Delta)}{\partial\theta_s}\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}}_{\text{2nd part}}$$
$$= \sum_l\left[ g_l(\theta) + (\pm c)\sum_m H_{lm}(\theta)\Delta_m + \frac{(\pm c)^2}{2}\sum_{m,s}\frac{\partial H_{lm}(\theta)}{\partial\theta_s}\Delta_s\Delta_m + \frac{(\pm c)^3}{6}\sum_{m,s,r}\frac{\partial^2 H_{lm}(\bar\theta_1^{\pm})}{\partial\theta_r\,\partial\theta_s}\Delta_r\Delta_s\Delta_m \right]\frac{\tilde\Delta_l}{\tilde\Delta_i}$$
$$\quad + \underbrace{\frac{\tilde c}{2}(\pm c)\sum_{l,m,s}\frac{\partial H_{lm}(\bar\theta_2^{\pm})}{\partial\theta_s}\Delta_s\frac{\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i} + \frac{\tilde c^2}{6}(\pm c)\sum_{l,m,s,r}\frac{\partial^2 H_{lm}(\tilde\theta^{\pm})}{\partial\theta_r\,\partial\theta_s}\Delta_r\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}}_{\text{3rd part}}$$
$$\quad + \left[\frac{\tilde c}{2}\sum_{l,m} H_{lm}(\theta)\frac{\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i} + \frac{\tilde c^2}{6}\sum_{l,m,s}\frac{\partial H_{lm}(\bar\theta)}{\partial\theta_s}\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}\right]. \qquad (2.8)$$

Here, $\bar\theta_1^{\pm}$ and $\bar\theta_2^{\pm}$ denote points on the line segments between $\theta$ and $\theta \pm c\Delta$, $\tilde\theta^{\pm}$ denotes points on the line segments between $\bar\theta$ and $\bar\theta \pm c\Delta$, and the superscript $\pm$ sign in $\bar\theta_1^{\pm}$, $\bar\theta_2^{\pm}$ and $\tilde\theta^{\pm}$ corresponds to whether $\theta + c\Delta$ or $\theta - c\Delta$ is being considered as the argument of $G_i^{(1)}(\cdot)$ in (2.8). Given $G_i(\theta \pm c\Delta\,|\,Z) \equiv G_i^{(1)}(\theta \pm c\Delta\,|\,Z)$ by (2.8), the $(i, j)$-th element of $\hat H_k^{(L)}$ can readily be obtained from,

$$\hat H_{ij}^{(L)} = \frac{1}{2}\left[ \frac{G_i(\theta + c\Delta\,|\,Z) - G_i(\theta - c\Delta\,|\,Z)}{2c\,\Delta_j} + \frac{G_j(\theta + c\Delta\,|\,Z) - G_j(\theta - c\Delta\,|\,Z)}{2c\,\Delta_i} \right]. \qquad (2.9)$$

Consider a typical term (say, the first term on the right-hand side of (2.9)) and denote it by $\hat J_{ij}^{(L)}$ ($J$ is to indicate Jacobian, for which the symmetrizing operation should not

be used). By the use of (2.8), $\hat J_{ij}^{(L)}$ can then be written as,

$$\hat J_{ij}^{(L)} \equiv \frac{G_i(\theta + c\Delta\,|\,Z) - G_i(\theta - c\Delta\,|\,Z)}{2c\,\Delta_j} = \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + O_{\Delta,\tilde\Delta,Z}(c^2) + O_{\Delta,\tilde\Delta,Z}(\tilde c) + O_{\Delta,\tilde\Delta,Z}(\tilde c^2). \qquad (2.10)$$

The subscripts in $O_{\Delta,\tilde\Delta,Z}(\cdot)$ explicitly indicate that these random big-O terms depend on $\Delta$, $\tilde\Delta$ and $Z$. By the use of C.1 and C.3, it should be noted that these terms are such that $|O_{\Delta,\tilde\Delta,Z}(c^2)/c^2| < \infty$ almost surely (a.s.) (a.s. with respect to the joint probability measure of $\Delta$, $\tilde\Delta$ and $Z$) as $c \to 0$ and, also, both $|O_{\Delta,\tilde\Delta,Z}(\tilde c)/\tilde c| < \infty$ a.s. and $|O_{\Delta,\tilde\Delta,Z}(\tilde c^2)/\tilde c^2| < \infty$ a.s. as $\tilde c \to 0$. The explicit expressions of the big-O terms in (2.10) are shown separately below,

$$O_{\Delta,\tilde\Delta,Z}(c^2) = \frac{c^2}{12}\sum_{l,m,s,r}\left( \frac{\partial^2 H_{lm}(\bar\theta_1^{+}\,|\,Z)}{\partial\theta_r\,\partial\theta_s} + \frac{\partial^2 H_{lm}(\bar\theta_1^{-}\,|\,Z)}{\partial\theta_r\,\partial\theta_s} \right)\frac{\Delta_r\Delta_s\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i},$$
$$O_{\Delta,\tilde\Delta,Z}(\tilde c) = \frac{\tilde c}{4}\sum_{l,m,s}\left( \frac{\partial H_{lm}(\bar\theta_2^{+}\,|\,Z)}{\partial\theta_s} + \frac{\partial H_{lm}(\bar\theta_2^{-}\,|\,Z)}{\partial\theta_s} \right)\frac{\Delta_s}{\Delta_j}\frac{\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i},$$
$$O_{\Delta,\tilde\Delta,Z}(\tilde c^2) = \frac{\tilde c^2}{12}\sum_{l,m,s,r}\left( \frac{\partial^2 H_{lm}(\tilde\theta^{+}\,|\,Z)}{\partial\theta_r\,\partial\theta_s} + \frac{\partial^2 H_{lm}(\tilde\theta^{-}\,|\,Z)}{\partial\theta_r\,\partial\theta_s} \right)\frac{\Delta_r}{\Delta_j}\frac{\tilde\Delta_s\tilde\Delta_m\tilde\Delta_l}{\tilde\Delta_i}.$$

The effects of $O_{\Delta,\tilde\Delta,Z}(\tilde c^2)$ are not included in $O_{\Delta,\tilde\Delta,Z}(\tilde c)$. The reason for showing $O_{\Delta,\tilde\Delta,Z}(\tilde c)$ separately in (2.10) is that this term vanishes upon expectation, because it involves either $E[\tilde\Delta_r]$ or $E[1/\tilde\Delta_r]$, $r = 1, \ldots, p$, both of which are zero by implication I, and the remaining factors appearing in $O_{\Delta,\tilde\Delta,Z}(\tilde c)$ do not depend on $\tilde\Delta$ (as mentioned below (2.8), $\bar\theta_2^{\pm}$ in $O_{\Delta,\tilde\Delta,Z}(\tilde c)$ depend only on $\theta$ and $\theta \pm c\Delta$). The other

terms, $O_{\Delta,\tilde\Delta,Z}(c^2)$ and $O_{\Delta,\tilde\Delta,Z}(\tilde c^2)$, do not vanish upon expectation, yielding,

$$E[\hat J_{ij}^{(L)}\,|\,\theta] = E\!\left[\sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\,\Big|\,\theta\right] + E\!\left[O_{\Delta,\tilde\Delta,Z}(c^2)\,|\,\theta\right] + E\!\left[O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta\right] + E\!\left[O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta\right]$$
$$= E\!\left[\sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\,\Big|\,\theta\right] + O(c^2) + 0 + O(\tilde c^2)$$
$$= \sum_{l,m} E\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right] E\!\left[\frac{\Delta_m}{\Delta_j}\right] E\!\left[\frac{\tilde\Delta_l}{\tilde\Delta_i}\right] + O(c^2) + O(\tilde c^2)$$
$$= E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right] + O(c^2) + O(\tilde c^2). \qquad (2.11)$$

Note that the big-O terms, $O(c^2)$ and $O(\tilde c^2)$, satisfying $|O(c^2)/c^2| < \infty$ as $c \to 0$ and $|O(\tilde c^2)/\tilde c^2| < \infty$ as $\tilde c \to 0$ (by C.1 and C.3), are deterministic, unlike the random big-O terms in (2.10). The third equality follows from the fact that $\Delta$, $\tilde\Delta$ and $Z$ are statistically independent of each other. The fourth equality follows by the condition C.1. In (2.11), $E[H_{ij}(\theta\,|\,Z)\,|\,\theta]$ has not been substituted by $-F_{ij}(\theta)$, in order to keep the derivation of (2.11) as general as possible, since (2.11) is applicable to the Jacobian estimate as well. However, in the context of the FIM, $E[H_{ij}(\theta\,|\,Z)\,|\,\theta] = -F_{ij}(\theta)$ by the (Hessian-based) definition, using which $E[\hat H_{ij}^{(L)}\,|\,\theta]$ is obtained as,

$$E[\hat H_{ij}^{(L)}\,|\,\theta] = \frac{1}{2}\left( E[\hat J_{ij}^{(L)}\,|\,\theta] + E[\hat J_{ji}^{(L)}\,|\,\theta] \right) = \frac{1}{2}\left( E[H_{ij}(\theta\,|\,Z)\,|\,\theta] + E[H_{ji}(\theta\,|\,Z)\,|\,\theta] \right) + O(c^2) + O(\tilde c^2) = -F_{ij}(\theta) + O(c^2) + O(\tilde c^2). \qquad (2.12)$$

The last equality follows from the fact that the FIM, $F_n(\theta)$, is symmetric.

The variance of $\hat H_{ij}^{(L)}$ is to be computed next. It is given by,

$$\mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta] = \frac{1}{4}\mathrm{var}\!\left[\left(\hat J_{ij}^{(L)} + \hat J_{ji}^{(L)}\right)\,\big|\,\theta\right] = \frac{1}{4}\left( \mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta] + \mathrm{var}[\hat J_{ji}^{(L)}\,|\,\theta] + 2\,\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta] \right). \qquad (2.13)$$

The expression of a typical variance term, $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$, in (2.13) is determined first, followed by the deduction of the expression for the covariance term, $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$. By the use of (2.10), $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$ is given by,

$$\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta] = \mathrm{var}\!\left[\left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + O_{\Delta,\tilde\Delta,Z}(c^2) + O_{\Delta,\tilde\Delta,Z}(\tilde c) + O_{\Delta,\tilde\Delta,Z}(\tilde c^2) \right)\,\Big|\,\theta\right]$$
$$= \mathrm{var}\!\left[\sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\,\Big|\,\theta\right] + \mathrm{var}\!\left[O_{\Delta,\tilde\Delta,Z}(c^2)\,|\,\theta\right] + \mathrm{var}\!\left[O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta\right] + \mathrm{var}\!\left[O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta\right]$$
$$\quad + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(c^2)\,|\,\theta\right] + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta\right] + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta\right]$$
$$\quad + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(c^2), O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta\right] + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(c^2), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta\right] + 2\,\mathrm{cov}\!\left[O_{\Delta,\tilde\Delta,Z}(\tilde c), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta\right], \qquad (2.14)$$

in which $O_{\Delta,\tilde\Delta,Z}(1)$ represents the term $\sum_{l,m} H_{lm}(\theta\,|\,Z)(\Delta_m/\Delta_j)(\tilde\Delta_l/\tilde\Delta_i)$. Note that the variance terms, $\mathrm{var}[O_{\Delta,\tilde\Delta,Z}(c^2)\,|\,\theta]$, $\mathrm{var}[O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta]$ and $\mathrm{var}[O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta]$, respectively, turn out to be $O(c^4)$, $O(\tilde c^2)$ and $O(\tilde c^4)$ by C.4. While the covariance terms $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(c^2)\,|\,\theta]$, $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta]$ and $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(c^2), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta]$, respectively, turn out to be $O(c^2)$, $O(\tilde c^2)$ and $O(c^2\tilde c^2)$ by C.4, a few other covariance terms,

namely $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(1), O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta]$ and $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(c^2), O_{\Delta,\tilde\Delta,Z}(\tilde c)\,|\,\theta]$, vanish since, for any given $i$, they involve terms like $E[(\tilde\Delta_{l_1}\tilde\Delta_{m_2}\tilde\Delta_{l_2})/\tilde\Delta_i^2]$, $l_1, l_2, m_2 = 1, \ldots, p$, all of which are zero by implication I, and the remaining factors appearing in $O_{\Delta,\tilde\Delta,Z}(\tilde c)$ and $O_{\Delta,\tilde\Delta,Z}(c^2)$ do not depend on $\tilde\Delta$ (as already mentioned below (2.8), $\bar\theta_1^{\pm}$ in $O_{\Delta,\tilde\Delta,Z}(c^2)$ and $\bar\theta_2^{\pm}$ in $O_{\Delta,\tilde\Delta,Z}(\tilde c)$ depend only on $\theta$ and $\theta \pm c\Delta$). Although the remaining term, $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(\tilde c), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta]$, in (2.14) involves terms like $E[(\tilde\Delta_{m_1}\tilde\Delta_{l_1}\tilde\Delta_{l_2}\tilde\Delta_{m_2}\tilde\Delta_{n_2})/\tilde\Delta_i^2]$, $l_1, m_1, l_2, m_2, n_2 = 1, \ldots, p$, which are zero by implication I, the big-O term $O_{\Delta,\tilde\Delta,Z}(\tilde c^2)$ contains $\tilde\theta^{\pm}$, which depends on $\bar\theta$ and hence on $\tilde\Delta$ (see below (2.6) and (2.8)). Therefore, $2\,\mathrm{cov}[O_{\Delta,\tilde\Delta,Z}(\tilde c), O_{\Delta,\tilde\Delta,Z}(\tilde c^2)\,|\,\theta]$ is not zero and turns out to be $O(\tilde c^3)$ by C.4. Then, by absorbing the term $O(c^4)$ into $O(c^2)$ and both $O(\tilde c^4)$ and $O(\tilde c^3)$ into $O(\tilde c^2)$, the variance $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$ turns out to be,

$$\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta] = \mathrm{var}\!\left[\sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\,\Big|\,\theta\right] + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= E\!\left[\left(\sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\right)^2\Big|\,\theta\right] - \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= \sum_{l_1,m_1,l_2,m_2} E\left[\{H_{l_1m_1}(\theta\,|\,Z)H_{l_2m_2}(\theta\,|\,Z)\}\,|\,\theta\right] E\!\left[\frac{\Delta_{m_1}\Delta_{m_2}}{\Delta_j^2}\right] E\!\left[\frac{\tilde\Delta_{l_1}\tilde\Delta_{l_2}}{\tilde\Delta_i^2}\right] - \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= \sum_{l,m} E\left[\{H_{lm}(\theta\,|\,Z)\}^2\,|\,\theta\right] E\!\left[\frac{\Delta_m^2}{\Delta_j^2}\right] E\!\left[\frac{\tilde\Delta_l^2}{\tilde\Delta_i^2}\right] - \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= \sum_{l,m} \left\{ \mathrm{var}\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right] + \left(E\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 \right\} E\!\left[\frac{\Delta_m^2}{\Delta_j^2}\right] E\!\left[\frac{\tilde\Delta_l^2}{\tilde\Delta_i^2}\right] - \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$

$$= \sum_{l,m} a_{lm}(i, j)\,\mathrm{var}\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right] + \sum_{\substack{l,m\\ lm \neq ij}} a_{lm}(i, j)\left(E\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2), \qquad (2.15)$$

in which $a_{lm}(i, j) = E[\Delta_m^2/\Delta_j^2]\,E[\tilde\Delta_l^2/\tilde\Delta_i^2]$. The first term after the third equality follows from the fact that $\Delta$, $\tilde\Delta$ and $Z$ are statistically independent of each other, and the second term has already been seen during the derivation of $E[\hat J_{ij}^{(L)}\,|\,\theta]$ in (2.11). The fourth equality follows by condition C.1 and implication I. Note that $\mathrm{var}[\hat H_{ii}^{(L)}\,|\,\theta]$ follows directly from the expression of $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$ by replacing $j$ with $i$ in (2.15).

Next, the expression of $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$, $j \neq i$, is deduced. By using an argument identical to the one mentioned right after (2.14), it can readily be shown that $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$ is given by,

$$\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta] = \mathrm{cov}\!\left[\left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} \right), \left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_i}\frac{\tilde\Delta_l}{\tilde\Delta_j} \right)\Big|\,\theta\right] + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2). \qquad (2.16)$$

Further simplification of the first term on the right-hand side yields,

$$\mathrm{cov}\!\left[\left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} \right), \left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_i}\frac{\tilde\Delta_l}{\tilde\Delta_j} \right)\Big|\,\theta\right]$$
$$= E\!\left[\left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} \right)\left( \sum_{l,m} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_i}\frac{\tilde\Delta_l}{\tilde\Delta_j} \right)\Big|\,\theta\right] - E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]E\left[H_{ji}(\theta\,|\,Z)\,|\,\theta\right]$$
$$= \sum_{l_1,m_1,l_2,m_2} E\left[\{H_{l_1m_1}(\theta\,|\,Z)H_{l_2m_2}(\theta\,|\,Z)\}\,|\,\theta\right] E\!\left[\frac{\Delta_{m_1}\Delta_{m_2}}{\Delta_i\Delta_j}\right] E\!\left[\frac{\tilde\Delta_{l_1}\tilde\Delta_{l_2}}{\tilde\Delta_i\tilde\Delta_j}\right] - (F_{ij}(\theta))^2$$
$$= \left\{ E\left[H_{ij}(\theta\,|\,Z)H_{ji}(\theta\,|\,Z)\,|\,\theta\right] + E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] + E\left[H_{jj}(\theta\,|\,Z)H_{ii}(\theta\,|\,Z)\,|\,\theta\right] + E\left[H_{ji}(\theta\,|\,Z)H_{ij}(\theta\,|\,Z)\,|\,\theta\right] \right\} - (F_{ij}(\theta))^2$$

$$= 2E\left[H_{ij}(\theta\,|\,Z)H_{ji}(\theta\,|\,Z)\,|\,\theta\right] + 2E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2$$
$$= 2E\left[\{H_{ij}(\theta\,|\,Z)\}^2\,|\,\theta\right] + 2E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2$$
$$= 2\left\{ \mathrm{var}\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right] + \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 \right\} + 2E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2. \qquad (2.17)$$

Here, the first term after the second equality follows from the fact that $\Delta$, $\tilde\Delta$ and $Z$ are statistically independent of each other, and the second term after the second equality follows by the Hessian-based definition and the symmetry of the FIM. The third equality follows by the use of the fact that $j \neq i$ and by condition C.1 and implication I. Therefore, $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$ turns out to be,

$$\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta] = 2\left\{ \mathrm{var}\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right] + \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 \right\} + 2E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2), \quad \text{for } j \neq i. \qquad (2.18)$$

Now, the variance of $\hat H_{ij}^{(L)}$, $\mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$, for $j \neq i$, can readily be obtained from (2.13) by using (2.15) and (2.18). As already indicated, $\mathrm{var}[\hat H_{ii}^{(L)}\,|\,\theta]$ is the same as $\mathrm{var}[\hat J_{ii}^{(L)}\,|\,\theta]$, which can be obtained directly from (2.15) by replacing $j$ with $i$. Since the primary objective of the current work is to show that the variances of the new estimators, $\tilde F_{ij}$, of the elements, $F_{ij}(\theta)$, of $F_n(\theta)$ are less than those of the current estimators, $\hat F_{ij}$, it suffices to show that $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta]$ is less than $\mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$. This latter fact can be proved by comparing the contributions of the variance and covariance terms to $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta]$ with the contributions of the respective variance and

covariance terms (as they appear in (2.13)) to $\mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$. This is the task of the next subsection.

Mean and Variance of $\tilde H_{ij}^{(L)}$

We now derive the mean and variance of the elements of $\tilde H_k^{(L)}$ for the purpose of contrasting them with the same quantities for $\hat H_{ij}^{(L)}$. Consider the $(i, j)$-th element of $\tilde H_k$ associated with (2.3), which is given by,

$$\tilde H_{ij}^{(L)} = \hat H_{ij}^{(L)} - \frac{1}{2}\left[ \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_i}\frac{\tilde\Delta_l}{\tilde\Delta_j} \right], \quad j \in I_i^c$$
$$= \frac{1}{2}\left( \hat J_{ij}^{(L)} + \hat J_{ji}^{(L)} \right) - \frac{1}{2}\left[ \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_i}\frac{\tilde\Delta_l}{\tilde\Delta_j} \right]$$
$$= \frac{1}{2}\left( \tilde J_{ij}^{(L)} + \tilde J_{ji}^{(L)} \right), \quad j \in I_i^c, \qquad (2.19)$$

and

$$\tilde H_{ij}^{(L)} = -F_{ij}(\theta), \quad j \in I_i. \qquad (2.20)$$

In (2.19), $\tilde J_{ij}^{(L)}$ is defined as,

$$\tilde J_{ij}^{(L)} = \hat J_{ij}^{(L)} - \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}, \quad j \in I_i^c. \qquad (2.21)$$

Note that, $\forall\, j \in I_i^c$ in (2.21),

$$E\!\left[ \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i}\,\Big|\,\theta \right] = 0,$$

implying that $E[\tilde J_{ij}^{(L)}] = E[\hat J_{ij}^{(L)}]$, $j \in I_i^c$. By using this fact along with (2.19), (2.11) and the symmetry of the FIM (for all $j \in I_i^c$), and by using (2.20) (for all $j \in I_i$), the

following can be obtained readily, $\forall\, i = 1, \ldots, p$,

$$E[\tilde H_{ij}^{(L)}\,|\,\theta] = \begin{cases} -F_{ij}(\theta) + O(c^2) + O(\tilde c^2), & j \in I_i^c, \\ -F_{ij}(\theta), & j \in I_i. \end{cases} \qquad (2.22)$$

Noticeably, the expressions of the big-O terms both in (2.12) and in the first equation of (2.22) are precisely the same, implying that $E[\tilde H_{ij}^{(L)}\,|\,\theta] = E[\hat H_{ij}^{(L)}\,|\,\theta]$, $j \in I_i^c$. While $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta] = 0$, $j \in I_i$, clearly implying that $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta] < \mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$, $j \in I_i$, deduction of the expression of $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta]$, $j \in I_i^c$, is the task that will be considered now. In fact, this is the main result associated with the variance reduction obtained from the prior information available in terms of the known elements of $F_n(\theta)$.

The first step in determining this variance is to note that the expression of $\hat J_{ij}^{(L)}$ in (2.10) can be decomposed into two parts as shown below,

$$\hat J_{ij}^{(L)} = \sum_l\sum_{m\in I_l} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + \sum_l\sum_{m\in I_l^c} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + O_{\Delta,\tilde\Delta,Z}(c^2) + O_{\Delta,\tilde\Delta,Z}(\tilde c) + O_{\Delta,\tilde\Delta,Z}(\tilde c^2). \qquad (2.23)$$

The elements, $H_{lm}(\theta\,|\,Z)$, of $H(\theta\,|\,Z)$ on the right-hand side of (2.23) are not known. However, since by the (Hessian-based) definition $E[H_{lm}(\theta\,|\,Z)\,|\,\theta] = -F_{lm}(\theta)$, approximation of the unknown elements of $H(\theta\,|\,Z)$ on the right-hand side of (2.23), particularly those elements that correspond to the elements of the FIM that are known a priori, by the negative of those elements of $F_n(\theta)$ is the primary idea based on which the modified resampling algorithm is developed. This approximation introduces an error term, $e_{lm}(\theta\,|\,Z)$, defined by,

$$H_{lm}(\theta\,|\,Z) = -F_{lm}(\theta) + e_{lm}(\theta\,|\,Z), \quad m \in I_l,\ l = 1, \ldots, p, \qquad (2.24)$$

and this error term satisfies the following two conditions, which directly follow from (2.24), $\forall\, m \in I_l$, $l = 1, \ldots, p$,

$$E[e_{lm}(\theta\,|\,Z)\,|\,\theta] = 0, \qquad (2.25)$$
$$\mathrm{var}[e_{lm}(\theta\,|\,Z)\,|\,\theta] = \mathrm{var}[H_{lm}(\theta\,|\,Z)\,|\,\theta]. \qquad (2.26)$$

Substitution of (2.24) into (2.23) results in a known part on the right-hand side of (2.23) involving the analytically known elements of the FIM. This known part is transferred to the left-hand side of (2.23) and, consequently, acts as a feedback to the current resampling algorithm yielding, in the process,

$$\tilde J_{ij}^{(L)} = \hat J_{ij}^{(L)} - \sum_l\sum_{m\in I_l} (-F_{lm}(\theta))\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} = \sum_l\left( \sum_{m\in I_l} e_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + \sum_{m\in I_l^c} H_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} \right) + O_{\Delta,\tilde\Delta,Z}(c^2) + O_{\Delta,\tilde\Delta,Z}(\tilde c) + O_{\Delta,\tilde\Delta,Z}(\tilde c^2), \quad j \in I_i^c. \qquad (2.27)$$

This expression of $\tilde J_{ij}^{(L)}$ can be employed to show that the new estimator, $\tilde H_{ij}^{(L)}$, in (2.19) has less variance than the current estimator, $\hat H_{ij}^{(L)}$. This is achieved in the following. Introduce $X_{lm}$, $l = 1, \ldots, p$, as defined below,

$$X_{lm}(\theta\,|\,Z) = \begin{cases} e_{lm}(\theta\,|\,Z), & \text{if } m \in I_l, \\ H_{lm}(\theta\,|\,Z), & \text{if } m \in I_l^c. \end{cases} \qquad (2.28)$$

Then, (2.27) can be compactly written as,

$$\tilde J_{ij}^{(L)} = \sum_{l,m} X_{lm}(\theta\,|\,Z)\frac{\Delta_m}{\Delta_j}\frac{\tilde\Delta_l}{\tilde\Delta_i} + O_{\Delta,\tilde\Delta,Z}(c^2) + O_{\Delta,\tilde\Delta,Z}(\tilde c) + O_{\Delta,\tilde\Delta,Z}(\tilde c^2), \quad j \in I_i^c. \qquad (2.29)$$

On the other hand, the variance of $\tilde H_{ij}^{(L)}$ in (2.19) is given by,

$$\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta] = \frac{1}{4}\mathrm{var}\!\left[\left(\tilde J_{ij}^{(L)} + \tilde J_{ji}^{(L)}\right)\,\big|\,\theta\right] = \frac{1}{4}\left( \mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta] + \mathrm{var}[\tilde J_{ji}^{(L)}\,|\,\theta] + 2\,\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta] \right). \qquad (2.30)$$

Therefore, $\mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta]$ and $\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta]$ are required to be computed next, in order to compare them, respectively, with $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$ in (2.15) and $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$ in (2.18), consequently facilitating the comparison between $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta]$ and $\mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$, $j \in I_i^c$.

The variance of $\tilde J_{ij}^{(L)}$, $j \in I_i^c$, can be computed by treating the right-hand side of (2.29) in a way identical to that described for $\hat J_{ij}^{(L)}$ in the previous subsection. The expression for $\mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta]$, $j \in I_i^c$, follows readily from (2.15) by replacing $H_{lm}$ with $X_{lm}$, because of the similarity between (2.10) and (2.29), as shown below, $\forall\, j \in I_i^c$, $i = 1, \ldots, p$,

$$\mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta] = \sum_{l,m} a_{lm}(i, j)\,\mathrm{var}\left[X_{lm}(\theta\,|\,Z)\,|\,\theta\right] + \sum_{\substack{l,m\\ lm\neq ij}} a_{lm}(i, j)\left(E\left[X_{lm}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= \sum_{l,m} a_{lm}(i, j)\,\mathrm{var}\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right] + \sum_{\substack{l,\, m\in I_l^c\\ lm\neq ij}} a_{lm}(i, j)\left(E\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2). \qquad (2.31)$$

The second equality follows by the use of (2.28) and a subsequent use of (2.25)-(2.26) on the resulting expression.

Subtracting (2.31) from (2.15), it can readily be seen that, $\forall\, j \in I_i^c$,

$$\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta] - \mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta] = \sum_l\sum_{m\in I_l} a_{lm}(i, j)\left(E\left[H_{lm}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= \sum_l\sum_{m\in I_l} a_{lm}(i, j)\left(F_{lm}(\theta)\right)^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2) > 0. \qquad (2.32)$$

Here, the second equality follows by the Hessian-based definition of the FIM, $E[H_{lm}(\theta\,|\,Z)\,|\,\theta] = -F_{lm}(\theta)$, and the inequality follows from the fact that $a_{lm}(i, j) = E[\Delta_m^2/\Delta_j^2]\,E[\tilde\Delta_l^2/\tilde\Delta_i^2] > 0$, $l, m = 1, \ldots, p$, for any given $(i, j)$, and by assuming that at least one of the known elements, $F_{lm}(\theta)$, in (2.32) is not equal to zero. It must be remarked that the bias terms, $O(c^2)$, $O(\tilde c^2)$ and $O(c^2\tilde c^2)$, can be made negligibly small by selecting $c$ and $\tilde c$ small enough, these being primarily controlled by the user. Note that if $\Delta_1, \ldots, \Delta_p$ and $\tilde\Delta_1, \ldots, \tilde\Delta_p$ are both assumed to be Bernoulli $\pm 1$ i.i.d. random variables, then $a_{lm}(i, j)$ turns out to be unity.

At this point it is already clear that $\mathrm{var}[\tilde H_{ii}^{(L)}\,|\,\theta] < \mathrm{var}[\hat H_{ii}^{(L)}\,|\,\theta]$ if $j = i \in I_i^c$, since $\mathrm{var}[\tilde H_{ii}^{(L)}\,|\,\theta]$ and $\mathrm{var}[\hat H_{ii}^{(L)}\,|\,\theta]$ are exactly given by $\mathrm{var}[\tilde J_{ii}^{(L)}\,|\,\theta]$ and $\mathrm{var}[\hat J_{ii}^{(L)}\,|\,\theta]$, respectively, and $\mathrm{var}[\tilde J_{ii}^{(L)}\,|\,\theta] < \mathrm{var}[\hat J_{ii}^{(L)}\,|\,\theta]$ by (2.32). The next step is to compare $\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta]$ to $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$, $j \neq i$, $j \in I_i^c$, in order to conclude that $\mathrm{var}[\tilde H_{ij}^{(L)}\,|\,\theta] < \mathrm{var}[\hat H_{ij}^{(L)}\,|\,\theta]$. As $\mathrm{var}[\tilde J_{ij}^{(L)}\,|\,\theta]$ is deduced from $\mathrm{var}[\hat J_{ij}^{(L)}\,|\,\theta]$ by the similarity of the expressions between (2.10) and (2.29), the expression of $\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta]$ also follows readily by replacing $H_{lm}$ with $X_{lm}$ and using

arguments identical to those used in deducing (2.16) and (2.17). The expression of $\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta]$, therefore, turns out to be, for $j \neq i$, $j \in I_i^c$, $i = 1, \ldots, p$,

$$\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta] = 2\left\{ \mathrm{var}\left[X_{ij}(\theta\,|\,Z)\,|\,\theta\right] + \left(E\left[X_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 \right\} + 2E\left[X_{ii}(\theta\,|\,Z)X_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2)$$
$$= 2\left\{ \mathrm{var}\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right] + \left(E\left[H_{ij}(\theta\,|\,Z)\,|\,\theta\right]\right)^2 \right\} + 2E\left[X_{ii}(\theta\,|\,Z)X_{jj}(\theta\,|\,Z)\,|\,\theta\right] - (F_{ij}(\theta))^2 + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2). \qquad (2.33)$$

Since $j \in I_i^c$, the second equality follows from the definition of $X_{ij}$ in (2.28). In (2.33), $X_{ii}$ and $X_{jj}$ must take their form from one of the following four possibilities: (1) $e_{ii}$ and $e_{jj}$, (2) $e_{ii}$ and $H_{jj}$, (3) $H_{ii}$ and $e_{jj}$, (4) $H_{ii}$ and $H_{jj}$. Subtracting (2.33) from (2.18), the difference between $\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta]$ and $\mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta]$ is given by,

$$\mathrm{cov}[\hat J_{ij}^{(L)}, \hat J_{ji}^{(L)}\,|\,\theta] - \mathrm{cov}[\tilde J_{ij}^{(L)}, \tilde J_{ji}^{(L)}\,|\,\theta] = 2\left( E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - E\left[X_{ii}(\theta\,|\,Z)X_{jj}(\theta\,|\,Z)\,|\,\theta\right] \right) + O(c^2) + O(\tilde c^2) + O(c^2\tilde c^2), \quad \text{for } j \neq i,\ j \in I_i^c. \qquad (2.34)$$

For the first possibility, it is shown below that $E[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta] - E[X_{ii}(\theta\,|\,Z)X_{jj}(\theta\,|\,Z)\,|\,\theta]$ is given by $F_{ii}(\theta)F_{jj}(\theta)$, which is greater than 0 since $F_n(\theta)$ is a positive definite matrix, and a positive definite matrix always has strictly positive diagonal entries.

$$E\left[H_{ii}(\theta\,|\,Z)H_{jj}(\theta\,|\,Z)\,|\,\theta\right] - E\left[X_{ii}(\theta\,|\,Z)X_{jj}(\theta\,|\,Z)\,|\,\theta\right]$$