Improved Methods for Monte Carlo Estimation of the Fisher Information Matrix

2008 American Control Conference, Westin Seattle Hotel, Seattle, Washington, USA, June 11-13, 2008

James C. Spall (james.spall@jhuapl.edu)
The Johns Hopkins University Applied Physics Laboratory, Laurel, Maryland 20723-6099, U.S.A.

Summary: The Fisher information matrix summarizes the amount of information in a set of data relative to the quantities of interest and forms the basis for the Cramér-Rao (lower) bound on the uncertainty in an estimate. There are many applications of the information matrix in modeling, systems analysis, and estimation. This paper presents a resampling-based method for computing the information matrix together with some new theory related to efficient implementation. We show how certain properties associated with the likelihood function and the error in the estimates of the Hessian matrix can be exploited to improve the accuracy of the Monte Carlo-based estimate of the information matrix.

Key words: System identification; Monte Carlo simulation; Cramér-Rao bound; simultaneous perturbation (SPSA); likelihood function.

1. Problem Setting

The Fisher information matrix has long played an important role in parameter estimation and system identification (e.g., Ljung, 1999). In many practical problems, however, the information matrix is difficult or impossible to obtain analytically. This includes nonlinear and/or non-Gaussian models as well as some linear problems (e.g., Segal and Weinstein, 1988; Levy, 1995). In previous work (Spall, 2005), the author presented a relatively simple Monte Carlo means of obtaining the Fisher information matrix for use in complex estimation settings. In contrast to the conventional approach, there is no need to analytically compute the expected value of either the Hessian matrix of the log-likelihood function or the outer product of the gradient of the log-likelihood function. The Monte Carlo approach can work with either evaluations of the log-likelihood function or evaluations of the gradient of the log-likelihood function, depending on what information is available. The required expected value in the definition of the information matrix is estimated via Monte Carlo averaging combined with a simulation-based generation of artificial data. An extension to this basic Monte Carlo method is given in Das et al. (2007), where it is shown that prior knowledge of some of the elements in the information matrix can lead to improved estimates for all elements.

This paper introduces two simple modifications to the approach of Spall (2005) that improve the accuracy of the Monte Carlo estimate. These modifications may be implemented with little more difficulty than the original approach, and they may be used with or without the approach of Das et al. (2007) for handling prior information.

2. The Fisher Information Matrix and Associated Approximations

Consider a collection of n random vectors $Z \equiv [z_1, z_2, \ldots, z_n]$; these vectors are not necessarily i.i.d. Let us assume that the general form for the joint probability density or probability mass (or hybrid density/mass) function for the random data matrix Z is known, but that this function depends on an unknown vector θ. Let the probability density/mass function for Z be $p_Z(\zeta \mid \theta)$, where ζ ("zeta") is a dummy matrix representing the possible outcomes for the elements in Z. The corresponding likelihood function, say $\ell(\theta \mid \zeta)$, satisfies

$\ell(\theta \mid \zeta) = p_Z(\zeta \mid \theta)$.   (2.1)
With the definition of the likelihood function in (2.1), we are now in a position to present the Fisher information matrix. The expectations below are with respect to the data set Z. Let $L(\theta) \equiv \log \ell(\theta \mid Z)$ (so we are suppressing the data dependence in L). The p × p information matrix $F(\theta)$ for a differentiable log-likelihood function is given by

$F(\theta) \equiv E\!\left[\frac{\partial L}{\partial \theta}\cdot\frac{\partial L}{\partial \theta^T}\right]$.   (2.2)

In the case where the underlying data $\{z_1, z_2, \ldots, z_n\}$ are independent, the magnitude of $F(\theta)$ will grow at a rate proportional to n, since L then represents a sum of n random terms. The bounded quantity $\bar F(\theta) \equiv F(\theta)/n$ is employed as an average information matrix over all measurements. Note also that when the data depend on some analyst-specified inputs, then $F(\theta)$ also depends on those inputs. For notational convenience, and since many applications depend on cases (such as i.i.d. data) where there are no inputs, we suppress this dependence and write $F(\theta)$ for the information matrix.
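For concreteness, a standard textbook illustration of (2.2) and of the growth proportional to n: take n i.i.d. scalar measurements $z_i \sim N(\theta, \sigma^2)$ with $\sigma^2$ known. Then

$L(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(z_i-\theta)^2 + \text{const}, \qquad \frac{\partial L}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(z_i-\theta),$

$F(\theta) = E\!\left[\left(\frac{\partial L}{\partial \theta}\right)^{2}\right] = \frac{n}{\sigma^2} = -E\!\left[\frac{\partial^2 L}{\partial \theta^2}\right],$

so $F(\theta)$ grows linearly in n, $\bar F(\theta) = 1/\sigma^2$, and the outer-product form (2.2) agrees with the Hessian-based form (2.3) below.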

Except for relatively simple problems, however, the form in (2.2) is generally not useful in the practical calculation of the information matrix: computing the expectation of a product of multivariate nonlinear functions is usually a hopeless task. A well-known equivalent form follows by assuming that L is twice continuously differentiable in θ (a.s. in Z); that is, the Hessian matrix

$H(\theta) \equiv \frac{\partial^2 L}{\partial \theta\,\partial \theta^T}$

is assumed to exist. Further, assume that the likelihood function is regular in the sense that standard conditions such as those in Wilks (1962) or Bickel and Doksum (1977) hold; one of these conditions is that the set $\{\zeta : \ell(\theta \mid \zeta) > 0\}$ does not depend on θ. A fundamental implication of this regularity of the likelihood is that the necessary interchanges of differentiation and integration are valid. Then the information matrix is related to the Hessian matrix of L through

$F(\theta) = -E\big[H(\theta)\big]$.   (2.3)

The form in (2.3) is usually more amenable to calculation than the product-based form in (2.2). Note that in some applications, the observed information matrix at a particular data set Z (i.e., $-H(\theta)$) may be easier to compute and/or preferred from an inference point of view relative to the actual information matrix $F(\theta)$ in (2.3) (e.g., Efron and Hinkley, 1978). Although the method in this paper is described for the determination of $F(\theta)$, the efficient Hessian estimation described in Section 4 may also be used directly for the determination of the observed information $-H(\theta)$ when it is not easy to calculate the Hessian directly.

Expression (2.3) directly motivates a Monte Carlo simulation-based approach, as given in Spall (2005). Let $Z_{\text{pseudo}}(i)$ be a Monte Carlo-generated random matrix from the assumed distribution for the actual data, based on the parameters θ taking on some specified value (typically an estimated value). Note that $Z_{\text{pseudo}}(i)$ represents a sample of size n, analogous to the real data Z, and that $\dim(Z_{\text{pseudo}}(i)) = \dim(Z)$. Further, let $\hat H_k^{(i)}$ represent the kth estimate of $H(\theta)$ at the pseudo data $Z_{\text{pseudo}}(i)$; $\hat H_k^{(i)}$ is to be used in an averaging process as described below. The estimate $\hat H_k^{(i)}$ is generated via efficient simultaneous perturbation (SPSA) principles (Spall, 1992) using either log-likelihood values $L(\theta)$ (alone) or the gradient (score vector) $g(\theta) \equiv \partial L(\theta)/\partial \theta$ if that is available. The former usually corresponds to cases where the likelihood function and associated nonlinear process are so complex that no gradients are available.

To highlight the fundamental commonality of approach, let $G_k^{(i)}(\theta)$ represent either a gradient approximation (based on $L(\theta)$ values) or the exact gradient $g(\theta)$, as used in the kth Hessian estimate at the given $Z_{\text{pseudo}}(i)$. Let $\Delta_k^{(i)} \equiv [\Delta_{k1}^{(i)}, \Delta_{k2}^{(i)}, \ldots, \Delta_{kp}^{(i)}]^T$ be a mean-zero random vector such that the scalar elements $\Delta_{kj}^{(i)}$ are independent, identically distributed, symmetrically distributed random variables that are uniformly bounded in magnitude and satisfy $E\big(|\Delta_{kj}^{(i)}|^{-1}\big) < \infty$. The latter (inverse-moment) condition excludes such commonly used Monte Carlo distributions as the uniform and the Gaussian. Note that the user has full control over the choice of the $\Delta_{kj}^{(i)}$ distribution; a valid (and simple) choice is the Bernoulli ±1 distribution (it is not known at this time if this is the best distribution to choose for this application).
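In Python terms, a minimal sketch of a perturbation generator satisfying these conditions might look as follows (all names here are illustrative):

import numpy as np

def bernoulli_perturbation(p, rng):
    # Symmetric, mean-zero, bounded, and bounded away from zero, so the
    # inverse-moment condition E(|Delta_j|^{-1}) < infinity holds trivially.
    return rng.choice([-1.0, 1.0], size=p)

rng = np.random.default_rng(0)
delta = bernoulli_perturbation(p=5, rng=rng)
# Uniform or Gaussian perturbations are excluded because their densities place
# mass arbitrarily close to zero, making E(|Delta_j|^{-1}) infinite.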
The formula for estimating the Hessian at the point θ is

$\hat H_k^{(i)} = \frac{1}{2}\left[\frac{\delta G_k^{(i)}}{2c}\big(\Delta_k^{(i)\,-1}\big)^T + \left(\frac{\delta G_k^{(i)}}{2c}\big(\Delta_k^{(i)\,-1}\big)^T\right)^{\!T}\right]$,   (2.4)

where $\delta G_k^{(i)} \equiv G_k^{(i)}(\theta + c\Delta_k^{(i)}) - G_k^{(i)}(\theta - c\Delta_k^{(i)})$, $\Delta_k^{(i)\,-1}$ denotes the vector of inverses of the p individual elements of $\Delta_k^{(i)}$, and c > 0 is a small constant. The prime rationale for (2.4) is that $\hat H_k^{(i)}$ is a nearly unbiased estimator of the unknown $H(\theta)$. Spall (2000) gives conditions such that the Hessian estimate has an $O(c^2)$ bias (the main such condition is smoothness of $L(\theta)$, as reflected in the assumption that $g(\theta)$ is thrice continuously differentiable in θ near the nominal value of interest). The symmetrizing operation in (2.4) (the multiple 1/2 and the indicated sum) is convenient for maintaining a symmetric Hessian estimate.

To illustrate how the individual Hessian estimates may be quite poor, note that $\hat H_k^{(i)}$ in (2.4) has (at most) rank two (and may not even be positive semi-definite). This low quality, however, does not prevent the information matrix estimate of interest from being accurate, since it is not the Hessian per se that is of interest; the averaging process eliminates the inadequacies of the individual Hessian estimates.

The main source of efficiency for (2.4) is the fact that the estimate requires only a small (fixed) number of gradient or log-likelihood values for any dimension p. When gradient values are available, only two evaluations are needed (i.e., the two values $G_k^{(i)}(\theta \pm c\Delta_k^{(i)}) = g(\theta \pm c\Delta_k^{(i)})$, evaluated at $Z_{\text{pseudo}}(i)$, are used to form the Hessian estimate). When only log-likelihood values are available, each of the gradient approximations $G_k^{(i)}(\theta \pm c\Delta_k^{(i)})$ requires two evaluations of $L(\cdot)$; hence, one approximation $\hat H_k^{(i)}$ uses four log-likelihood values.
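As a minimal numpy sketch of one Hessian estimate of the form (2.4) in the gradient-based case (the log-likelihood-only case uses the gradient approximations (2.5) below), with all function and variable names chosen here for illustration:

import numpy as np

def sp_hessian_from_gradient(grad, theta, c, rng):
    """One simultaneous-perturbation Hessian estimate, eq. (2.4), using exact
    gradient (score) values evaluated on a single pseudo data set."""
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)            # Bernoulli +/-1 perturbation
    dG = grad(theta + c * delta) - grad(theta - c * delta)
    S = np.outer(dG / (2.0 * c), 1.0 / delta)           # (dG / 2c) (Delta^{-1})^T
    return 0.5 * (S + S.T)                               # symmetrized, rank <= 2

# Example: for a quadratic log-likelihood with constant Hessian A, a single
# estimate averages to A over the perturbation distribution.
A = -np.diag([1.0, 2.0, 3.0])
rng = np.random.default_rng(1)
H_hat = sp_hessian_from_gradient(lambda t: A @ t, np.zeros(3), c=1e-3, rng=rng)

Only the two gradient calls per estimate are needed, regardless of the dimension p, which is the efficiency property noted above.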

The gradient approximations at the two design levels $\theta \pm c\Delta_k^{(i)}$ are

$G_k^{(i)}(\theta \pm c\Delta_k^{(i)}) = \frac{L(\theta \pm c\Delta_k^{(i)} + \tilde c\,\tilde\Delta_k^{(i)}) - L(\theta \pm c\Delta_k^{(i)} - \tilde c\,\tilde\Delta_k^{(i)})}{2\tilde c}\,\tilde\Delta_k^{(i)\,-1}$,   (2.5)

where $\tilde\Delta_k^{(i)} = [\tilde\Delta_{k1}^{(i)}, \tilde\Delta_{k2}^{(i)}, \ldots, \tilde\Delta_{kp}^{(i)}]^T$ is generated in the same statistical manner as $\Delta_k^{(i)}$, but independently of $\Delta_k^{(i)}$ (in particular, choosing the $\tilde\Delta_{kj}^{(i)}$ as independent Bernoulli ±1 random variables is a valid but not necessary choice), $\tilde\Delta_k^{(i)\,-1}$ denotes the vector of inverses of the p elements of $\tilde\Delta_k^{(i)}$, and $\tilde c > 0$ (like c) is a small constant.

The Monte Carlo approach of Spall (2005) is based on a double averaging scheme. The inner average forms Hessian estimates at a given $Z_{\text{pseudo}}(i)$ (i = 1, 2, ..., N) from the k = 1, 2, ..., M values of $\hat H_k^{(i)}$, and the outer average combines these sample-mean Hessian estimates across the N values of pseudo data. Therefore, the Monte Carlo-based estimate of $F(\theta)$ in Spall (2005), denoted $\hat F_{M,N}(\theta)$, is

$\hat F_{M,N}(\theta) \equiv -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{M}\sum_{k=1}^{M}\hat H_k^{(i)}$,   (2.6)

or, in equivalent recursive (in i = 1, 2, ..., N) form,

$\hat F_{M,i}(\theta) = \frac{i-1}{i}\,\hat F_{M,i-1}(\theta) - \frac{1}{iM}\sum_{k=1}^{M}\hat H_k^{(i)}$,   (2.7)

with $\hat F_{M,0} = 0$. The aim of this paper is to introduce two modifications to the basic form in (2.6) and (2.7).
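Before turning to the modifications, a compact end-to-end sketch of the baseline estimator (2.4)-(2.6) for the log-likelihood-only case is given below. The names make_pseudo_data and loglik_given_data are placeholders for the user-supplied pseudo-data generator and log-likelihood, and the sign convention follows (2.3) and (2.6), i.e., the averaged Hessian estimates are negated:

import numpy as np

def sp_hessian_from_loglik(loglik, theta, c, c_tilde, rng):
    """One Hessian estimate from log-likelihood values only (eqs. (2.4)-(2.5));
    uses four L evaluations and two independent perturbation vectors."""
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)
    delta_t = rng.choice([-1.0, 1.0], size=p)      # independent second perturbation
    def G(x):                                      # SP gradient approximation, eq. (2.5)
        return (loglik(x + c_tilde * delta_t) -
                loglik(x - c_tilde * delta_t)) / (2.0 * c_tilde) / delta_t
    dG = G(theta + c * delta) - G(theta - c * delta)
    S = np.outer(dG / (2.0 * c), 1.0 / delta)
    return 0.5 * (S + S.T)

def fisher_estimate(make_pseudo_data, loglik_given_data, theta, N, M, c, c_tilde, rng):
    """Monte Carlo information-matrix estimate of eq. (2.6) (no feedback)."""
    p = theta.size
    total = np.zeros((p, p))
    for i in range(N):
        Z = make_pseudo_data(theta, rng)           # Z_pseudo(i), a sample of size n
        for k in range(M):
            total += sp_hessian_from_loglik(lambda t: loglik_given_data(t, Z),
                                            theta, c, c_tilde, rng)
    return -total / (N * M)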
3. Characterization of Error in Hessian Estimate

This section provides a few key facts about $\hat H_k^{(i)}$ as used in the averaging process for the estimation of the information matrix (eqn. (2.6)). We will use these facts in Sections 4 and 5 to show how the accuracy can be improved relative to the basic averaging process in (2.6) and (2.7). The probabilistic big-O terms appearing below are to be interpreted in the almost sure (a.s.) sense (e.g., $O(c^2)$ denotes a function that is a.s. bounded when divided by $c^2$ as $c \to 0$); all associated equalities hold a.s.

One of the key ways in which the error in the Fisher estimate will be reduced is through the use of feedback, and we now present an expression for the error in the Hessian estimates that is useful in creating the feedback term. From Spall (2006), it is known that $\hat H_k^{(i)}$ in (2.4) can be decomposed into three parts:

$\hat H_k^{(i)} = H(\theta) + \Psi_k^{(i)}\big(H(\theta)\big) + O(c^2)$,   (3.1)

where $\Psi_k^{(i)}(H(\theta))$ is a p × p matrix of terms dependent on $H(\theta)$, $\Delta_k^{(i)}$, and, when only L values are available, the additional perturbation vector $\tilde\Delta_k^{(i)}$ (see Spall, 2006). Note that $\Psi_k^{(i)}$ represents the error due to the simultaneous perturbations ($\Delta_k^{(i)}$ and, if relevant, $\tilde\Delta_k^{(i)}$). The specific form and notation for the $\Psi_k^{(i)}$ term depend on whether $L(\theta)$ values or $g(\theta)$ values are available, represented as $\Psi_k^{(i)} = \Psi_k^{(i)(L)}$ or $\Psi_k^{(i)} = \Psi_k^{(i)(g)}$, respectively, in the notation below. Finally, the big-O error is a reflection of the bias in the Hessian estimate; in the case where only L values are available, the $O(c^2)$ bias assumes that $\tilde c/c$ is O(1) as $c \to 0$.

Let us define $D_k^{(i)} \equiv \Delta_k^{(i)}\big(\Delta_k^{(i)\,-1}\big)^T - I_p$ and $\tilde D_k^{(i)} \equiv \tilde\Delta_k^{(i)}\big(\tilde\Delta_k^{(i)\,-1}\big)^T - I_p$, where $I_p$ is the p × p identity matrix (note that $D_k^{(i)}$ and $\tilde D_k^{(i)}$ are symmetric when the perturbations are i.i.d. Bernoulli ±1 distributed). Given $D_k^{(i)}$ and $\tilde D_k^{(i)}$ above, it is shown in Spall (2006) that, with the $G_k^{(i)}(\theta \pm c\Delta_k^{(i)})$ formed from L measurements only (see (2.5)),

$\Psi_k^{(i)}(H) = \frac{1}{2}\Big[\tilde D_k^{(i)T} H D_k^{(i)} + \tilde D_k^{(i)T} H + H D_k^{(i)} + D_k^{(i)T} H \tilde D_k^{(i)} + H \tilde D_k^{(i)} + D_k^{(i)T} H\Big]$,   (3.2)

while with the $G_k^{(i)}(\theta \pm c\Delta_k^{(i)})$ formed from direct g values,

$\Psi_k^{(i)}(H) = \frac{1}{2}\Big(H D_k^{(i)} + D_k^{(i)T} H\Big)$.   (3.3)

4. Implementation with Independent Perturbation per Measurement

Let us assume in this section that the n data vectors entering each $Z_{\text{pseudo}}(i)$ are mutually independent (analogous to the real data $z_1, z_2, \ldots, z_n$ being mutually independent). The basic structure in Section 2 can be improved by exploiting this independence. In particular, the variance of the elements of the individual Hessian estimates $\hat H_k^{(i)}$ can be reduced by decomposing $\hat H_k^{(i)}$ into a sum of n independent estimates, each corresponding to one of the data vectors. A separate perturbation vector can then be applied to each of the independent estimates, which produces variance reduction in the resulting estimate $\hat F_{M,N}$: the independent perturbations reduce the variance of the elements in the estimate of $F(\theta)$ from the $O(1/N)$ rate in Spall (2005) to $O(1/(nN))$. The complete version of the paper, available upon request, includes these results.

5. Feedback-based Method for F(θ) Estimation

Aside from the independent perturbation idea of Section 4, variance reduction is possible by using the current estimate of $F(\theta)$ to reduce the variance of the next Hessian estimate; this, in turn, reduces the variance of the next estimate for $F(\theta)$. This feedback idea applies whether or not the independent perturbations of Section 4 are used. The idea here is in the spirit of Spall (2006), but the context is different in that the primary quantity of interest is $F(\theta)$, not the Hessian matrix. In essence, the variance reduction below follows from the fundamental property that $\mathrm{var}(X) \le E(X^2)$ for any random variable X having a finite second moment.
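To make the feedback construction below concrete, here is a small sketch of the error matrices as written in (3.2) and (3.3) above (names are illustrative; Bernoulli ±1 perturbations are assumed, so each perturbation equals its elementwise inverse):

import numpy as np

def D_matrix(delta):
    """D = Delta (Delta^{-1})^T - I_p (symmetric for Bernoulli +/-1 perturbations)."""
    return np.outer(delta, 1.0 / delta) - np.eye(delta.size)

def psi_gradient_case(H, delta):
    """Per-estimate error Psi(H) of eq. (3.3): gradient (score) values available."""
    D = D_matrix(delta)
    return 0.5 * (H @ D + D.T @ H)

def psi_loglik_case(H, delta, delta_tilde):
    """Per-estimate error Psi(H) of eq. (3.2): only log-likelihood values available."""
    D, Dt = D_matrix(delta), D_matrix(delta_tilde)
    return 0.5 * (Dt.T @ H @ D + Dt.T @ H + H @ D +
                  D.T @ H @ Dt + H @ Dt + D.T @ H)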

Let $\tilde F_{M,N}$ denote the feedback-based form of the estimate introduced in this section; $\tilde F_{M,N}$ is a direct replacement for the $\hat F_{M,N}$ of Sections 2 and 4. Using the basic recursive form in (2.7) as the starting point, the feedback-based estimate of the information matrix in recursive (in i) form is

$\tilde F_{M,i}(\theta) = \frac{i-1}{i}\,\tilde F_{M,i-1}(\theta) - \frac{1}{iM}\sum_{k=1}^{M}\Big[\hat H_k^{(i)} - \Psi_k^{(i)}\big({-\tilde F_{M,i-1}(\theta)}\big)\Big]$,   (5.1)

where $\Psi_k^{(i)} = \Psi_k^{(i)(L)}$ or $\Psi_k^{(i)} = \Psi_k^{(i)(g)}$ from (3.2) or (3.3), as appropriate, and $\tilde F_{M,0} = 0$. Note that the current best estimate of the mean Hessian $E[H(\theta)] = -F(\theta)$ (i.e., $-\tilde F_{M,i-1}$) is used in place of the unknown mean Hessian when evaluating $\Psi_k^{(i)}(\cdot)$.
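A sketch of this recursion, reusing the psi_* helpers above, is given below; the interface of hessian_and_perturbations, which is assumed to return each Hessian estimate together with the perturbations that produced it (so that the error term can be re-evaluated at the current estimate), is chosen here purely for illustration:

import numpy as np

def fisher_estimate_feedback(hessian_and_perturbations, p, N, M):
    """Feedback form of eq. (5.1).  hessian_and_perturbations(i, k) returns
    (H_hat, delta, delta_tilde); delta_tilde is None for gradient-based estimates."""
    F = np.zeros((p, p))                          # F_tilde_{M,0} = 0
    for i in range(1, N + 1):
        inner = np.zeros((p, p))
        for k in range(M):
            H_hat, delta, delta_tilde = hessian_and_perturbations(i, k)
            # Psi is evaluated at -F, the current best estimate of the mean Hessian.
            if delta_tilde is None:
                psi = psi_gradient_case(-F, delta)
            else:
                psi = psi_loglik_case(-F, delta, delta_tilde)
            inner += H_hat - psi
        F = (i - 1) / i * F - inner / (i * M)
    return F

In this sketch the only cost beyond (2.7) is a few matrix multiplications per Hessian estimate, since each correction term is recomputed from the stored perturbations.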
The main result related to improved accuracy is Theorem 1 (and its Corollary 1) below. Because of the frequent need to refer to the specific θ of interest, as well as expansions of functions around that point, we let $\theta^*$ represent the particular θ at which we wish to determine $F(\theta)$; also, for convenience, let $F^* \equiv F(\theta^*)$. As a first step towards analyzing the variance reduction due to feedback, we show in the Lemma below that $\tilde F_{M,N} \to F^* + O(c^2)$ in mean square as $N \to \infty$ (fixed M). The results below apply when $\hat H_k^{(i)}$ is formed from either $L(\theta)$ values (alone) or gradient (score) values $g(\theta)$, and/or when the independent perturbation idea of Section 4 is used. Following the previously established notation for the third derivatives of L, let $L''''(\theta)$ denote the $1 \times p^4$ row vector of all possible fourth derivatives of L. Vector and matrix norms $\|\cdot\|$ are the standard Euclidean norm; in the matrix case this corresponds to the Frobenius norm, $\|A\| = \big(\sum_i \sum_j a_{ij}^2\big)^{1/2}$, where the $a_{ij}$ are the components of A.

Lemma 1. For some open neighborhood of $\theta^*$, suppose that $L''''(\theta)$ exists continuously (a.s. in Z) and is uniformly bounded in magnitude. Further, let $E\big(\|\hat H_k^{(i)}\|^2\big) < \infty$ (recall that the $\hat H_k^{(i)}$ are identically distributed for all k and i). Then, given the basic conditions on $\Delta_k^{(i)}$ in Section 2 (applying also to $\tilde\Delta_k^{(i)}$ when only L values are used to form $\hat H_k^{(i)}$), $E\big(\|\tilde F_{M,N} - F^* - B(\theta^*)\|^2\big) \to 0$ as $N \to \infty$ for any fixed M and all c sufficiently small, where $B(\theta^*)$ is a bias matrix satisfying $B(\theta^*) = O(c^2)$.

Proof. Due to space limitations, we do not include the proof of the Lemma here; the proof is available upon request. A similar proof is in Spall (2006).

We are now in a position to establish that feedback reduces the asymptotic mean-squared error of the estimate for the information matrix. In the proofs of Theorem 1 and Corollary 1, we use the mean-squared convergence result of the Lemma to establish the main result on improved accuracy. Note that Theorem 1 and Corollary 1 only consider the case p ≥ 2, because at p = 1 we have $\tilde F_{M,N} = \hat F_{M,N}$ due to the errors in (3.2) and (3.3) being identically zero.

Theorem 1. Suppose that the conditions of the Lemma hold, p ≥ 2, $E\big(\|H(\theta^*)\|^2\big) < \infty$, and $F^* \neq 0$. Further, suppose that, for some $\delta_1 > 0$ and $\delta_2 > 0$ with $(1+\delta_1)^{-1} + (1+\delta_2)^{-1} \le 1$, $E\big(\|L'''(\theta)\|^{2(1+\delta_1)}\big)$ is uniformly bounded in magnitude for all θ in an open neighborhood of $\theta^*$, $E\big(|\Delta_{kj}^{(i)}|^{\pm 2(1+\delta_2)}\big) < \infty$, and, when only L values are used, $E\big(|\tilde\Delta_{kj}^{(i)}|^{\pm 2(1+\delta_2)}\big) < \infty$ (arbitrary i, j, and k by the i.i.d. assumption on the perturbation elements). Then the accuracy of $\tilde F_{M,N}$ is greater than the accuracy of $\hat F_{M,N}$ in the sense that

$\displaystyle \lim_{N\to\infty} \frac{E\big\|\tilde F_{M,N} - F^*\big\|^2}{E\big\|\hat F_{M,N} - F^*\big\|^2} \le 1 + O(c^2)$.   (5.2)

Proof. Let $\hat f$, $\tilde f$, and $f^*$ denote corresponding (arbitrary) scalar elements of $\hat F_{M,N}$ (the no-feedback estimate), $\tilde F_{M,N}$, and $F^*$, respectively; that is, $\hat f$ and $\tilde f$ are estimates of one of the $p(p+1)/2$ unique elements of the information matrix. Further, let $\psi_i$, $\hat\psi_i$, and $\bar\psi_i$ be the corresponding scalar elements of the error matrices $\Psi^{(i)}\big({-\tilde F_{M,i-1}(\theta^*)}\big)$, $\Psi^{(i)}\big(H(\theta^*)\big)$, and $\Psi^{(i)}\big(\bar H\big)$, where $\bar H \equiv E[H(\theta^*)] = -F^*$ (each corresponding to the component of the information matrix being estimated by $\hat f$ and $\tilde f$).

Only $\Psi^{(i)}\big({-\tilde F_{M,i-1}}\big)$ is actually computed in the algorithm; however, the other two errors, $\Psi^{(i)}\big(H(\theta^*)\big)$ and $\Psi^{(i)}\big(\bar H\big)$, play a key role in the proof below (recall that $H(\theta^*)$, in general, depends on $Z_{\text{pseudo}}(i)$). Also, let $\hat h_i$ be the scalar element of $\hat H^{(i)}$ corresponding to the component of $\hat F_{M,N}$ and $\tilde F_{M,N}$ being estimated by $\hat f$ and $\tilde f$, respectively, and let $h_i$ be the corresponding element of $H(\theta^*)$ at $Z_{\text{pseudo}}(i)$.

Because (5.2) is based on the Frobenius norm, it is sufficient to show that $\lim_{N\to\infty} E\big[(\tilde f - f^*)^2\big]\big/E\big[(\hat f - f^*)^2\big] \le 1 + O(c^2)$ for all of the $p(p+1)/2$ unique elements. As in the proof of the Lemma, without loss of generality, take M = 1. From (5.1), $\tilde f = -\frac{1}{N}\sum_{i=1}^{N}(\hat h_i - \psi_i)$. Then, from the fact that $E(\hat h_i^2) < \infty$ (with $E(\hat h_i^2) = E(\hat h_j^2)$ for all i and j) and the fact that $\tilde F_{1,i}$ converges in mean square to $F^* + B(\theta^*)$ (the Lemma), it is known that the mean-squared error $E\big[(\tilde f - f^*)^2\big]$ exists for all N.

Thus, by the standard bias-variance decomposition of mean-squared error (e.g., Spall, 2003, p. 333),

$E\big[(\tilde f - f^*)^2\big] = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{var}(\hat h_i - \psi_i) + O(c^4)$,   (5.3)

where the summation follows from the uncorrelatedness of the summands $\hat h_i - \psi_i$, which in turn follows by the independence across i of the $\Delta^{(i)}$ (and of the $\tilde\Delta^{(i)}$ when the estimates are based only on measurements of L), and the $O(c^4)$ term follows from the fact that the elements of the bias matrix $B(\theta^*)$ are $O(c^2)$ (the Lemma). Analogously, in the no-feedback case (eqns. (2.6) and (2.7)),

$E\big[(\hat f - f^*)^2\big] = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{var}(\hat h_i) + O(c^4)$   (5.4)

(the specific analytical form of the squared-bias contribution $O(c^4)$ is identical in both the feedback and no-feedback cases). From expressions (5.3) and (5.4) for the mean-squared errors, it is clear that the relative errors in the feedback and no-feedback cases are determined by analyzing the relative behavior of $\mathrm{var}(\hat h_i - \psi_i)$ and $\mathrm{var}(\hat h_i)$. It can then be shown that

$\mathrm{var}(\hat h_i - \psi_i) = \mathrm{var}(h_i + \hat\psi_i - \psi_i) + O(c^2)$,   (5.5)

$\mathrm{var}(\hat h_i) = \mathrm{var}(h_i + \hat\psi_i) + O(c^2)$,   (5.6)

where the $O(c^2)$ contributions arise from the remainder term in (3.1), say $e_i$ for the element in question. In the case of direct g values, $e_i$ is relatively simple: its numerator is a sum of elements of $L'''(\theta)$ (evaluated near $\theta^*$) times terms within $\Delta^{(i)}$, over a denominator consisting of one term from $\Delta^{(i)}$. The arguments above then follow analogously, leading to the relationships in (5.5) and (5.6).

Note that $h_i$ and $\hat\psi_i - \psi_i$ are uncorrelated, and $h_i$ and $\hat\psi_i$ are uncorrelated (the uncorrelatedness at each i follows from the form of $\Psi^{(i)}$ and the independence of $\Delta^{(i)}$, $Z_{\text{pseudo}}(i)$, and, when the estimates are based only on measurements of L, $\tilde\Delta^{(i)}$). Then, the variance terms on the right-hand sides of (5.5) and (5.6) satisfy

$\mathrm{var}(h_i + \hat\psi_i - \psi_i) = \mathrm{var}(h_i) + \mathrm{var}(\hat\psi_i - \psi_i)$,   (5.7)

$\mathrm{var}(h_i + \hat\psi_i) = \mathrm{var}(h_i) + \mathrm{var}(\hat\psi_i)$.   (5.8)

As a vehicle towards characterizing the relative values of $\mathrm{var}(\hat h_i - \psi_i)$ and $\mathrm{var}(\hat h_i)$ via (5.5)-(5.8), we now show that the second term on the right-hand side of (5.7) satisfies $\mathrm{var}(\hat\psi_i - \psi_i) \to \mathrm{var}(\hat\psi_i - \bar\psi_i) + O(c^2)$ as $i \to \infty$. Note that

$\mathrm{var}(\hat\psi_i - \bar\psi_i) = E\big[(\hat\psi_i - \bar\psi_i)^2\big] = E(\hat\psi_i^2) - E(\bar\psi_i^2)$,   (5.9)

where the first equality follows from $E(\hat\psi_i) = E(\bar\psi_i) = 0$ and the second equality follows by the independence of $\Delta^{(i)}$, $Z_{\text{pseudo}}(i)$, and (when the estimates are based only on measurements of L) $\tilde\Delta^{(i)}$ at each i. Hence, from the Lemma,

$\displaystyle \lim_{i\to\infty}\Big[\mathrm{var}(\hat\psi_i - \psi_i) - \mathrm{var}(\hat\psi_i - \bar\psi_i)\Big] = O(c^2)$,   (5.10)

as desired. Relative to the right-hand side of (5.8),

$\mathrm{var}(\hat\psi_i) = E(\hat\psi_i^2)$,   (5.11)

implying from (5.9) that $\mathrm{var}(\hat\psi_i - \bar\psi_i) \le \mathrm{var}(\hat\psi_i)$ (the inequality is strict when $E(\bar\psi_i^2) \neq 0$; see Corollary 1). Therefore, by the principle of Cesàro summability (e.g., Apostol, 1974, Theorem 8.48), it is known from (5.5)-(5.11) that

$\displaystyle \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\Big[\mathrm{var}(\hat h_i) - \mathrm{var}(\hat h_i - \psi_i)\Big] = \mathrm{var}(\hat\psi_1) - \mathrm{var}(\hat\psi_1 - \bar\psi_1) + O(c^2)$,   (5.12)

where $\mathrm{var}(\hat\psi_1) - \mathrm{var}(\hat\psi_1 - \bar\psi_1) = E(\bar\psi_1^2) \ge 0$ by (5.9) and (5.11) (it is sufficient to consider only the first values, $\hat\psi_1$ and $\bar\psi_1$, in the limiting variances due to the identical distributions of the $\hat\psi_i$ and $\bar\psi_i$).

To establish the result to be proved, we need to ensure that the indicated limits exist. From (5.3), (5.4), and (5.12) (which show that the numerator and denominator in (5.2) have the same $O(1/N)$ convergence rate to unique limits), the limit in (5.2) exists if $\mathrm{var}(\hat h_1) > 0$ for at least one of the $p(p+1)/2$ elements being estimated (for all sufficiently small c). From (5.6) and (5.8), $\mathrm{var}(\hat h_1) = \mathrm{var}(h_1) + \mathrm{var}(\hat\psi_1) + O(c^2)$. In the case where $H(\theta^*)$ is deterministic, we know from the error decompositions in (3.2) and (3.3) that $\mathrm{var}(\hat\psi_1) > 0$ for at least one element, because $H(\theta^*) = \bar H = -F^*$ and it is assumed that $F^* \neq 0$.
In the case where $H(\theta^*)$ is random (i.e., at least one element is a non-degenerate random variable), the variances $\mathrm{var}(h_1)$ for the non-degenerate elements exist (by the assumption $E\big(\|H(\theta^*)\|^2\big) < \infty$) and satisfy $\mathrm{var}(h_1) > 0$. Hence, by (5.6) and (5.8), $\mathrm{var}(\hat h_1) > 0$ for at least one element. Finally, for any element such that $\mathrm{var}(\hat h_1) > 0$, we know by (5.3), (5.4), (5.5), (5.6), and (5.12) that

$\displaystyle \lim_{N\to\infty}\frac{E\big[(\tilde f - f^*)^2\big]}{E\big[(\hat f - f^*)^2\big]} = \frac{\mathrm{var}(h_1) + \mathrm{var}(\hat\psi_1 - \bar\psi_1) + O(c^2)}{\mathrm{var}(h_1) + \mathrm{var}(\hat\psi_1) + O(c^2)} \le 1 + O(c^2)$.   (5.13)

The result to be proved in (5.2) then follows from the fact that the results in (5.5)-(5.12) show that any elements with $\mathrm{var}(\hat h_1) = 0$ do not contribute to the norms in either the numerator or the denominator. Q.E.D.

Corollary 1 establishes conditions for strict inequality in (5.2). The proof rests on the following fact: the solution matrix X to the equation AX + XB = 0 is unique (X = 0) if and only if the square matrices A and −B have no eigenvalues in common (Lancaster and Tismenetsky, 1985).

Corollary 1 (to Theorem 1). Suppose that the conditions of Theorem 1 hold, rank($F^*$) ≥ 2, and that the elements of $\Delta_k^{(i)}$ and $\tilde\Delta_k^{(i)}$ are generated according to the Bernoulli ±1 distribution. Then the inequality in (5.2) is strict (i.e., < instead of ≤).

Proof. From (5.12) and the definition of the Frobenius norm, it is sufficient to show that the inequality in (5.13) is strict for at least one scalar element. From (5.11), this strict inequality follows if $\mathrm{var}(\hat\psi_i - \bar\psi_i) < \mathrm{var}(\hat\psi_i)$, which, as noted below (5.11), holds when $E(\bar\psi_i^2) \neq 0$, i.e., when $\bar\psi_i$ is not 0 a.s. For both the case of L measurements and the case of g measurements, we now establish that $\bar\psi_i$ cannot be 0 a.s. for at least one scalar element.

Let rank($\bar H$) = r ≤ p (r ≥ 2 by assumption), and without loss of generality assume that the elements of θ are ordered such that the upper-left r × r block of $\bar H$ is full rank; in the arguments below, a subscript r × r denotes the upper-left r × r block of the indicated matrix. Let $\check D^{(i)} \equiv E\big[D^{(i)} \mid \Delta_1^{(i)},\ldots,\Delta_r^{(i)}\big]$ denote the conditional mean of $D^{(i)}$ given the first r components of $\Delta^{(i)}$; because $D^{(i)}$ has mean 0, the indicated conditioning leaves only the upper-left block, i.e., $\check D^{(i)}$ has $D^{(i)}_{r\times r}$ in its upper-left r × r block and zeros elsewhere. For the case of L measurements, we then know from (3.2) that

$E\big[\Psi^{(i)}(\bar H) \mid \Delta_1^{(i)},\ldots,\Delta_r^{(i)}\big] = \frac{1}{2}\big(\bar H\,\check D^{(i)} + \check D^{(i)}\,\bar H\big)$,

where the equality uses the symmetry of $\bar H$ and of $D^{(i)}$ (the latter due to the assumed Bernoulli distribution for the perturbations) and the fact that $\tilde D^{(i)}$ has mean 0 and is independent of $D^{(i)}$. Because $\bar H_{r\times r}$ and $-\bar H_{r\times r}$ have no eigenvalues in common (the former is negative definite, since $F^*$ is positive semi-definite with full-rank upper-left r × r block, and the latter is therefore positive definite), the above-mentioned result of Lancaster and Tismenetsky (1985), applied to the upper-left r × r block of the display above, indicates that $E\big[\Psi^{(i)}(\bar H) \mid \Delta_1^{(i)},\ldots,\Delta_r^{(i)}\big] = 0$ (a necessary condition for $\Psi^{(i)}(\bar H) = 0$ a.s.) if and only if $D^{(i)}_{r\times r} = 0$. This cannot happen, since $D^{(i)}_{r\times r}$ takes on $2^{r-1}$ unique values (r ≥ 2), none of which is 0.

Likewise, for the case of g measurements, we know from (3.3) that $E\big[\Psi^{(i)}(\bar H) \mid \Delta_1^{(i)},\ldots,\Delta_r^{(i)}\big]$ is equal to the same expression as above, and the same reasoning applies using the result of Lancaster and Tismenetsky (1985). We thus know that $\bar\psi_i$ cannot be 0 a.s. for at least one element of $\Psi^{(i)(L)}(\bar H)$ or $\Psi^{(i)(g)}(\bar H)$ (as relevant), completing the proof. Q.E.D.

Final remarks and acknowledgment: A numerical study has been performed on the feedback idea in Section 5, showing strong improvement in the quality of the estimate for the problem setting considered in Spall (2005); details are available upon request. This work was partially supported by U.S. Navy Contract N00024-03-D-6606.

References

Apostol, T. M. (1974), Mathematical Analysis (2nd ed.), Addison-Wesley, Reading, MA.

Bickel, P. J. and Doksum, K. (1977), Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco.

Das, S., Spall, J. C., and Ghanem, R. (2007), "An Efficient Calculation of Fisher Information Matrix: Monte Carlo Approach Using Prior Information," Proceedings of the IEEE Conference on Decision and Control, 12-14 December 2007, New Orleans, LA, USA.

Efron, B. and Hinkley, D. V. (1978), "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed versus Expected Fisher Information (with discussion)," Biometrika, vol. 65, pp. 457-487.
Lancaster, P. and Tismenetsky, M. (1985), The Theory of Matrices (2nd ed.), Academic Press, New York.

Levy, L. J. (1995), "Generic Maximum Likelihood Identification Algorithms for Linear State Space Models," Proceedings of the Conference on Information Sciences and Systems, March 1995, Baltimore, MD, pp. 659-667.

Ljung, L. (1999), System Identification: Theory for the User (2nd ed.), Prentice Hall PTR, Upper Saddle River, NJ.

Segal, M. and Weinstein, E. (1988), "A New Method for Evaluating the Log-Likelihood Gradient (Score) of Linear Dynamic Systems," IEEE Transactions on Automatic Control, vol. 33, pp. 763-766.

Spall, J. C. (1992), "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation," IEEE Transactions on Automatic Control, vol. 37, pp. 332-341.

Spall, J. C. (2000), "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method," IEEE Transactions on Automatic Control, vol. 45, pp. 1839-1853.

Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, Hoboken, NJ.

Spall, J. C. (2005), "Monte Carlo Computation of the Fisher Information Matrix in Nonstandard Settings," Journal of Computational and Graphical Statistics, vol. 14, pp. 889-909.

Spall, J. C. (2006), "Convergence Analysis for Feedback- and Weighting-Based Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm," Proceedings of the IEEE Conference on Decision and Control, 13-15 December 2006, San Diego, CA, pp. 5669-5674.

Wilks, S. S. (1962), Mathematical Statistics, Wiley, New York.