Lecture 10: Parameter Estimation for ODE Models


1 Computational Biology Group (CoBI), D-BSSE, ETHZ. Lecture 10: Parameter Estimation for ODE Models. Prof. Dagmar Iber, PhD DPhil. BSc Biotechnology, 2016/17.

2 Contents: 1 Review of Previous Lectures; 2 ODE Systems; 3 Identifiability; 4 Optimization Algorithms; 5 Quality of the Estimate

3 Literature: D.S. Sivia, Data Analysis, OUP. Jaqaman & Danuser, Linking data to models: data regression. Nat Rev Mol Cell Biol (2006) 7.

4 Review of Previous Lectures

5 Bayes' Theorem for Parameter Estimation
According to Bayes' theorem,
$\mathrm{prob}(X|D,I) = \frac{\mathrm{prob}(D|X,I)\,\mathrm{prob}(X|I)}{\mathrm{prob}(D|I)}$
$\mathrm{prob}(X|D,I)$: the posterior probability density function (pdf) that we want to determine.
$\mathrm{prob}(D|X,I)$: the likelihood function.
$\mathrm{prob}(X|I)$: the prior pdf, which reflects our knowledge about the system.
$\mathrm{prob}(D|I)$: the evidence, i.e. the likelihood of the data based on our knowledge. Here one could incorporate knowledge about the quality of different experimental techniques or experimental groups.

6 Maximum likelihood estimate
$\mathrm{prob}(X|D,I) \propto \mathrm{prob}(D|X,I)$ (1)
posterior pdf $\propto$ likelihood function.
Maximum likelihood estimate: our best estimate $X_0$, given by the maximum of the posterior, is equivalent to the solution that yields the greatest value for the probability of the observed data.

7 Assume a Gaussian Process for Data Acquisition
We can then write
$\mathrm{prob}(D_k|X,I) = \frac{1}{\sigma_k\sqrt{2\pi}}\exp\left(-\frac{(F_k - D_k)^2}{2\sigma_k^2}\right)$. (2)
We can rewrite this equation as
$\mathrm{prob}(D|X,I) \propto \exp\left(-\frac{\chi^2}{2}\right)$ with $\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{F_k - D_k}{\sigma_k}\right)^2$.
Residuals: the $R_k = \frac{F_k - D_k}{\sigma_k}$ are referred to as residuals.

8 Least-squares estimate
Take the logarithm of the posterior pdf:
$L = \ln(\mathrm{prob}(X|D,I)) = \mathrm{const} - \frac{\chi^2}{2}$. (3)
Since the maximum of the posterior occurs where $\chi^2$ is smallest, the corresponding optimal solution $X_0$ is called the least-squares estimate.
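As a minimal illustration of this least-squares idea (not from the slides; the data values below are invented), the sketch computes $\chi^2$ for a constant model $F_k = X$ and shows that its minimum coincides with the inverse-variance-weighted mean:

```python
import numpy as np

# Hypothetical measurements D_k with error bars sigma_k (made-up numbers).
D = np.array([1.2, 0.9, 1.1, 1.4])
sigma = np.array([0.1, 0.2, 0.1, 0.3])

def chi2(X):
    """chi^2 = sum_k ((F_k - D_k)/sigma_k)^2 for the constant model F_k = X."""
    R = (X - D) / sigma                      # residuals R_k
    return np.sum(R**2)

# Least-squares estimate from a grid scan ...
X_grid = np.linspace(0.5, 2.0, 1001)
X0_numeric = X_grid[np.argmin([chi2(X) for X in X_grid])]

# ... which agrees with the analytic solution of d(chi^2)/dX = 0,
# the inverse-variance-weighted mean.
w = 1.0 / sigma**2
X0_analytic = np.sum(w * D) / np.sum(w)
print(X0_numeric, X0_analytic)
```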

9 A simple ODE Model Problem

10 Example: Fitting parameters in an ODE model
We now consider a simple ODE of the form
$\frac{dy}{dt} = f(y) = -py, \quad y(0) = 1$. (4)
We have time-dependent data $Y_k$ with standard deviation $\sigma_k$ at time points $t_k$. We seek to estimate the decay rate $p$. As before, we write $L = \mathrm{const} - \frac{\chi^2}{2}$ with
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$. (5)

11 Example: Fitting parameters in an ODE model
We now seek to maximise the likelihood
$L = \mathrm{const} - \frac{\chi^2}{2}$, (6)
which is equivalent to minimising
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$. (7)
As discussed in the previous lecture, we can use the Newton-Raphson algorithm: given an initial estimate $p = p_1$, we iterate according to
$p_{N+1} = p_N - [\nabla\nabla L]^{-1}\,\nabla L$, (8)
where the RHS is evaluated at $p_N$, the estimate after $N-1$ iterations.

12 Example: Fitting parameters in an ODE model
By differentiating
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$ (9)
with respect to $p$ we obtain (ignoring second-order derivatives)
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$ (10)
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$. (11)
$\nabla\nabla L < 0$, so that we indeed obtain the maximal likelihood.

13 Example: Fitting parameters in an ODE model
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
So how can we determine the sensitivities $S_k = \frac{\partial y(t_k)}{\partial p}$? We notice that
$\frac{d}{dt}\frac{dy}{dp} = \frac{d}{dp}\frac{dy}{dt} = \frac{df(y)}{dp} = \frac{\partial f(y)}{\partial y}\frac{dy}{dp} + \frac{\partial f(y)}{\partial p} = J\frac{dy}{dp} + \frac{\partial f(y)}{\partial p}$ (12)

14 Example: Fitting parameters in an ODE model
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
$\frac{d}{dt}\left(\frac{dy}{dp}\right) = J\frac{dy}{dp} + \frac{\partial f(y)}{\partial p}$
The Jacobian $J$ and $\frac{\partial f(y)}{\partial p}$ can be calculated ahead of time, and the entire equation can then be integrated alongside the integration of the model ODE.

15 Example: Fitting parameters in an ODE model
$\frac{dy}{dt} = f(y) = -py; \quad y(0) = 1$
$\frac{dS}{dt} = JS + \frac{\partial f(y)}{\partial p} = -pS - y; \quad S(0) = 0$
with Jacobian $J = \frac{\partial f(y)}{\partial y} = -p$, $\frac{\partial f(y)}{\partial p} = -y$, and sensitivity $S = \frac{\partial y}{\partial p}$.
OPTIMIZATION:
$p_{N+1} = p_N - [\nabla\nabla L]^{-1}\,\nabla L$
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
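The following Python sketch (not from the lecture; the data, noise level, and iteration counts are invented, and the signs follow the reconstruction above) integrates the model and sensitivity equations together and applies the Newton-Raphson update for the single parameter $p$:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Synthetic data: hypothetical noisy observations of y(t) = exp(-p_true * t).
rng = np.random.default_rng(0)
p_true, sigma = 0.7, 0.05
t_k = np.linspace(0.5, 5.0, 10)
Y_k = np.exp(-p_true * t_k) + sigma * rng.normal(size=t_k.size)

def augmented_rhs(t, z, p):
    """z = [y, S]: dy/dt = -p*y and dS/dt = J*S + df/dp = -p*S - y."""
    y, S = z
    return [-p * y, -p * S - y]

def solve(p):
    sol = solve_ivp(augmented_rhs, (0.0, t_k[-1]), [1.0, 0.0],
                    t_eval=t_k, args=(p,), rtol=1e-8, atol=1e-10)
    return sol.y[0], sol.y[1]        # y(t_k) and S(t_k) = dy/dp at t_k

p = 0.2                              # initial estimate p_1
for _ in range(20):                  # Newton-Raphson with the approximate Hessian
    y, S = solve(p)
    grad_L = -np.sum((y - Y_k) * S) / sigma**2   # nabla L
    hess_L = -np.sum(S**2) / sigma**2            # nabla nabla L (< 0)
    step = -grad_L / hess_L                      # -[nabla nabla L]^{-1} nabla L
    p += step
    if abs(step) < 1e-10:
        break

print("estimated decay rate:", p)
```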

16 Example: Fitting 2 parameters in an ODE model
We now estimate two parameter values for a simple ODE of the form
$\frac{dy}{dt} = f(y) = \rho - dy; \quad y(0) = 10$. (13)
The parameter vector is then given by $\vec{p} = [\rho, d]^T$.
OPTIMIZATION:
$\vec{p}_{N+1} = \vec{p}_N - [\nabla\nabla L]^{-1}\,\nabla L$
$\nabla L = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\begin{pmatrix} \frac{\partial y(t_k)}{\partial \rho} \\ \frac{\partial y(t_k)}{\partial d} \end{pmatrix}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\begin{pmatrix} \left(\frac{\partial y(t_k)}{\partial \rho}\right)^2 & \frac{\partial y(t_k)}{\partial \rho}\frac{\partial y(t_k)}{\partial d} \\ \frac{\partial y(t_k)}{\partial \rho}\frac{\partial y(t_k)}{\partial d} & \left(\frac{\partial y(t_k)}{\partial d}\right)^2 \end{pmatrix}$.
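To make the vector and matrix structure above concrete, here is a small numpy sketch (the sensitivity and data values are placeholders, not lecture numbers) that assembles $\nabla L$ and $\nabla\nabla L$ from precomputed sensitivities and takes one Newton step:

```python
import numpy as np

# Placeholder values at three time points t_k.
y_model = np.array([9.0, 7.5, 6.4])      # y(t_k) at the current (rho, d)
Y_data  = np.array([8.8, 7.9, 6.1])      # measurements Y_k
sigma   = np.array([0.2, 0.2, 0.2])
dy_drho = np.array([0.4, 1.1, 1.8])      # sensitivities dy(t_k)/drho
dy_dd   = np.array([-1.9, -4.8, -7.0])   # sensitivities dy(t_k)/dd

S = np.vstack([dy_drho, dy_dd])                    # 2 x K sensitivity matrix
grad_L = -S @ ((y_model - Y_data) / sigma**2)      # nabla L (2-vector)
hess_L = -(S / sigma**2) @ S.T                     # nabla nabla L (2 x 2)

p = np.array([2.0, 0.4])                           # current [rho, d]
p_next = p - np.linalg.solve(hess_L, grad_L)       # Newton-Raphson update
print(p_next)
```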

17 Example: Likelihood function dependent on parameter values
$\frac{dy}{dt} = f(y) = \rho - dy; \quad y(0) = 10$.
(Figure panels: likelihood surfaces when optimizing the IC and $d$, and when optimizing the correlated production and decay rates.)

18 Identifiability

19 Identifiability
Given the structure of a model, is it possible to uniquely estimate its unknown parameters? What experimental data are required to achieve unique parameter identification? Whether the inverse problem is solvable depends on the mathematical model, the significance of the data, and the experimental errors.

20 Global versus Local Identifiability
Global identifiability: a parameter is globally identifiable if it can be uniquely determined, given the input profile and assuming continuous and error-free data for the observables of the model.
Local identifiability: if there is a countable number of solutions, the parameter is locally identifiable.
Unidentifiability: a parameter is unidentifiable if there exist uncountably many solutions.

21 Regression instability
Unidentifiable parameters might lead to regression instability.
Regression instability: a measure of the variation of regression results in the presence of data noise. A regression is unstable if the estimates of the model parameters differ significantly when one additional data point is added to the set of input data.

22 Structural Identifiability
A model is structurally identifiable if its parameters can be uniquely estimated by fitting the model to experimental data. Structural identifiability needs to be tested BEFORE the regression. If some of the necessary data are not available, then structural identifiability analysis reveals which parameters must be eliminated a priori. Importantly, the structural identifiability of a model is independent of the regression scheme that is employed.

23 Structural Identifiability
Structural identifiability is related to the sensitivity of the process output to parameter variations.
Sensitivity analysis: a sensitivity analysis evaluates the dependence of a system's output on a certain set of model parameters. Let $m = \{m_1, m_2, \ldots, m_M\}$ be a set of $M$ measurable outputs and $p = \{p_1, p_2, \ldots, p_P\}$ the set of $P$ unknown model parameters. The sensitivity coefficients $S^{m_i}_{p_j}$ of the outputs with respect to the parameters are defined as
$S^{m_i}_{p_j} = \frac{\partial m_i}{\partial p_j}, \quad i = 1, \ldots, M, \quad j = 1, \ldots, P$. (14)

24 Structural Identifiability
Structural identifiability: a model is structurally identifiable if its sensitivity matrix
$S = \begin{pmatrix} \frac{\partial m_1}{\partial p_1} & \cdots & \frac{\partial m_1}{\partial p_P} \\ \vdots & \ddots & \vdots \\ \frac{\partial m_M}{\partial p_1} & \cdots & \frac{\partial m_M}{\partial p_P} \end{pmatrix}$ (15)
satisfies two conditions: each column has at least one large entry (i.e. each parameter has a large impact on at least one experimental measurement), and the matrix has full rank (i.e. all columns must be linearly independent, which means that the effects of the parameters on the measurements must be independent of each other).
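A minimal numpy sketch (the sensitivity values are placeholders) of how the two conditions can be checked numerically, using the column magnitudes for the first condition and the matrix rank (or, more robustly, the smallest singular value) for the second:

```python
import numpy as np

# Hypothetical M x P sensitivity matrix S_ij = dm_i/dp_j (placeholder values).
S = np.array([[ 2.1, 0.0, 1.0],
              [-0.3, 0.0, 0.5],
              [ 1.2, 0.0, 0.6]])

# Condition 1: every parameter (column) should have at least one large entry.
col_max = np.max(np.abs(S), axis=0)
print("largest entry per column:", col_max)

# Condition 2: the columns should be linearly independent (full column rank).
rank = np.linalg.matrix_rank(S)
sing_vals = np.linalg.svd(S, compute_uv=False)
print("rank:", rank, "of", S.shape[1], "| smallest singular value:", sing_vals[-1])
# Here the second column is all zeros, so the matrix is rank deficient:
# that parameter would be structurally unidentifiable from these observables.
```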

25 Comments on the Sensitivity Analysis
As the sensitivities are state- and time-dependent, the way that the system is observed has a strong impact on the calculated sensitivities. This analysis can be used to suggest new, informative experiments. However, since the sensitivities depend on the initial guess of the parameter values, and sensitivities usually depend non-linearly on parameters, the results are valid only for a specific parameter set. The sensitivity analysis may have to be carried out over a large set of biologically plausible parameter values.

26 Practical Identifiability
A parameter that is structurally identifiable may still be practically nonidentifiable. This can arise from an insufficient amount or quality of experimental data, or from a poor choice of measurement time points. It manifests in a confidence interval that is infinite, although the likelihood has a unique minimum for this parameter.
Practical nonidentifiability: a parameter is practically nonidentifiable if the likelihood-based confidence region is infinitely extended in the direction of parameter $p_i$, indicated by the likelihood staying below a desired threshold.

27 Identifiability
LHS: Structural nonidentifiability: if parameters are correlated, then only relative values can be determined for the correlated parameters, since their effects compensate for each other.
Center: Practical nonidentifiability: in a 2D plot this can be visualized as a relatively flat valley, which is infinitely extended.
RHS: Identifiable.

28 Identifiability: profile likelihood
In the case of practical nonidentifiability (panels c, d), the corresponding profile likelihood reveals that the set threshold change in $\chi^2$ is not exceeded in one direction.
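A sketch of how a profile likelihood can be computed in practice (an illustration, not code from the lecture): fix the parameter of interest on a grid, re-optimize the remaining parameters at each grid point, and compare the resulting $\chi^2$ profile against a threshold such as the 95% quantile of the $\chi^2$ distribution with one degree of freedom:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2 as chi2_dist

def profile_likelihood(chi2_fun, p_opt, index, grid):
    """chi2_fun(p) -> chi^2 for the full parameter vector p.
    p[index] is fixed at each grid value; the other parameters are re-optimized."""
    others0 = np.delete(p_opt, index)
    profile = []
    for value in grid:
        def reduced(others):
            return chi2_fun(np.insert(others, index, value))
        res = minimize(reduced, others0, method="Nelder-Mead")
        profile.append(res.fun)
        others0 = res.x                      # warm start for the next grid point
    return np.asarray(profile)

# Toy chi^2 with two strongly correlated parameters (made-up example).
def chi2_fun(p):
    a, b = p
    return (a * b - 2.0) ** 2 / 0.1 ** 2 + 0.01 * (a - 1.0) ** 2

p_opt = np.array([1.0, 2.0])
grid = np.linspace(0.5, 2.0, 31)
profile = profile_likelihood(chi2_fun, p_opt, 0, grid)
threshold = chi2_fun(p_opt) + chi2_dist.ppf(0.95, df=1)
print("grid points inside the confidence region:", grid[profile < threshold])
```

For a practically nonidentifiable parameter the profile stays below the threshold over a wide (or unbounded) range, as in this toy example.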

29 Addressing a practical nonidentifiability
By increasing the amount and quality of the measured data and/or improving the choice of measurement time points $t_i$, stricter conditions are imposed on the parameter estimation. This leads to a tightening of the confidence intervals that will ultimately remediate a practical nonidentifiability, yielding finite confidence intervals.

30 General ODE Systems

31 A general ODE Model with many parameters
Consider now a dynamical system with $N$ state variables, which we describe by a set of ordinary differential equations:
$\frac{d\vec{x}(t)}{dt} = \vec{f}(\vec{x}(t), t, \vec{k}), \quad \vec{x}(t_0) = \vec{x}_0$. (16)
$\vec{x}(t)$: vector of all state variables; $\vec{k}$: vector of all parameters; $\vec{x}_0$: vector of all initial values (e.g. initial expression levels).

32 Observables
Often, the state variables cannot be directly observed. Rather, it is combinations of state variables, or relative quantities (or possibly even more complicated functions of the state variables), that are measured in experiments.

33 Observables
We specify an observation function $g: \mathbb{R}^N \to \mathbb{R}^M$ which maps the state variables $\vec{x}$ to a set of $M$ observables,
$\vec{y}(t) = g(\vec{x}(t), \vec{s})$ (17)
We require both $f(\cdot)$ and $g(\cdot)$ to be continuously differentiable functions with respect to their parameters. The vector $\vec{s}$ comprises the parameters of the observation function. Note that we may be able to only partially observe the system, such that $M < N$.

34 Parameters
The set of problem-specific parameters $\vec{p}$ includes the initial conditions, the model parameters, and the parameters that are specific to the measurements, i.e.
$\vec{p} = \{\vec{x}_0, \vec{k}, \vec{s}\}$. (18)
The initial values of the dynamical system are also parameters, as they are usually unknown.

35 Measurement Data
We denote the measured data by $y_{ij}$. Generally, these measurements $y_{ij}$ are subject to error, i.e. they are the sum of the observables $y_j(t_i)$ and a measurement error $\epsilon_{ij}$:
$y_{ij} = y_j(t_i) + \epsilon_{ij}$
We assume that the measurement errors $\epsilon_{ij}$ are independent across all observations and all time points, and that they follow independent Gaussian distributions, i.e.
$\epsilon_{ij} \sim N(0, \sigma_{ij})$ (19)
where $\sigma_{ij}$ is the standard deviation of the measurement errors.

36 Type of Error
Due to the law of large numbers,
$\epsilon_{ij} \sim N(0, \sigma_{ij})$ (20)
applies in many practical settings. However, other distributions, in particular log-normal distributions, are also encountered, for instance when protein concentrations are low. Therefore, normality should be checked in the course of data pre-processing, and the data should be transformed appropriately in case of deviations from normality. The observation function $g(\cdot)$ can take account of this transformation.
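As an illustration of this pre-processing step (not part of the slides; the Shapiro-Wilk test is just one common choice, and the data are simulated), one might test for normality and log-transform the data if needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical low-concentration measurements with multiplicative (log-normal) noise.
true_level = 5.0
data = true_level * rng.lognormal(mean=0.0, sigma=0.4, size=50)

# Shapiro-Wilk test on the raw and on the log-transformed data.
_, p_raw = stats.shapiro(data)
_, p_log = stats.shapiro(np.log(data))
print(f"p-value raw: {p_raw:.3g}, p-value log-transformed: {p_log:.3g}")

# If the log-transformed data look Gaussian, one would fit
# log y_ij = log g_j(x(t_i)) + eps_ij, i.e. absorb the log into g(.).
```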

37 Maximum Likelihood
The optimal parameter set is the one with the highest probability of observing the data, and can be determined by maximizing the likelihood $L$ of the data $\vec{y}$ with respect to the parameter set $\vec{p}$:
$L(\vec{y}\,|\,\vec{p}) = \prod_{i=1}^{T}\prod_{j=1}^{M}\frac{1}{\sigma_{ij}\sqrt{2\pi}}\exp\left(-\frac{1}{2}\frac{(g_j(\vec{x}(t_i,\vec{p}),\vec{p}) - y_{ij})^2}{\sigma_{ij}^2}\right)$

38 Log-likelihood
In practical terms, to find the maximum of the likelihood function the negative log-likelihood is minimized:
$-\log[L(\vec{y}\,|\,\vec{p})] = \sum_{i=1}^{T}\sum_{j=1}^{M}\left(\frac{1}{2}R_{ij}(\vec{p})^2 + c_{ij}\right)$, (21)
$R_{ij}(\vec{p}) = \frac{g_j(\vec{x}(t_i,\vec{p}),\vec{p}) - y_{ij}}{\sigma_{ij}}, \quad c_{ij} = \log[\sigma_{ij}\sqrt{2\pi}]$.
$R_{ij}$ is called a residual. The term $c_{ij}$ is independent of $\vec{p}$ and can be left out of the minimization.

39 The maximum likelihood estimator
The maximum likelihood estimator for the model parameters is thus given by
$-\log[L(\vec{y}\,|\,\vec{p})] \simeq \sum_{i=1}^{T}\sum_{j=1}^{M}\frac{1}{2}R_{ij}(\vec{p})^2$. (22)
In the case of independent Gaussian measurement errors, the parameters $\vec{p}$ can therefore be determined by least-squares minimization.

40 Minimization
Common methods to minimize $-\log[L(\vec{y}\,|\,\vec{p})]$ are gradient-based, such as the classical Newton or Levenberg-Marquardt methods. They are based on an iterative parameter update procedure which computes the next update step by exploiting the gradient of the weighted residuals $R_{ij}$.

41 Gradient of the weighted residuals $R_{ij}$
$\frac{\partial R_{ij}(\vec{p})}{\partial p_l} = \frac{1}{\sigma_{ij}}\frac{d g_j(\vec{x}(t_i,\vec{p}),\vec{p})}{d p_l} = \frac{1}{\sigma_{ij}}\left(\sum_{n=1}^{N}\left.\frac{\partial g_j}{\partial x_n}\right|_{t_i}\left.\frac{d x_n}{d p_l}\right|_{t_i} + \left.\frac{\partial g_j}{\partial p_l}\right|_{t_i}\right)$ (23)
$\frac{\partial g_j}{\partial x_n}$ and $\frac{\partial g_j}{\partial p_l}$ are the Jacobians of the observation function with respect to the state variables and with respect to the parameters. $S^n_{p_l} = \frac{d x_n}{d p_l}$ are the sensitivities of the state variables to changes in the parameter values that we discussed above.

42 Calculation of the Sensitivities
In general there is no analytic solution for the trajectories $\vec{x}(t, \vec{p})$, and therefore the sensitivities $S^n_{p_l} = \frac{d x_n}{d p_l}$ have to be calculated numerically. Naively, one may approximate them by finite differences, which is also the default in many optimization software packages. However, this approximation becomes computationally very expensive in the case of a high-dimensional parameter space, as it demands many integrations of the differential equations.

43 Calculation of the Sensitivities
Alternatively, the sensitivities can be computed by integrating the sensitivity equations (as discussed above) in parallel with the ODE model:
$\frac{d S^n_{p_l}}{dt} = \frac{d}{dt}\frac{d x_n}{d p_l} = \frac{d}{d p_l}\frac{d x_n}{dt} = \frac{d f_n(t, \vec{x}(t), \vec{k})}{d p_l} = \sum_{q=1}^{N}\frac{\partial f_n}{\partial x_q}\frac{d x_q}{d p_l} + \frac{\partial f_n}{\partial p_l}$
$S^n_{p_l}(0) = \frac{d x_n}{d p_l}(0) = \begin{cases} 1 & : p_l \in \{\vec{x}_0\} \\ 0 & : p_l \in \{\vec{s}, \vec{k}\} \end{cases}$

44 Practical Considerations
Since $f(\cdot)$ and $g(\cdot)$ are specified beforehand, their respective derivatives with respect to the parameters $\vec{p}$ can also be computed beforehand, either manually or, for more complex systems, by applying symbolic computation. As a side note, the Jacobians should also be provided to the ODE solvers for the differential equation and the sensitivity equations, since these will speed up the calculations considerably.
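A sketch of this idea (an illustration under assumed model and package choices, not the lecture's own code): sympy derives $\partial f/\partial x$ and $\partial f/\partial k$ symbolically for the production-decay model used above, and the resulting functions are used to integrate the sensitivity equations alongside the ODE:

```python
import numpy as np
import sympy as sp
from scipy.integrate import solve_ivp

# Model dy/dt = rho - d*y with parameters (rho, d); derive the Jacobians symbolically.
y, rho, d = sp.symbols("y rho d")
f = sp.Matrix([rho - d * y])
J_x = f.jacobian([y])              # df/dy
J_k = f.jacobian([rho, d])         # df/drho, df/dd

f_num = sp.lambdify((y, rho, d), f, "numpy")
Jx_num = sp.lambdify((y, rho, d), J_x, "numpy")
Jk_num = sp.lambdify((y, rho, d), J_k, "numpy")

def augmented(t, z, rho, d):
    """z = [y, S_rho, S_d]; dS/dt = (df/dy) S + df/dk."""
    yv, S = z[0], z[1:].reshape(1, 2)          # 1 state x 2 parameters
    dy = f_num(yv, rho, d).ravel()
    dS = Jx_num(yv, rho, d) @ S + Jk_num(yv, rho, d)
    return np.concatenate([dy, dS.ravel()])

z0 = [10.0, 0.0, 0.0]                          # y(0) = 10, sensitivities start at 0
sol = solve_ivp(augmented, (0, 10), z0, args=(2.0, 0.4), t_eval=np.linspace(0, 10, 6))
print(sol.y)                                   # rows: y, dy/drho, dy/dd
```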

45 Practical Considerations
Parameter optimization in high-dimensional parameter spaces crucially relies on an efficient and reliable computation of the sensitivities and of the gradients of the residuals. Solving for the sensitivities by numerical integration has the advantage of obtaining the sensitivities of certain system properties in parallel with the solution of the ODE model. The limitation of the approach lies in the large number of additional ODEs that need to be solved!

46 Workflow of gradient-based minimization procedures
Initialize model system and parameters
LOOP
  Integrate ODE and sensitivity equations
  Calculate Jacobian of residuals
  Calculate residuals
  IF change in norm(residuals) < threshold
    BREAK
  ELSE
    Update parameter values using current parameter values and Jacobian
  ENDIF
ENDLOOP
Calculate fit statistics, parameter variances and confidence limits

47 Workflow of gradient-based minimization procedures
The procedure starts with an initial set of parameter values. In each cycle of the iteration, the ODE system is solved and the gradient of the weighted residuals is calculated. Based on the current parameter values and the gradient, new parameter values are computed that reduce the sum of squared residuals (the residual norm). This update step lies at the heart of gradient-based minimization functions such as the Levenberg-Marquardt procedure in MATLAB's lsqnonlin. The iteration continues until the change in the residual norm is smaller than a predefined value; usually, the (relative) change in the norm of the residuals is compared against a predefined threshold to check for convergence of the minimization procedure.
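A compact Python equivalent of this workflow (a sketch only: MATLAB's lsqnonlin is replaced by scipy.optimize.least_squares, and the model, data, and tolerances are invented) that passes the weighted residuals and a Jacobian built from the sensitivities:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
t_k = np.linspace(0.5, 10.0, 15)
sigma = 0.1
# Synthetic data from dy/dt = rho - d*y with rho = 2, d = 0.4, y(0) = 10.
Y_k = 2.0 / 0.4 + (10 - 2.0 / 0.4) * np.exp(-0.4 * t_k) + sigma * rng.normal(size=t_k.size)

def simulate(p):
    rho, d = p
    def rhs(t, z):
        y, S_rho, S_d = z
        return [rho - d * y, -d * S_rho + 1.0, -d * S_d - y]
    sol = solve_ivp(rhs, (0.0, t_k[-1]), [10.0, 0.0, 0.0], t_eval=t_k, rtol=1e-8)
    return sol.y

def residuals(p):
    y, _, _ = simulate(p)
    return (y - Y_k) / sigma                       # weighted residuals R_k

def jacobian(p):
    _, S_rho, S_d = simulate(p)
    return np.column_stack([S_rho, S_d]) / sigma   # dR_k/dp from the sensitivities

fit = least_squares(residuals, x0=[1.0, 1.0], jac=jacobian, method="lm")
print("estimate:", fit.x, "| final cost:", fit.cost)
```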

48 Optimization Algorithms
1 Local Methods
Gradient-based Methods: Newton-Raphson Iterative Algorithm; Levenberg-Marquardt
Direct, derivative-free Methods: Simplex Methods; Nelder-Mead Method; Conjugate Gradient Method
2 Global Methods
Simulated Annealing; Evolutionary Algorithms

49 Example: TGF-β Model

50 Example: Time Courses
Sensitivity analysis of the TGF-β signaling model. (A) Time course of the stimulation protocol and the system output, the nuclear Smad2/Smad4 complex. (B) Time-resolved sensitivities of the parameters controlling the dynamics of the nuclear Smad2/Smad4 complex. For details see main text. (C) Clustergram of the steady-state control coefficients.

51 Performance of different optimization procedures
The table compares the MATLAB algorithm lsqnonlin to alternative minimizers in the Systems Biology Toolbox. [Table: average computation time (cputime × 10³), % correct matches, best performance ($\chi^2$), and max. iterations for NonLinLS, tribes, Gen. Alg., Simplex, Chc, PSO, Hill, and Cmaes.]
Different optimizers were run on the same data set. The first column shows the average computation time of each method. Every algorithm was run 5 times with different initial parameter values. The second column measures the number of times the algorithm produces an acceptable result ($\chi^2 < 30$). The third column is the average $\chi^2$ value at the optimum. The last column is the maximum number of iterations we allowed for each algorithm before it was stopped. Abbreviations: NonLinLS: MATLAB lsqnonlin routine. simplex: Downhill Simplex Method in Multidimensions. cmaes: (15,50)-Evolution Strategy with Covariance Matrix Adaptation. chc: Clustering multi-start hill climbing. ga: Standard Genetic Algorithm with elitism. hill: Multi-start hill climbing. pso: Particle Swarm Optimization with constriction. tribes: adaptive Particle Swarm Optimization.

52 Performance
Depending on the complexity of the model, the minimization step itself may be only a minor contributor to the total computation time, as is the case for the TGF-β model considered here. The main computational burden is created by the need to solve the system of ODEs for the state variables and for the sensitivities many times. This is why it is important to provide the Jacobian to the ODE solver. We notice that the different algorithms do, however, differ in how well the optimal parameter set can be estimated. We recommend comparing results from different optimizers.

53 Quality of the Estimate

54 The concept of best estimate and error bars
Consider the probability density function (pdf) for the parameter $X$,
$P = \mathrm{prob}(X|D,I)$. (24)
Then the best estimate $X_0$ is given by
$\left.\frac{dP}{dX}\right|_{X_0} = 0$, (25)
assuming that $X$ is a continuous parameter. Strictly speaking, we also require
$\left.\frac{d^2P}{dX^2}\right|_{X_0} < 0$. (26)

55 Logarithm of the pdf
To evaluate the reliability of the estimate we explore the neighborhood of the best estimate $X_0$ with a Taylor series. Rather than dealing directly with the posterior pdf $P$, which is a peaky and positive function, it is better to work with its logarithm $L$, since this varies much more slowly with $X$:
$L = \ln(\mathrm{prob}(X|D,I))$. (27)

56 Reliability of the estimate
Expanding $L = \ln(\mathrm{prob}(X|D,I))$ about the point $X = X_0$, we have
$L = L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2 + \ldots$
where the best estimate of $X$ is given by the condition
$\left.\frac{dL}{dX}\right|_{X_0} = 0$, (28)
because $L$ is a monotonic function of $P$.

57 Reliability of the estimate
$L = L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2 + \ldots$
The first term $L(X_0)$ is constant and says nothing about the shape of $P(X)$. The linear term is missing because $\left.\frac{dL}{dX}\right|_{X_0} = 0$. The quadratic term is therefore the dominant factor determining the width of the posterior pdf and plays a central role in the reliability analysis.

58 Ignoring all the higher-order contributions, we have
$L \approx L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2$
$P = \mathrm{prob}(X|D,I) = \exp(L) \approx A\,\exp\left[\frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2\right]$
So the posterior pdf can be approximated by a Gaussian distribution,
$\mathrm{prob}(X|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]$
with
$\sigma^2 = -\left(\left.\frac{d^2L}{dX^2}\right|_{X_0}\right)^{-1}, \quad \mu = X_0$.
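A small numerical illustration of this result (an invented example, not from the slides): approximate $d^2L/dX^2$ at the maximum by a central finite difference and read off the error bar $\sigma = (-d^2L/dX^2)^{-1/2}$:

```python
import numpy as np

# Log-likelihood of a toy example: Gaussian data with known sigma_true,
# so the exact answer is sigma = sigma_true / sqrt(n_data).
sigma_true, data = 0.5, np.array([1.1, 0.8, 1.3, 0.9, 1.0])

def logL(X):
    return -0.5 * np.sum((data - X) ** 2) / sigma_true**2

X0 = data.mean()                      # maximum of L (flat prior)
h = 1e-4
d2L = (logL(X0 + h) - 2 * logL(X0) + logL(X0 - h)) / h**2   # second derivative
sigma_est = np.sqrt(-1.0 / d2L)

print("best estimate:", X0, "error bar:", sigma_est,
      "exact:", sigma_true / np.sqrt(data.size))
```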

59 Multi-dimensional Problems

60 Linear Models
We now consider a vector $\vec{X}$ of parameters. Assume that we have a model $f$ that is linear in the parameter values, such that
$F_k = f(\vec{X}, k) = \sum_{j=1}^{M} T_{k,j} X_j + C_k$ (29)
where the sum is over all $M$ parameters. In matrix-vector notation we then have
$\vec{F} = T\vec{X} + \vec{C}$. (30)

61 Recall
$L = \mathrm{const} - \frac{\chi^2}{2}$ with $\chi^2 = \sum_k R_k^2 = \sum_k\left(\frac{F_k - D_k}{\sigma_k}\right)^2$. (31)
We can then determine the first derivative of the likelihood function $L$ with respect to parameter $X_j$ as
$\frac{\partial L}{\partial X_j} = -\frac{1}{2}\frac{\partial\chi^2}{\partial X_j} = -\sum_{k=1}^{N}\frac{F_k - D_k}{\sigma_k^2}\frac{\partial F_k}{\partial X_j}$. (32)
Since
$F_k = f(\vec{X}, k) = \sum_{j=1}^{M} T_{k,j} X_j + C_k$ (33)
we have
$\frac{\partial F_k}{\partial X_j} = T_{kj} = \mathrm{const}$. (34)

62
$\frac{\partial L}{\partial X_j} = -\frac{1}{2}\frac{\partial\chi^2}{\partial X_j} = -\sum_{k=1}^{N}\frac{F_k - D_k}{\sigma_k^2}\frac{\partial F_k}{\partial X_j}$, (35)
$\frac{\partial F_k}{\partial X_j} = T_{kj} = \mathrm{const}$. (36)
For the second derivative we then have
$\frac{\partial^2 L}{\partial X_i\,\partial X_j} = -\sum_{k=1}^{N}\frac{T_{ki}T_{kj}}{\sigma_k^2} = \mathrm{const}$. (37)
All higher derivatives are therefore zero. We can therefore expand the likelihood function around its maximum $L_0$ as
$L = L_0 + \underbrace{\sum_i (\nabla L)_i (X_i - X_{i0})}_{=0} + \underbrace{\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}(\nabla\nabla L)_{ij}(X_i - X_{i0})(X_j - X_{0j})}_{\frac{1}{2}Q}$ (38)

63 Hessian Matrix $\nabla\nabla L$
$L = L_0 + \underbrace{\sum_i (\nabla L)_i (X_i - X_{i0})}_{=0} + \underbrace{\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}(\nabla\nabla L)_{ij}(X_i - X_{i0})(X_j - X_{0j})}_{\frac{1}{2}Q}$
Since we expand around the maximum $L_0$, the first derivative is by definition zero. But what about the second term, $Q$? For $L$ to have a maximum at $\vec{X}_0$ we, of course, require
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) < 0$ (39)
for all $\vec{X} \neq \vec{X}_0$.

64 Hessian Matrix must be negative definite
For $L$ to have a maximum at $\vec{X}_0$,
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) < 0$ (40)
for all $\vec{X} \neq \vec{X}_0$. This means that $(\nabla\nabla L)_{ij}$, the Hessian matrix of $L$, must be negative definite.
Negative definite matrix: a negative definite matrix is a Hermitian matrix all of whose eigenvalues are negative.

65 Conditions for maximum Likelihood
To evaluate $Q$ we seek to diagonalize the Hessian matrix of $L$. To diagonalize, we need to determine the eigenvalues $\lambda$ and eigenvectors $[x, y]^T$ of $(\nabla\nabla L)$. In the following we will write
$(\nabla\nabla L) = \begin{pmatrix} \frac{\partial^2 L}{\partial X_i^2} & \frac{\partial^2 L}{\partial X_i\,\partial X_j} \\ \frac{\partial^2 L}{\partial X_j\,\partial X_i} & \frac{\partial^2 L}{\partial X_j^2} \end{pmatrix} = \begin{pmatrix} A & C \\ C & B \end{pmatrix}$. (41)
Thus,
$\begin{pmatrix} A & C \\ C & B \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \lambda\begin{pmatrix} x \\ y \end{pmatrix}$. (42)
$Q < 0$ for all $(\vec{X} - \vec{X}_0)$ if $\lambda_1, \lambda_2 < 0$, i.e. if the trace of $(\nabla\nabla L)$ is negative ($A + B < 0$) and the determinant positive ($AB > C^2$).
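A quick numpy check of these conditions (example values invented): the eigenvalue test for negative definiteness agrees with the trace and determinant criterion in the 2x2 case:

```python
import numpy as np

# Hypothetical 2x2 Hessian of L at the maximum, [[A, C], [C, B]].
H = np.array([[-4.0, 1.5],
              [ 1.5, -2.0]])

eigvals = np.linalg.eigvalsh(H)            # H is symmetric, so use eigvalsh
neg_definite = np.all(eigvals < 0)

A, B, C = H[0, 0], H[1, 1], H[0, 1]
trace_det_test = (A + B < 0) and (A * B > C**2)

print("eigenvalues:", eigvals)
print("negative definite (eigenvalues):", neg_definite)
print("negative definite (trace/determinant):", trace_det_test)
```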

66 The contour of Q
Recall that we can rewrite
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) = (\vec{X} - \vec{X}_0)^T E\Lambda E^{-1}(\vec{X} - \vec{X}_0) = \sum_{j=1}^{n}\lambda_j\left(E^{-1}(\vec{X} - \vec{X}_0)\right)_j^2$ (43)
where $E$ is a square matrix whose columns are the eigenvectors of $[\nabla\nabla L]$, in any order, and $\Lambda$ is the diagonal matrix such that $\Lambda_{ii}$ is the eigenvalue associated with column $i$ of $E$.

67 Derivation
We first transform to the new coordinate system spanned by the eigenvectors, such that
$\vec{x} = E\vec{y}$ (44)
where the columns of $E$ are the normalised eigenvectors of $[\nabla\nabla L]$. Since the eigenvectors are orthogonal to each other, this means that
$(E^T E)_{ij} = \delta_{ij}$ and $(E^T[\nabla\nabla L]E)_{ij} = \lambda_j\delta_{ij}$, (45)
where $\delta_{ij} = 1$ if $i = j$ and zero otherwise.

68 Derivation
We now write $\vec{x} = (\vec{X} - \vec{X}_0) = E\vec{y}$. With this change of variables, we have
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) = \vec{x}^T[\nabla\nabla L]\vec{x} = (E\vec{y})^T[\nabla\nabla L](E\vec{y}) = \vec{y}^T(E^T[\nabla\nabla L]E)\vec{y} = \sum_{j=1}^{2}\lambda_j y_j^2$, with $\vec{y} = E^{-1}\vec{x}$. (46)
We now see that the contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$, with axes in the direction of the eigenvectors.

69 The contour of Q
Suppose we want to determine all points $\vec{x} = (\vec{X} - \vec{X}_0) = E\vec{y}$ for which $Q = k$. The contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$. The principal axes of the ellipse correspond to the eigenvectors of $[\nabla\nabla L]$. Within our quadratic approximation the posterior pdf is constant on each level line. For a given contour level ($Q = k$) the size of the ellipse is determined by the eigenvalues.

70 The contour of Q
The contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$. The principal axes of the ellipse correspond to the eigenvectors of $[\nabla\nabla L]$. Within our quadratic approximation the posterior pdf is constant on each level line. For a given contour level ($Q = k$) the size of the ellipse is determined by the eigenvalues.

71 Variance & Co-Variance
We are now interested in the variances and co-variances to determine the quality of the parameter estimate.
Variance & co-variance: the covariance between two jointly distributed real-valued random variables $X$ and $Y$ with finite second moments is defined as
$\sigma(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$, (47)
where $E[X]$ is the expected value of $X$, also known as the mean of $X$.

72 Variance & Co-Variance
By using the linearity property of expectations, this can be simplified to
$\sigma(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY - X E[Y] - E[X] Y + E[X]E[Y]] = E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] = E[XY] - E[X]E[Y]$. (48)

73 Variance & Co-Variance
For random vectors $\mathbf{X}$ and $\mathbf{Y}$ (both of dimension $m$) the $m \times m$ cross-covariance matrix (also known as dispersion matrix or variance-covariance matrix, or simply called covariance matrix) is equal to
$\sigma(\mathbf{X}, \mathbf{Y}) = E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{Y} - E[\mathbf{Y}])^T\right]$ (49)
$= E\left[\mathbf{X}\mathbf{Y}^T\right] - E[\mathbf{X}]\,E[\mathbf{Y}]^T$, (50)
where $\mathbf{m}^T$ is the transpose of the vector (or matrix) $\mathbf{m}$.

74 Variance & Co-Variance
The $(i,j)$-th element of the matrix $\sigma(\mathbf{X}, \mathbf{Y})$ is equal to the covariance $\mathrm{Cov}(X_i, Y_j)$ between the $i$-th scalar component of $\mathbf{X}$ and the $j$-th scalar component of $\mathbf{Y}$. Note that $\mathrm{Cov}(\mathbf{Y}, \mathbf{X})$ is the transpose of $\mathrm{Cov}(\mathbf{X}, \mathbf{Y})$. For a vector $\mathbf{X} = [X_1\ X_2\ \ldots\ X_m]^T$ of $m$ jointly distributed random variables with finite second moments, its covariance matrix is defined as
$\Sigma(\mathbf{X}) = \sigma(\mathbf{X}, \mathbf{X})$. (51)
Random variables whose covariance is zero are called uncorrelated.

75 Variance & Co-Variance
We first estimate the variance of $X$, $\sigma_X^2$. If we are interested only in $X$, then we can integrate out $Y$ as a nuisance parameter (writing $x = X - X_0$ and $y = Y - Y_0$):
$\mathrm{prob}(X|D,I) = \int \mathrm{prob}(X,Y|D,I)\,dy = \int \exp(L)\,dy = \int \exp\left(L_0 + \frac{Q}{2}\right)dy$
$= \int \exp\left(L_0 + \frac{1}{2}(Ax^2 + By^2 + 2Cxy)\right)dy = \exp\left(L_0 + \frac{1}{2}Ax^2\right)\int \exp\left(\frac{1}{2}(By^2 + 2Cxy)\right)dy$

76 To complete the square, we substitute
$By^2 + 2Cxy = B\left(y + \frac{Cx}{B}\right)^2 - \frac{(Cx)^2}{B}$, (52)
such that
$\mathrm{prob}(X|D,I) \propto \exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right)\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy$.
$\phi = \frac{Cx}{B}$ is an irrelevant offset, as the integral extends from minus infinity to plus infinity.

77 Recall that
$\int \exp\left(-\frac{y^2}{2\sigma^2}\right)dy = \sigma\sqrt{2\pi}$. (53)
Thus,
$\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy = \sqrt{\frac{2\pi}{-B}}$ with $\sigma^2 = -\frac{1}{B}$ (54)
(recall that $B < 0$).

78 We then have
$\mathrm{prob}(X|D,I) \propto \exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right)\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy$
$= \sqrt{\frac{2\pi}{-B}}\,\exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right) = \sqrt{\frac{2\pi}{-B}}\,\exp\left(-\frac{x^2}{2\sigma_X^2}\right)$ (55)
with
$\sigma_X^2 = -\frac{B}{AB - C^2}$. (56)

79 Variance & Co-Variance
We thus have for the variance in $X$:
$\sigma_X^2 = -\frac{B}{AB - C^2}$. (57)
$\sigma_Y^2$ can be obtained in a similar way by integrating out $X$:
$\sigma_Y^2 = -\frac{A}{AB - C^2}$. (58)
Moreover,
$\sigma_{X,Y}^2 = \frac{C}{AB - C^2}$. (59)

80 Co-variance Matrix
$\begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix} = \frac{1}{AB - C^2}\begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -\begin{pmatrix} A & C \\ C & B \end{pmatrix}^{-1} = [-\nabla\nabla L]^{-1}$
$AB - C^2$ is the determinant of $[\nabla\nabla L]$. For such a real symmetric matrix, this is equal to the product of its eigenvalues. Consequently, if either $\lambda_1$ or $\lambda_2$ becomes very small, so that the ellipse is extremely elongated in one of its principal directions, then $AB - C^2 \to 0$; consequently, apart from the special case when $C = 0$ (no co-variance), $\sigma_x$ and $\sigma_y$ will be huge.
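A short numpy illustration of this relation (the Hessian values are made up): the parameter variances and covariance follow directly from inverting $-\nabla\nabla L$, and the explicit 2x2 formulas give the same numbers:

```python
import numpy as np

# Hypothetical Hessian of L at the maximum: [[A, C], [C, B]] (negative definite).
A, B, C = -4.0, -2.0, 1.5
hess_L = np.array([[A, C],
                   [C, B]])

cov = np.linalg.inv(-hess_L)              # covariance matrix = [-nabla nabla L]^{-1}
sigma_x, sigma_y = np.sqrt(np.diag(cov))  # marginal error bars
sigma_xy = cov[0, 1]

# Same result from the explicit 2x2 formulas above.
det = A * B - C**2
print(cov)
print(sigma_x**2, -B / det, "|", sigma_y**2, -A / det, "|", sigma_xy, C / det)
```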

81 Co-variance Matrix
Consequently, if either $\lambda_1$ or $\lambda_2$ becomes very small, so that the ellipse is extremely elongated in one of its principal directions, then $AB - C^2 \to 0$; consequently, apart from the special case when $C = 0$ (no co-variance), $\sigma_x$ and $\sigma_y$ will be huge.

82 Correlation
The correlation coefficient $\rho(X, Y)$ between two random variables $X$ and $Y$ with expected values $E[X] = \mu_X$ and $E[Y] = \mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as
$\rho_{X,Y} = \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y}$. (60)

83 Weak Correlation
When the correlation is negligible, $\sigma_{XY}^2 \ll \sigma_x^2\sigma_y^2$.

84 Strong Correlation
$\begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix} = \frac{1}{AB - C^2}\begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -\begin{pmatrix} A & C \\ C & B \end{pmatrix}^{-1}$
An extreme situation is obtained for $C^2 = AB$. In that case the elliptical contours will be infinitely wide in one direction (with only the information of the prior preventing the catastrophe!) and oriented at an angle $\tan^{-1}\sqrt{A/B}$ with respect to the $X$-axis. Although the error bars $\sigma_X$ and $\sigma_Y$ will be huge, i.e. the individual estimates of $X$ and $Y$ will be unreliable, the large off-diagonal elements of the co-variance matrix tell us that we can still infer a linear combination of these parameters very well.

85 Co-variance & Correlation
In the first image the co-variance (correlation) is zero. In the second image, the co-variance is large and negative ($Y + mX = \mathrm{const}$). In the third image, the co-variance is large and positive ($Y - mX = \mathrm{const}$).

86 Error Bars
An illustration of the difference between the (correct) marginal error bar $\sigma_{ii}$ and its misleading best-fit counterpart, indicated by the bold line in the middle. The former, correct one is given by the $i$-th diagonal element of $[-\nabla\nabla L]^{-1}$, while the latter is related to the inverse of $\frac{\partial^2 L}{\partial X_i^2}$.

87 Asymmetric and multimodal posterior pdfs
A schematic illustration of two posterior pdfs for which the multivariate Gaussian is NOT a good approximation.

88 Asymmetric and multimodal posterior pdfs
Here the concepts of error bar and linear correlation associated with the covariance matrix are not suitable. We might think of generalizing the idea of confidence intervals to a multidimensional space, but it will usually be hard to describe the surface of the (smallest) hypervolume containing, say, 90% of the probability in just a few numbers. The situation may be even worse when the pdf has several maxima. This relates to the challenge of finding global maxima in a large multidimensional parameter space when there are many small local hillocks.

89 Co-Variance Matrix & Hessian Matrix
We notice that
$[\sigma^2]_{ij} = [(-\nabla\nabla L)^{-1}]_{ij} = 2\,[\nabla\nabla(\chi^2)]^{-1}_{ij} = [H^{-1}]_{ij}$ (61)
where $H$ refers to the Hessian matrix. $[\sigma^2]_{ij}$ is referred to as the covariance matrix $C$:
$C = \langle(X_i - X_{i0})(X_j - X_{0j})\rangle = [\sigma^2]_{ij} = [(-\nabla\nabla L)^{-1}]_{ij} = 2\,[\nabla\nabla(\chi^2)]^{-1}_{ij} = H^{-1}$ (62)
In conclusion, the likelihood function (and thus the posterior pdf) is defined completely by the optimal solution $\vec{X}_0$ and the second derivative of $L$ at the maximum $L_0$, which corresponds to the covariance matrix $C$.
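In a least-squares setting this is easy to evaluate numerically; the sketch below (an illustration with made-up numbers, using the Gauss-Newton approximation $\nabla\nabla\chi^2 \approx 2J^TJ$, where $J$ is the Jacobian of the weighted residuals) obtains the covariance matrix and standard errors of the fitted parameters:

```python
import numpy as np

# Hypothetical Jacobian of the weighted residuals R_k with respect to the
# two parameters, evaluated at the optimum (K = 5 data points, P = 2 parameters).
J = np.array([[ 0.5, -1.0],
              [ 0.9, -1.8],
              [ 1.2, -2.1],
              [ 1.4, -2.2],
              [ 1.5, -2.2]])

# Gauss-Newton approximation: nabla nabla chi^2 ~ 2 J^T J,
# so the covariance matrix is C = 2 [nabla nabla chi^2]^{-1} ~ (J^T J)^{-1}.
cov = np.linalg.inv(J.T @ J)
std_errors = np.sqrt(np.diag(cov))
correlation = cov[0, 1] / (std_errors[0] * std_errors[1])

print("covariance matrix:\n", cov)
print("standard errors:", std_errors, "| correlation:", correlation)
```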

90 Example 1: Fitting two parameters of a straight line
In the next step we want to estimate both $m$ and $c$. We use the vector-matrix formulation and write
$\begin{pmatrix} m_{N+1} \\ c_{N+1} \end{pmatrix} = \begin{pmatrix} m_N \\ c_N \end{pmatrix} - [\nabla\nabla L]^{-1}\,\nabla L$, (63)
where
$\nabla L = -\begin{pmatrix} \sum_k \frac{(m_N x_k + c_N - Y_k)x_k}{\sigma_k^2} \\ \sum_k \frac{m_N x_k + c_N - Y_k}{\sigma_k^2} \end{pmatrix}, \quad \nabla\nabla L = -\begin{pmatrix} \sum_k \frac{x_k^2}{\sigma_k^2} & \sum_k \frac{x_k}{\sigma_k^2} \\ \sum_k \frac{x_k}{\sigma_k^2} & \sum_k \frac{1}{\sigma_k^2} \end{pmatrix}$. (64)

91 As seen above, the variance of the estimate can be determined as
$\sigma^2 = [-\nabla\nabla L]^{-1} = \frac{1}{\sum_k \frac{x_k^2}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2} - \left(\sum_k \frac{x_k}{\sigma_k^2}\right)^2}\begin{pmatrix} \sum_k \frac{1}{\sigma_k^2} & -\sum_k \frac{x_k}{\sigma_k^2} \\ -\sum_k \frac{x_k}{\sigma_k^2} & \sum_k \frac{x_k^2}{\sigma_k^2} \end{pmatrix}$ (65)
If we do not know the error bars for the data, then we might do the analysis by assuming that they are all of the same size, i.e. $\sigma_k = \sigma$, and replace $\sigma$ by the estimate $S$:
$S^2 = \frac{1}{N-1}\sum_k (m_0 x_k + c_0 - Y_k)^2$ (66)
where $m_0$, $c_0$ denote the best estimates of the parameters.
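A compact numpy version of this straight-line example (with synthetic data invented for illustration): solve the weighted normal equations and read the parameter variances from the inverse of $-\nabla\nabla L$:

```python
import numpy as np

rng = np.random.default_rng(4)
m_true, c_true = 1.5, 0.3
x = np.linspace(0, 5, 20)
sigma = np.full_like(x, 0.2)
Y = m_true * x + c_true + sigma * rng.normal(size=x.size)

w = 1.0 / sigma**2
# -nabla nabla L for the straight-line model F_k = m*x_k + c.
neg_hess = np.array([[np.sum(w * x**2), np.sum(w * x)],
                     [np.sum(w * x),    np.sum(w)]])
rhs = np.array([np.sum(w * x * Y), np.sum(w * Y)])

m0, c0 = np.linalg.solve(neg_hess, rhs)        # least-squares estimates
cov = np.linalg.inv(neg_hess)                  # covariance matrix of (m, c)
sigma_m, sigma_c = np.sqrt(np.diag(cov))

print(f"m = {m0:.3f} +/- {sigma_m:.3f}, c = {c0:.3f} +/- {sigma_c:.3f}")
```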

92 Summary & Outlook
1 Pre-regression Diagnostics: Identifiability
2 Bayesian Inference
3 Optimization Algorithms
Next Step: Post-regression Diagnostics

93 Thanks!!
Thanks for your attention! Slides for this talk will be available at:


More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An

More information

Lecture 11. Probability Theory: an Overveiw

Lecture 11. Probability Theory: an Overveiw Math 408 - Mathematical Statistics Lecture 11. Probability Theory: an Overveiw February 11, 2013 Konstantin Zuev (USC) Math 408, Lecture 11 February 11, 2013 1 / 24 The starting point in developing the

More information

A Bayesian Treatment of Linear Gaussian Regression

A Bayesian Treatment of Linear Gaussian Regression A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Signal Processing - Lecture 7

Signal Processing - Lecture 7 1 Introduction Signal Processing - Lecture 7 Fitting a function to a set of data gathered in time sequence can be viewed as signal processing or learning, and is an important topic in information theory.

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University Lecture 19 Modeling Topics plan: Modeling (linear/non- linear least squares) Bayesian inference Bayesian approaches to spectral esbmabon;

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

Computational Methods. Least Squares Approximation/Optimization

Computational Methods. Least Squares Approximation/Optimization Computational Methods Least Squares Approximation/Optimization Manfred Huber 2011 1 Least Squares Least squares methods are aimed at finding approximate solutions when no precise solution exists Find the

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

CS229 Lecture notes. Andrew Ng

CS229 Lecture notes. Andrew Ng CS229 Lecture notes Andrew Ng Part X Factor analysis When we have data x (i) R n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting,

More information

Lecture 6: Selection on Multiple Traits

Lecture 6: Selection on Multiple Traits Lecture 6: Selection on Multiple Traits Bruce Walsh lecture notes Introduction to Quantitative Genetics SISG, Seattle 16 18 July 2018 1 Genetic vs. Phenotypic correlations Within an individual, trait values

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 Petros Koumoutsakos Gerardo Tauriello (Last update: July 27, 2015) IMPORTANT DISCLAIMERS 1. REFERENCES: Much of the material

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

ECE 636: Systems identification

ECE 636: Systems identification ECE 636: Systems identification Lectures 3 4 Random variables/signals (continued) Random/stochastic vectors Random signals and linear systems Random signals in the frequency domain υ ε x S z + y Experimental

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 10 December 17, 01 Silvia Masciocchi, GSI Darmstadt Winter Semester 01 / 13 Method of least squares The method of least squares is a standard approach to

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information