Lecture 10: The quality of the Estimate

1 Computational Biology Group (CoBI), D-BSSE, ETHZ Lecture 10: The quality of the Estimate Prof Dagmar Iber, PhD DPhil BSc Biotechnology 2015/16

2 Contents
1 Review of Previous Lectures
2 Post-regression diagnostics: Goodness of fit (GOF), Uncertainty in the Parameters
3 Model Selection
4 Incorporation of prior information
5 Error Propagation

3 Literature
D.S. Sivia, Data Analysis, OUP.
Jaqaman & Danuser, Linking data to models: data regression. Nat Rev Mol Cell Biol (2006) 7.

4 Parameter Inference
1 Pre-regression Diagnostics: Identifiability
2 Bayesian Inference
3 Post-regression Diagnostics

5 Review of Previous Lectures

6 Structural Identifiability
Structural Identifiability: A model with M state variables y and P parameters p is structurally identifiable if its sensitivity matrix
S_{ij} = \partial y_i / \partial p_j, i = 1, ..., M, j = 1, ..., P (1)
satisfies two conditions:
- each column has at least one large entry (i.e. each parameter has a large impact on at least one experimental measurement)
- the matrix has full rank (i.e. all columns must be linearly independent, which means that the effects of the parameters on the measurements must be independent of each other).

7 Structural & Practical Identifiability
If parameters are correlated then only relative values can be determined for the correlated parameters, since their effects compensate for each other.

8 Bayes Theorem for Parameter Estimation
According to Bayes' Theorem,
prob(X|D, I) = prob(D|X, I) prob(X|I) / prob(D|I).
prob(X|D, I): posterior probability density function (pdf) that we want to determine
prob(D|X, I): likelihood function
prob(X|I): prior probability density function (pdf) that reflects our knowledge about the system
prob(D|I): evidence, i.e. the likelihood of the data based on our knowledge. Here one could incorporate knowledge about the quality of different experimental techniques or experimental groups.

9 Maximum likelihood estimate
prob(X|D, I) \propto prob(D|X, I), (2)
posterior pdf \propto likelihood function.
Maximum likelihood estimate: Our best estimate X_0, given by the maximum of the posterior, is equivalent to the solution that yields the greatest value for the probability of the observed data.

10 Assume Gaussian Process
We can then write
prob(D_k|X, I) = 1/(σ_k \sqrt{2π}) \exp( -(F_k - D_k)^2/(2σ_k^2) ). (3)
We can rewrite this equation as
prob(D|X, I) \propto \exp(-χ^2/2) with χ^2 = \sum_k ((F_k - D_k)/σ_k)^2 = \sum_k R_k^2/σ_k^2.
Residuals: The R_k = F_k - D_k are referred to as residuals.

11 Least-squares estimate
Take the logarithm of the likelihood function:
L = ln(prob(D|X, I)) = const - χ^2/2. (4)
Since the maximum of the posterior will occur when χ^2 is smallest, the corresponding optimal solution X_0 is called the least-squares estimate.

12 Parameter Estimation for ODE Models
Consider a dynamical system with N state variables which we describe by a set of ordinary differential equations:
dx(t)/dt = f(x(t), t, k), x(t_0) = x_0. (5)
x(t): vector of all state variables
k: vector of all parameters
x_0: vector of all initial conditions

13 Observables
Often, the state variables cannot be directly observed. We specify an observation function g: R^N → R^M which maps the state variables x to a set of M observables,
y(t) = g(x(t), s). (6)
We require both f(·) and g(·) to be continuously differentiable functions with respect to their parameters. Note that we may be able to only partially observe the system, such that M < N. The parameter vector p now comprises the kinetic parameters k, the initial conditions x_0, and the parameters of the observation function s, such that
p = {x_0, k, s}. (7)
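As a concrete illustration of Eqs. (5)-(7), the following Python sketch sets up a hypothetical two-state conversion model with kinetic parameters k, an observation function g that reports only a scaled version of the second state (so M < N), and the combined parameter vector p = {x_0, k, s}. Model structure, parameter values and scaling factor are invented for illustration only.

import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, k):
    # dx/dt = f(x, t, k): conversion x1 -> x2 with degradation of x2 (hypothetical model)
    k1, k2 = k
    return [-k1 * x[0], k1 * x[0] - k2 * x[1]]

def g(x, s):
    # observation function: only a scaled version of x2 is measured (M = 1 < N = 2)
    return s * x[1]

# parameter vector p = {x0, k, s} (all values hypothetical)
x0 = [1.0, 0.0]
k = [0.5, 0.1]
s = 2.0

t_eval = np.linspace(0, 20, 11)
sol = solve_ivp(f, (0, 20), x0, args=(k,), t_eval=t_eval)
y = g(sol.y, s)          # observable time course y(t_i)
print(y)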

14 Maximum Likelihood
The optimal parameter set is the one with the highest probability of observing the data and can be determined by maximizing the likelihood prob(y|p) of the data y_ij with respect to the parameter set p:
prob(y|p) = \prod_{i=1}^{T} \prod_{j=1}^{M} 1/(σ_ij \sqrt{2π}) \exp( -(g_j(x(t_i, p), p) - y_ij)^2/(2σ_ij^2) ).

15 Log-likelihood
In practical terms, to find the maximum of the likelihood function the negative log-likelihood, L, is minimized:
L = -\log[prob(y|p)] = \sum_{i=1}^{T} \sum_{j=1}^{M} (1/2) R_ij(p)^2 + c_ij,
R_ij(p) = (g_j(x(t_i, p), p) - y_ij)/σ_ij, c_ij = \log[σ_ij \sqrt{2π}].
R_ij is called a residual. The term c_ij is independent of p and can be left out of the minimization.

16 The maximum likelihood estimator
The maximum likelihood estimator for the model parameters is thus given by the minimum of
-\log[L(y|p)] \propto \sum_{i=1}^{T} \sum_{j=1}^{M} (1/2) R_ij(p)^2. (8)
In the case of independent Gaussian measurement errors the parameters p can therefore be determined by least-squares minimization.
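A minimal sketch of such a least-squares minimization, using a hypothetical one-parameter decay model and simulated data: the weighted residuals R_i = (g(t_i, p) - y_i)/σ_i are handed to scipy.optimize.least_squares, a local, gradient-based minimizer.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 20)
sigma = 0.05
k_true = 0.4                                             # hypothetical true decay rate
y = np.exp(-k_true * t) + rng.normal(0, sigma, t.size)   # simulated data

def residuals(p):
    # weighted residuals R_i = (model - data)/sigma
    return (np.exp(-p[0] * t) - y) / sigma

res = least_squares(residuals, x0=[1.0])                 # local least-squares minimization
print("estimate k =", res.x[0], " chi2 =", 2 * res.cost) # res.cost = 0.5 * sum(R^2)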

17 Optimization Algorithms
1 Local Methods
Gradient-based Methods: Newton-Raphson Iterative Algorithm, Levenberg-Marquardt
Direct, derivative-free Methods: Simplex Methods (Nelder-Mead Method), Conjugate Gradient Method
2 Global Methods
Simulated Annealing
Evolutionary Algorithms

18 Gradient of the weighted residuals R_ij
\partial R_ij(p)/\partial p_l = (1/σ_ij) d g_j(x(t_i, p), p)/d p_l
= (1/σ_ij) ( \sum_{n=1}^{N} \partial g_j/\partial x_n |_{t_i} · dx_n/dp_l |_{t_i} + \partial g_j/\partial p_l |_{t_i} ). (9)
\partial g_j/\partial x_n and \partial g_j/\partial p_l are the Jacobians of the differential equation system with respect to the state variables and with respect to the parameters. S^n_{p_l} = dx_n/dp_l are the sensitivities of the state variables to changes in the parameter values that we discussed above.

19 Calculation of the Sensitivities
The sensitivities can be computed by integrating the sensitivity equations (as discussed above) in parallel with the ODE model:
dS^n_{p_l}/dt = d/dt (dx_n/dp_l) = d/dp_l (dx_n/dt) = d f_n(t, x(t), k)/d p_l = \sum_{q=1}^{N} \partial f_n/\partial x_q · dx_q/dp_l + \partial f_n/\partial p_l,
with initial conditions
S^n_{p_l}(0) = dx_n(0)/dp_l = 1 if p_l = x_n(0), and 0 otherwise (in particular for p_l ∈ {s, k}).
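A sketch of how the sensitivity equations can be integrated in parallel with the model, for the hypothetical one-state decay model dx/dt = -k x: its sensitivity S = dx/dk obeys dS/dt = -k S - x with S(0) = 0, so state and sensitivity form one augmented ODE system.

import numpy as np
from scipy.integrate import solve_ivp

def augmented(t, z, k):
    # z = [x, S] with S = dx/dk; dS/dt = (df/dx) S + df/dk = -k*S - x
    x, S = z
    return [-k * x, -k * S - x]

k = 0.4
sol = solve_ivp(augmented, (0, 10), [1.0, 0.0], args=(k,), t_eval=np.linspace(0, 10, 6))
print(sol.y[1])   # sensitivity dx/dk at the requested time points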

20 Workflow of gradient based minimization procedures
Initialize model system and parameters
LOOP
  Integrate ODE and sensitivity equations
  Calculate Jacobian of residuals
  Calculate residuals
  IF change in norm(residuals) < threshold
    BREAK
  ELSE
    Update parameter values using current parameter values and Jacobian
  ENDIF
ENDLOOP
Calculate fit statistics, parameter variances and confidence limits

21 Post-regression diagnostics

22 The Workflow

23 Post-regression diagnostics
After fitting a regression model it is important to determine whether all the necessary model assumptions are valid before performing inference. In constructing our regression models we assumed that the errors were independent and normal random variables with mean 0 and constant variance σ_ij^2. Model diagnostic procedures involve both graphical methods and formal statistical tests. These procedures allow us to explore whether the assumptions of the regression model are valid and decide whether we can trust subsequent inference results.

24 Goodness of fit (GOF)

25 Goodness of fit
We have so far assumed that the measurement error is Gaussian distributed. The weighted residuals,
R_ij(p) = (g_j(x(t_i, p), p) - y_ij)/σ_ij, (10)
must then also be Gaussian distributed with unit variance. The sum of squared residuals follows a χ^2 distribution with T·M degrees of freedom:
\sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(p)^2 ~ χ^2_df(p), df = T·M. (11)

26 Chi-square distribution
By definition, the Chi-square distribution is the distribution of the sum of the squared values of observations drawn from the N(0, 1) distribution. It is denoted by the symbol χ^2, which is pronounced 'ky square'. More precisely and more formally: let {X_1, X_2, ..., X_n} be n independent random variables, all N(0, 1). Then χ^2_n is defined as the distribution of the sum X_1^2 + X_2^2 + ... + X_n^2. So there is not one χ^2 distribution, but a family of distributions, indexed by n. This parameter is called the number of degrees of freedom of the distribution. The Chi-square distribution with n degrees of freedom is therefore the distribution of the sum of n independent squared random variables, all N(0, 1).

27 What does the χ^2 value tell us?
\sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(p)^2 ~ χ^2_df(p), df = T·M
Intuitively we expect from a good fit that the deviations of the model from the data should be of the same order as the measurement error, i.e. |R_ij| ≈ 1, which means that the sum should be centered around T·M. A much larger χ^2 value than T·M indicates some variation in the data which is not accounted for by the model.

28 Goodness of fit (GOF) test
\sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(p)^2 ~ χ^2_df(p), df = T·M.
A goodness of fit (GOF) test calculates the probability of observing an as large or larger value than the value of χ^2(p*) at the minimum. Note however that we have adjusted the parameters in order to minimize χ^2_df(p), and therefore the degrees of freedom in the GOF test are df = T·M - P, where P is the number of fitted parameters. Usually a cut-off value such as Pr[χ^2_df ≥ χ^2(p*)] < 0.05 is used to reject the fit.

29 Limitations
The GOF test is strictly one sided: a much smaller value of χ^2_df(p*) than expected does not necessarily indicate overfitting, which would be the case if the model fitted the particular realization of the measurement error.
\sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(p)^2 ~ χ^2_df(p), df = T·M.
In some cases, an exceptionally large value of χ^2_df(p*), i.e. a small probability Pr[χ^2_df ≥ χ^2_df(p*)], does not necessarily result from a bad model fit. It can also result from an underestimation of the measurement errors.
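A sketch of the GOF test under these assumptions: given the value of the weighted sum of squares at the minimum, T·M data points and P fitted parameters (all numbers below hypothetical), the survival function of the χ^2 distribution gives the probability of an equal or larger value.

from scipy.stats import chi2

chi2_min = 27.3        # hypothetical value of the objective at the optimum
T, M, P = 10, 3, 4     # time points, observables, fitted parameters (hypothetical)
df = T * M - P
p_value = chi2.sf(chi2_min, df)   # Pr[chi2_df >= chi2_min]
print("p =", p_value, "-> reject fit" if p_value < 0.05 else "-> fit acceptable")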

30 Example: TGF-β Model

31 Example: Time Courses
Figure: Sensitivity analysis of the TGFβ signaling model. (A) Time course of the stimulation protocol and the system output, the nuclear Smad2/Smad4 complex. (B) Time-resolved sensitivities of the parameters controlling the dynamics of the nuclear Smad2/Smad4 complex. For details see main text. (C) Clustergram of the steady-state control coefficients.

32 Performance of different optimization procedures
Table: comparison of the Matlab algorithm lsqnonlin to alternative minimizers in the Systems Biology Toolbox. Columns: average computation time (cputime ×10^3), % correct matches, best performance (χ^2), maximum iterations. Rows: NonLinLS, tribes, Gen. Alg., Simplex, Chc, PSO, Hill, Cmaes.
Different optimizers were run on the same data set. The first column shows the average computation time of each method. Every algorithm was run 5 times with different initial parameter values. The second column measures the number of times the algorithm produces an acceptable result (χ^2 < 30). The third column is the average χ^2 value at the optimum. The last column is the maximum number of iterations we allowed for each algorithm before it was stopped. Abbreviations: NonLinLS: Matlab lsqnonlin routine. simplex: Downhill Simplex Method in Multidimensions. cmaes: (15,50)-Evolution Strategy with Covariance Matrix Adaptation. chc: Clustering multi-start hill climbing. ga: Standard Genetic Algorithm with elitism. hill: Multi-start hill climbing. pso: Particle Swarm Optimization with constriction. tribes: adaptive Particle Swarm Optimization.

33 Uncertainty in the Parameters

34 Uncertainty in the Parameters
Suppose we have estimated the parameter vector p̂. How close is this estimate to the true parameter vector p? To this end we define the vector Δp = p̂ - p to specify the difference.

35 Confidence Intervals
There is no simple and generally valid way of calculating confidence limits of parameters for all problems faced in nonlinear optimization. However, in our context there is an approximate result which is valid in the limit of infinitely many data points and complete parameter identifiability. Specifically, one can relate the variance in the parameters to the curvature of the χ^2_df(p) function at its minimum in order to derive parameter covariances and asymptotic confidence intervals.

36 Covariance Matrix
We have previously noted that the covariance matrix Σ for our parameter estimate is related to the inverse of the Hessian of the log-likelihood function,
Σ = [-\nabla\nabla L]^{-1}. (12)
We will now revisit this idea.

37 Maximum Likelihood Estimate
The log-likelihood function L is given as
L = ln(prob(D|X, I)) = const - χ^2/2. (13)
For easier readability, we will now write
p = prob(D|X, I). (14)
At the maximum, i.e. for the optimal parameter vector X_0, we have
\nabla L = \partial ln(p)/\partial X |_{X = X_0} = (1/p) \partial p/\partial X = 0. (15)

38 Maximum Likelihood Estimate
By definition, we have
\int p dD = 1. (16)
Differentiating with respect to the parameter vector X and using
\nabla L = (1/p) \partial p/\partial X, i.e. \partial p/\partial X = p \nabla L, (17)
we have
\partial/\partial X \int p dD = \int p \nabla L dD = 0. (18)

39 Maximum Likelihood Estimate
\partial/\partial X \int p dD = \int p \nabla L dD = 0. (19)
Differentiating again with respect to the parameter vector X, we obtain
\partial^2/\partial X^2 \int p dD = 0 = \int p \nabla\nabla L dD + \int p (\nabla L)^2 dD. (20)
We thus have
E[(\nabla L)^2] = -E[\nabla\nabla L]. (21)

40 Fisher-Information Matrix
E[(\nabla L)^2] = -E[\nabla\nabla L]. (22)
For large -E[\nabla\nabla L] we have a large curvature of the likelihood function at its optimum, and thus tight confidence intervals and thus more information. Accordingly,
FIM = -E[\nabla\nabla L] = E[(\nabla L)^2] (23)
is called the (Fisher) Information Matrix.

41 The Cramer-Rao Bound
The Cramer-Rao lower bound is a useful, maybe the best, statistical indicator for the errors made in estimating the true parameter values. The so-called Cramer-Rao inequality provides a lower bound on the variance of an unbiased estimator, as will be seen in the sequel.

42 The Cramer-Rao Bound
Let X_e(D) be any estimator of the true parameter value X based on the measurements D. We will now drop the vector arrows for better readability. Let \bar{X}_e(D) = E[X_e(D)] be the expectation of the estimate. Its variance is given by
σ^2_{Xe} = E[(X_e(D) - \bar{X}_e(D))^2]. (24)

43 Bias in an estimator
The bias in an estimator is defined as
E[X_e(D) - X] = \int X_e(D) p(D|X, I) dD - X = b(X). (25)
If b(X) = 0, it is called an unbiased estimator, because then on average the expected value of the estimate is the same as the true parameter. Bias in general cannot be determined since it depends on the true value of the parameter, which in practice is unknown. Often the estimates would be biased if the noise were not zero mean.

44 Bias in an estimator
We now have
\int X_e(D) p(D|X, I) dD = X + b(X). (26)
Differentiating on both sides with respect to (w.r.t.) X we get
\int X_e(D) \partial p(D|X, I)/\partial X dD = \int X_e(D) (\nabla L) p dD = 1 + b'(X), (27)
since X_e(D) is a function of the data D only.

45 Bias in an estimator
Multiplying \int p \nabla L dD = 0 (28) by \bar{X}_e(D) and subtracting it from \int X_e(D) (\nabla L) p dD = 1 + b'(X) (29) yields
\int (X_e(D) - \bar{X}_e(D)) (\nabla L) p dD = 1 + b'(X),
which we rewrite as
\int (X_e(D) - \bar{X}_e(D)) \sqrt{p} · (\nabla L) \sqrt{p} dD = 1 + b'(X). (30)

46 Cauchy-Schwarz inequality
Now we are ready to apply the well-known Cauchy-Schwarz inequality:
[\int f(z) g(z) dz]^2 ≤ \int f(z)^2 dz · \int g(z)^2 dz. (31)
Equality applies if f(z) = k g(z), for a constant k. You may know the form for vectors: the Cauchy-Schwarz inequality states that for all vectors x and y of an inner product space it is true that
|⟨x, y⟩|^2 ≤ ⟨x, x⟩ · ⟨y, y⟩, (32)
where ⟨·,·⟩ is the inner product, also known as the dot product.

47 Cauchy-Schwarz inequality
We thus have
\int (X_e(D) - \bar{X}_e(D)) \sqrt{p} · (\nabla L) \sqrt{p} dD = 1 + b'(X). (33)
(1 + b'(X))^2 = [\int (X_e(D) - \bar{X}_e(D)) \sqrt{p} · (\nabla L) \sqrt{p} dD]^2
≤ \int (X_e(D) - \bar{X}_e(D))^2 p dD · \int (\nabla L)^2 p dD = σ^2_{Xe} · FIM.

48 Cramer-Rao Inequality
σ^2_{Xe} ≥ (1 + b'(X))^2 · FIM^{-1}. (34)
For an unbiased estimator, b(X) = 0, and hence
σ^2_{Xe} ≥ FIM^{-1}. (35)
The equality sign holds if
X_e(D) - \bar{X}_e(D) = k \nabla L. (36)
For an unbiased, efficient estimator, we thus have the Cramer-Rao Bound
σ^2_{Xe} = FIM^{-1} = (-E[\nabla\nabla L])^{-1} = E[(\nabla L)^2]^{-1}. (37)

49 The MLE is efficient
For efficiency we have to show that
X_e(D) - \bar{X}_e(D) = k \nabla L. (38)
At the optimum we have
\nabla L = 0. (39)
In case of an unbiased estimate, b(X) = 0, we also have at the optimum
X_e(D) - \bar{X}_e(D) = 0. (40)
Hence the equality is established and the ML estimator is efficient. This is a very important property of the ML estimator.

50 Fisher-Information Matrix
The log-likelihood function L is given as
L = ln(p(D|X)). (41)
Given a probability density function p(D|X), the Fisher Information Matrix is thus defined as
FIM = E[ \partial \log p/\partial X_i · \partial \log p/\partial X_j ] = E[ -\partial^2 \log p/(\partial X_i \partial X_j) ] ≈ Σ^{-1}. (42)
It is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the accuracy with which we can expect to be able to estimate X given observation data D. As the variance decreases, the amount of information increases.

51 Determination of FIM
Recall that for parameters X
L = -\log(p) = \sum_{i=1}^{T} \sum_{j=1}^{M} (1/2) R_ij(X)^2 + c_ij = const + χ^2/2,
with
R_ij(X) = (g_j(x(t_i, X), X) - y_ij)/σ_ij, c_ij = \log[σ_ij \sqrt{2π}],
such that
-\partial \log(p)/\partial X_l = (1/2) \partial χ^2/\partial X_l = \sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(X) · \partial R_ij/\partial X_l,
\partial R_ij(X)/\partial X_l = (1/σ_ij) d g_j(x(t_i, X), X)/d X_l = (1/σ_ij) ( \sum_{n=1}^{N} \partial g_j/\partial x_n |_{t_i} · dx_n/dX_l |_{t_i} + \partial g_j/\partial X_l |_{t_i} ).

52 Determination of FIM
With
-\partial \log(p)/\partial X_l = \sum_{i=1}^{T} \sum_{j=1}^{M} R_ij(X) · \partial R_ij(X)/\partial X_l
and \partial R_ij(X)/\partial X_l as above, we can further approximate
(-\nabla\nabla \log(p))_{kl} ≈ \sum_{i=1}^{T} \sum_{j=1}^{M} \partial R_ij/\partial X_k · \partial R_ij/\partial X_l,
where we neglected the second order derivatives.

53 Fisher Information Matrix (FIM)
The Fisher Information Matrix can then be calculated from the residuals during minimization:
FIM_ij = E[ -\partial^2 \log p/(\partial X_i \partial X_j) ] ≈ (S^T S)_ij. (43)
The approximation neglects the second derivative terms but is computationally inexpensive, since S, with entries S_{(ij),l} = \partial R_ij/\partial X_l, is the gradient matrix of the residuals that is calculated during minimization anyway.

54 Confidence Intervals
Asymptotic confidence intervals can be calculated by taking into account the distribution of the χ^2 values, which is approximately Gaussian for large degrees of freedom. The 95% confidence intervals are given by
p̂ ± 1.96 \sqrt{diag(C)}, (44)
where C = FIM^{-1} is the approximate covariance matrix. Note that this asymptotic result is just a lower bound on the uncertainty of the parameter estimate (the Cramer-Rao bound).
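A sketch of these asymptotic 95% confidence intervals from the FIM approximation FIM ≈ S^T S: in practice S would be the Jacobian of the weighted residuals (e.g. res.jac from least_squares); here a small hypothetical Jacobian and parameter estimate are used.

import numpy as np

S = np.array([[1.0, 0.2],
              [0.8, 0.5],
              [0.3, 0.9]])          # hypothetical Jacobian of weighted residuals (rows: data, cols: parameters)
p_hat = np.array([0.40, 1.30])      # hypothetical parameter estimate

FIM = S.T @ S                       # Fisher information approximation
C = np.linalg.inv(FIM)              # approximate covariance matrix
ci = 1.96 * np.sqrt(np.diag(C))     # half-widths of the asymptotic 95% confidence intervals
print(p_hat - ci, p_hat + ci)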

55 Observed Fisher-Information Matrix
In the following we will approximate
FIM = E[ \partial \log p/\partial X_i · \partial \log p/\partial X_j ] ≈ S^T S. (45)
Strictly, if no expectation is taken we obtain a data-dependent quantity that is called the observed Fisher information, and it is this observed Fisher information that S^T S corresponds to. Based on the sensitivity matrix S, or rather on the Fisher information matrix (FIM), there are a number of easy-to-compute indicators.

56 The Quality of the Parameter Estimate

57 Confidence intervals
Assuming that all other parameters are exact, a confidence interval for a specific parameter is given by the intersection of the ellipsoidal confidence region with the parameter axis. This is the dependent confidence interval:
Δp_i^D = \sqrt{ C(α) / (S^T S)_{ii} }. (46)
The independent confidence interval is given by the projection of the ellipsoidal region onto the parameter axis:
Δp_i^I = \sqrt{ C(α) · ([S^T S]^{-1})_{ii} }. (47)

58 Dependent & Independent Confidence Intervals
If dependent and independent confidence intervals are similar and small, p̂_i is well-determined. In case of a strong correlation between parameters, the dependent confidence intervals underestimate the confidence region, whereas the independent confidence intervals overestimate it.

59 Correlation and Co-Variance
Another way to obtain information about the correlations between parameters is to look at the covariance matrix cov = (S^T S)^{-1}. The correlation coefficient of the ith and jth parameter is given by:
cor_ij = cov_ij / \sqrt{ cov_ii · cov_jj }. (48)
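A small sketch of Eq. (48), computed from a hypothetical parameter covariance matrix:

import numpy as np

cov = np.array([[0.040, 0.018],
                [0.018, 0.090]])      # hypothetical parameter covariance matrix, cov = (S^T S)^-1
sd = np.sqrt(np.diag(cov))
cor = cov / np.outer(sd, sd)          # cor_ij = cov_ij / sqrt(cov_ii * cov_jj)
print(cor)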

60 Interpretation in terms of eigenvalues & eigenvectors
Using the singular value decomposition for S,
S = U Φ V^T, (49)
where U is a unitary matrix (U^T U = U U^T = I) and V^T is the conjugate transpose of the unitary matrix V, we get
S^T S = V(p̂) Φ^T U^T U Φ V(p̂)^T = V(p̂) Φ^T Φ V(p̂)^T, (50)
where the eigenvectors of S^T S are the columns of the matrix V(p̂).

61 Interpretation in terms of eigenvalues & eigenvectors
So the principal axes of the ellipsoidal confidence region are given by the singular vectors, the column vectors of the matrix V(p̂), and the lengths of the principal axes are proportional to the reciprocals of the corresponding singular values, the diagonal elements of Φ(p̂).

62 Interpretation in terms of eigenvalues & eigenvectors
Using the transformation (rotation)
z = V^T(p̂) Δp, (51)
the equation for the ellipsoidal confidence region can be rewritten as
\sum_{i=1}^{m} φ_i^2 z_i^2 ≤ C(α). (52)
Note that C(α) is approximately proportional to the variance of the measurement errors.

63 Practical identifiability
The precise definition of practical identifiability depends on the level of accuracy, r_e, that one requires for the parameter estimates. This defines the sphere
\sum_{i=1}^{m} z_i^2 = r_e^2. (53)
To be able to determine z_i accurately enough, the radius of the ellipsoid along its ith principal axis should not exceed the radius of the sphere, which leads to the following inequality:
\sqrt{C(α)} / φ_i ≤ r_e. (54)

64 Practical identifiability
Suppose that only the first k largest singular values satisfy \sqrt{C(α)}/φ_i ≤ r_e; then only the first k entries of z are estimated with the required accuracy. If a principal axis of the ellipsoid makes a significant angle with the axes in parameter space (i.e. there exists more than one significant entry in the corresponding eigenvector), this corresponds to the presence of correlation among parameters in p̂. In this case, only a combination of parameters can be determined.

65 Practical identifiability
To summarize, the level of noise in the data, in combination with the accuracy requirement for the parameter estimates, defines the threshold for significant singular values in the matrix Φ. The number of singular values exceeding this threshold determines the number of parameter relations that can be derived from the experiment. How these relations relate to the individual parameters is described by the corresponding columns of the matrix V.

66 How to improve your estimates?
The inequality
\sum_{i=1}^{m} φ_i^2 z_i^2 ≤ C(α) (55)
indicates that having, for example, two times more accurate data, so that the standard deviation σ is halved, will decrease the radii along the ellipsoid's principal axes by a factor of 2. Therefore, in case of very small singular values φ_i (i.e. strongly elongated ellipsoids), more accurate data obtained by the experimentalist will not improve the quality of the corresponding parameter estimates much. In such a case, one certainly needs additional measurements of a different type (e.g. different components, different time points, or, in the case of partial differential equations, different spatial points).
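A sketch of the SVD-based identifiability check described above: count how many singular values of the sensitivity matrix S satisfy \sqrt{C(α)}/φ_i ≤ r_e, i.e. φ_i ≥ \sqrt{C(α)}/r_e, and inspect the corresponding rows of V^T for parameter combinations. The matrix S, C(α) and r_e below are hypothetical; the first two columns of S are nearly collinear, so only one direction passes the threshold.

import numpy as np

S = np.array([[1.0, 0.99, 0.1],
              [0.8, 0.81, 0.4],
              [0.5, 0.52, 0.9]])      # hypothetical sensitivity matrix (first two columns nearly collinear)
C_alpha = 1.0                          # hypothetical confidence-level constant
r_e = 1.0                              # required accuracy of the parameter estimates

U, phi, Vt = np.linalg.svd(S, full_matrices=False)
identifiable = phi >= np.sqrt(C_alpha) / r_e    # singular values above the threshold
print("singular values:", phi)
print("identifiable directions (rows of V^T):", Vt[identifiable])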

67 Parameter Correlation and Identifiability
Frequently, the optimization procedure does not yield a unique optimal parameter set, because there is no unique optimal χ^2(p*) value given the available data. In this case the values of some or all parameters are non-identifiable. Non-identifiability is the result of a non-unique χ^2 minimum, which can be caused, e.g., by a very flat χ^2 landscape. The latter implies a functional relationship between parameters along which the χ^2 value is unaltered. As a result, parameter estimates are highly correlated. There are three common ways to deal with non-identifiability.

68 Addressing Parameter Correlations
1 Fix some of the non-identifiable parameters at educated values and only estimate the remaining parameters. These estimates are of course biased, since their optimum is in a functional relation to the fixed parameters.
2 Subsequent analyses can be based on all admissible parameter sets, and the parameter sets can then be clustered according to the predictions derived from them.
3 Reduce the model such that it does not contain the non-identifiable parameters, e.g. by coarse-graining the model.

69 Non-Identifiability & Data Quality
It is noteworthy that non-identifiability of parameters does not imply a poor fit to the data, but rather that parameter values cannot be constrained to a unique value. The predictive power of the model will therefore be limited to model predictions that are not sensitive to the non-identifiable parameters.

70 Bootstrap Methods
In many real-world settings the uncertainty in parameter estimates is larger due to a limited amount of data. Here, an alternative but computationally more expensive way to determine parameter uncertainties, called the bootstrap, is more appropriate. Bootstrap methods construct the empirical distribution of the parameter estimates by repeated data resampling and consecutive parameter estimation. In this way, parameter uncertainties can be inferred from the shape of the empirical parameter distributions.
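A minimal bootstrap sketch for the hypothetical decay model used earlier: the measured time points are resampled with replacement, the fit is repeated, and the spread of the resulting estimates approximates the parameter uncertainty.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 20)
sigma = 0.05
y = np.exp(-0.4 * t) + rng.normal(0, sigma, t.size)    # simulated data, hypothetical model

def fit(tt, yy):
    # least-squares fit of the decay rate on a (possibly resampled) data set
    res = least_squares(lambda p: (np.exp(-p[0] * tt) - yy) / sigma, x0=[1.0])
    return res.x[0]

estimates = []
for _ in range(200):                                   # bootstrap replicates
    idx = rng.integers(0, t.size, t.size)              # resample data points with replacement
    estimates.append(fit(t[idx], y[idx]))
print("bootstrap mean/std of k:", np.mean(estimates), np.std(estimates))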

71 Example: TGF-β Model

72 Example: Time Courses
Figure: Sensitivity analysis of the TGFβ signaling model. (A) Time course of the stimulation protocol and the system output, the nuclear Smad2/Smad4 complex. (B) Time-resolved sensitivities of the parameters controlling the dynamics of the nuclear Smad2/Smad4 complex. For details see main text. (C) Clustergram of the steady-state control coefficients.

73 Example: Sensitivities & Correlations

74 Example: Sensitivities & Correlations

75 Model Selection

76 Bayesian Information Criterion (BIC)
In case of large datasets, the Bayesian Information Criterion (BIC) is more appropriate than test-based compare-and-rank model (CRM) approaches. This method assigns each model a score based on its likelihood, the number m of estimated free parameters in it, and the number n of fitted data points. The BIC for a model M is given by
-2 ln p(D|M) ≈ BIC = -2 ln L̂ + m (ln(n) - ln(2π)), (56)
where L̂ is the maximised value of the likelihood. For large n, this can be approximated by
BIC = -2 ln L̂ + m ln(n). (57)

77 Bayesian Information Criterion (BIC)
BIC = -2 ln L̂ + m ln(n). (58)
In the case of normally distributed measurements,
χ^2_min = -2 ln L̂ + C (59)
for some constant C, which does not vary between candidate models but depends only upon the data points. The BIC decreases as the likelihood increases, and increases as the number of parameters increases. Among competing models, the model that minimises the BIC is the most suitable to describe the available data.

78 Bayesian Information Criterion (BIC)
BIC = -2 ln L̂ + m ln(n). (60)
Because the first term grows linearly with the number n of fitted data points, while the second term is only proportional to ln(n), the relative penalty for having too many parameters is diminished as the data set gets larger.
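A sketch of a BIC comparison between two hypothetical candidate models fitted to the same n data points, using that for Gaussian errors -2 ln L̂ equals χ^2_min up to a model-independent constant:

import numpy as np

n = 30                                            # number of fitted data points (hypothetical)
chi2_min = {"model A": 41.0, "model B": 33.5}     # hypothetical minimal chi-square values
m = {"model A": 3, "model B": 6}                  # numbers of free parameters

bic = {name: chi2_min[name] + m[name] * np.log(n) for name in chi2_min}
best = min(bic, key=bic.get)                      # the model with the smallest BIC is preferred
print(bic, "-> preferred:", best)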

79 Incorporation of prior information

80 Incorporation of prior information
In most practical cases we have some prior knowledge about plausible parameter ranges. In the case of physical constants we know that the parameters are constant, and in the case of biochemical binding and reaction rates we know plausible ranges. Further experiments may restrict these bounds even more. We will now discuss how such information can be incorporated in the parameter estimation process.

81 Incorporation of prior information
Recall that
prob(X|D, I) = prob(D|X, I) prob(X|I) / prob(D|I). (61)
So far we have assumed that the prior is constant, i.e. prob(X|I) = const. We will now assume that we know a range in which the parameter value must lie.

82 Incorporation of prior information
If we already knew that X_j = x_0j ± ε_j, for example, then the assignment of an uncorrelated Gaussian pdf for the prior of X yields
prob(X|I) = \prod_{j=1}^{M} 1/(ε_j \sqrt{2π}) \exp( -(X_j - x_0j)^2/(2ε_j^2) ) \propto \exp(-C/2), (62)
where
C = \sum_{j=1}^{M} ((X_j - x_0j)/ε_j)^2. (63)
For the logarithm of the posterior pdf we then have
L = ln[prob(X|D, I)] = constant - (1/2)[χ^2 + C]. (64)
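A sketch of how such a Gaussian prior can be folded into the least-squares objective: since L = const - (χ^2 + C)/2, maximizing the posterior amounts to minimizing χ^2 + C, which can be done by appending the prior residuals (X_j - x_0j)/ε_j to the data residuals. The decay model, data and prior values below are hypothetical.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 20)
sigma = 0.05
y = np.exp(-0.4 * t) + rng.normal(0, sigma, t.size)     # simulated data

x0_prior, eps = 0.5, 0.05                               # prior knowledge: k = 0.5 +/- 0.05 (hypothetical)

def residuals(p):
    data_res = (np.exp(-p[0] * t) - y) / sigma          # contributes chi^2
    prior_res = np.array([(p[0] - x0_prior) / eps])     # contributes C
    return np.concatenate([data_res, prior_res])

res = least_squares(residuals, x0=[1.0])
print("MAP estimate of k:", res.x[0])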

83 Error Propagation

84 Error Propagation - changing variables
The Problem: Suppose that we have the probability distribution function (pdf) for X. What is the pdf for Y if Y = f(X)?

85 Error Propagation - changing variables
Imagine taking a very small interval δX about some arbitrary point X = X*; the probability that X lies in the range X* - δX/2 to X* + δX/2 is given by
prob(X* - δX/2 ≤ X ≤ X* + δX/2 | I) ≈ prob(X = X*|I) δX, (65)
where the equality becomes exact in the limit δX → 0.

86 Error Propagation - changing variables
Now Y = f(X) will map the point X = X* to Y* = f(X*) and the interval δX to δY:
prob(X = X*|I) δX = prob(Y = Y*|I) δY. (66)
As this must be true for any point in X-space, in the limit of infinitesimally small intervals we obtain the relationship
prob(X|I) = prob(Y|I) |dY/dX|. (67)
The term |dY/dX| = |df/dX| is the Jacobian.

87 Error Propagation in case of several variables
If we want to write the pdf for M parameters {X_j} in terms of the same number of quantities {Y_j}, then we require
prob({X_j}|I) δX_1 δX_2 ... δX_M = prob({Y_j}|I) δ^M Vol({Y_j}), (68)
where δ^M Vol({Y_j}) is the M-dimensional volume in Y-space mapped out by the small hypercube region δX_1 δX_2 ... δX_M in X-space:
δ^M Vol({Y_j}) = |∂(Y_1, Y_2, ..., Y_M)/∂(X_1, X_2, ..., X_M)| δX_1 δX_2 ... δX_M, (69)
where the quantity between the modulus signs is the multivariate Jacobian.

88 Error Propagation - several variables
prob({X_j}|I) = prob({Y_j}|I) |∂(Y_1, Y_2, ..., Y_M)/∂(X_1, X_2, ..., X_M)|.

89 Error Propagation - an example
Consider the transformation of a pdf defined on a two-dimensional Cartesian grid (x, y) to its equivalent form in polar coordinates (R, θ). We have
x = R cos θ, y = R sin θ. (70)
For the determinant of the Jacobian,
|∂(x, y)/∂(R, θ)| = | cos θ · R cos θ - (-R sin θ) · sin θ | = R (cos^2 θ + sin^2 θ) = R. (71)
Therefore
prob(R, θ|I) = prob(x, y|I) · R. (72)

90 Error Propagation - an example
Thus, if the pdf for x and y was an isotropic, bivariate Gaussian,
prob(x, y|I) = 1/(2πσ^2) \exp( -(x^2 + y^2)/(2σ^2) ), (73)
then the corresponding pdf for R and θ would take the form
prob(R, θ|I) = R/(2πσ^2) \exp( -R^2/(2σ^2) ). (74)

91 Error Propagation - an example
Finally, we determine the pdf for the radius R by marginalizing the joint pdf prob(R, θ|I) over θ:
prob(R|I) = \int_0^{2π} prob(R, θ|I) dθ = R/σ^2 \exp( -R^2/(2σ^2) ). (75)
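A quick Monte Carlo sanity check of this change of variables (σ chosen arbitrarily): samples of (x, y) from the isotropic Gaussian are converted to R, and the empirical mean of R is compared with the mean of the derived pdf, \int_0^∞ R · R/σ^2 \exp(-R^2/(2σ^2)) dR = σ \sqrt{π/2}.

import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5                                    # arbitrary choice
x = rng.normal(0, sigma, 100_000)
y = rng.normal(0, sigma, 100_000)
R = np.hypot(x, y)                             # radius of each sample
print(R.mean(), sigma * np.sqrt(np.pi / 2))    # empirical vs analytical mean of prob(R|I)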

92 Error Propagation
Consider Z = f(X, Y) = X + Y:
prob(Z|I) = \int\int prob(Z|X, Y, I) prob(X, Y|I) dX dY
= \int\int δ(Z - f(X, Y)) prob(X, Y|I) dX dY
= \int\int δ(Z - (X + Y)) prob(X, Y|I) dX dY, (76)
where the Dirac δ-function in the second line is zero unless Z = f(X, Y).

93 Error Propagation
Assume further X = x_0 ± σ_x, Y = y_0 ± σ_y, and that these parameters are uncorrelated. Then:
prob(Z|I) = \int dX prob(X|I) \int δ(Z - (X + Y)) prob(Y|I) dY.
Since the Dirac δ-function is infinitely sharp (but has unit area), the Y-integrand is zero unless Y = Z - X:
prob(Z|I) = \int prob(X|I) prob(Y = Z - X|I) dX.

94 Error Propagation
prob(Z|I) = \int prob(X|I) prob(Y = Z - X|I) dX.
Since X = x_0 ± σ_x and Y = y_0 ± σ_y, use Gaussian pdfs:
prob(Z|I) = 1/(2π σ_x σ_y) \int \exp( -(X - x_0)^2/(2σ_x^2) ) \exp( -(Z - X - y_0)^2/(2σ_y^2) ) dX.

95 Error Propagation
prob(Z|I) = 1/(2π σ_x σ_y) \int \exp( -(X - x_0)^2/(2σ_x^2) ) \exp( -(Z - X - y_0)^2/(2σ_y^2) ) dX.
After some tedious algebra,
prob(Z|I) = 1/(\sqrt{2π} σ_z) \exp( -(Z - z_0)^2/(2σ_z^2) ),
where z_0 = x_0 + y_0 and σ_z^2 = σ_x^2 + σ_y^2.

96 Error Propagation
prob(Z|I) = 1/(\sqrt{2π} σ_z) \exp( -(Z - z_0)^2/(2σ_z^2) ),
where z_0 = x_0 + y_0 and σ_z^2 = σ_x^2 + σ_y^2.
The pdf for Z = X - Y is the same, except that z_0 = x_0 - y_0.

97 A more intuitive approach
Suppose we are interested in Z = X - Y. Intuitively we might have guessed that the best estimate for the difference of the two parameters would be x_0 - y_0. Let's now have a look at the corresponding error bar.
Z - Z_0 = (X - X_0) - (Y - Y_0), so δZ = δX - δY.
Recall: σ_X^2 = ⟨(X - X_0)^2⟩ = \int\int (X - X_0)^2 prob(X, Y|{data}, I) dX dY.
Thus: ⟨δX^2⟩ = σ_X^2, ⟨δY^2⟩ = σ_Y^2, ⟨δX δY⟩ = 0.

98 A more intuitive approach
Since
⟨δZ^2⟩ = ⟨(δX - δY)^2⟩ = ⟨δX^2 + δY^2 - 2 δX δY⟩ = ⟨δX^2⟩ + ⟨δY^2⟩ - 2 ⟨δX δY⟩
and ⟨δX^2⟩ = σ_X^2, ⟨δY^2⟩ = σ_Y^2, ⟨δX δY⟩ = 0, we have
σ_Z = \sqrt{⟨δZ^2⟩} = \sqrt{σ_X^2 + σ_Y^2}.

99 A more intuitive approach
Let's next consider Z = X/Y. Using the quotient rule of differentiation we have
δZ = (Y δX - X δY)/Y^2.
This can be rewritten as
δZ/Z = δX/X - δY/Y.

100 A more intuitive approach
Squaring both sides and taking expectation values, we obtain
⟨δZ^2⟩/z_0^2 = ⟨δX^2⟩/x_0^2 + ⟨δY^2⟩/y_0^2 - 2 ⟨δX δY⟩/(x_0 y_0).
Here the X, Y, Z in the denominators were replaced by x_0, y_0, z_0 = x_0/y_0 because we are interested in deviations from the optimal solution. Thus, finally,
σ_z/z_0 = \sqrt{ σ_x^2/x_0^2 + σ_y^2/y_0^2 }.
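A Monte Carlo check of this quotient rule (all means and standard deviations hypothetical): draw X and Y independently, form Z = X/Y, and compare the empirical relative spread with \sqrt{(σ_x/x_0)^2 + (σ_y/y_0)^2}. The two numbers agree to first order because the relative errors are small.

import numpy as np

rng = np.random.default_rng(5)
x0, sx = 10.0, 0.2        # hypothetical X = x0 +/- sx
y0, sy = 4.0, 0.1         # hypothetical Y = y0 +/- sy
X = rng.normal(x0, sx, 100_000)
Y = rng.normal(y0, sy, 100_000)
Z = X / Y
print(Z.std() / Z.mean(), np.sqrt((sx / x0) ** 2 + (sy / y0) ** 2))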

101 WARNING
This intuitive approach may no longer work if the prior cuts off the posterior pdf, e.g. because the parameter values must be positive!!

102 Thanks!!
Thanks for your attention! Slides for this talk will be available at:


More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University Lecture 19 Modeling Topics plan: Modeling (linear/non- linear least squares) Bayesian inference Bayesian approaches to spectral esbmabon;

More information

STAT215: Solutions for Homework 2

STAT215: Solutions for Homework 2 STAT25: Solutions for Homework 2 Due: Wednesday, Feb 4. (0 pt) Suppose we take one observation, X, from the discrete distribution, x 2 0 2 Pr(X x θ) ( θ)/4 θ/2 /2 (3 θ)/2 θ/4, 0 θ Find an unbiased estimator

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department

More information

Introduction to gradient descent

Introduction to gradient descent 6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation Variations ECE 6540, Lecture 10 Last Time BLUE (Best Linear Unbiased Estimator) Formulation Advantages Disadvantages 2 The BLUE A simplification Assume the estimator is a linear system For a single parameter

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Cosmological parameters A branch of modern cosmological research focuses on measuring

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Some Curiosities Arising in Objective Bayesian Analysis

Some Curiosities Arising in Objective Bayesian Analysis . Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work

More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information