Lecture 10: Parameter Estimation for ODE Models


1 Computational Biology Group (CoBI), D-BSSE, ETHZ. Lecture 10: Parameter Estimation for ODE Models. Prof. Dagmar Iber, PhD DPhil. BSc Biotechnology, 2016/17.

2 Contents: 1 Review of Previous Lectures; 2 ODE Systems; 3 Identifiability; 4 Optimization Algorithms; 5 Quality of the Estimate

3 Literature: D.S. Sivia, Data Analysis, OUP. Jaqaman & Danuser, Linking data to models: data regression. Nat Rev Mol Cell Biol (2006) 7.

4 Review of Previous Lectures

5 Bayes' Theorem for Parameter Estimation
According to Bayes' theorem,
$\mathrm{prob}(X|D,I) = \frac{\mathrm{prob}(D|X,I)\,\mathrm{prob}(X|I)}{\mathrm{prob}(D|I)}$
$\mathrm{prob}(X|D,I)$: the posterior probability density function (pdf) that we want to determine.
$\mathrm{prob}(D|X,I)$: the likelihood function.
$\mathrm{prob}(X|I)$: the prior pdf, which reflects our knowledge about the system.
$\mathrm{prob}(D|I)$: the evidence, i.e. the likelihood of the data based on our knowledge. Here one could incorporate knowledge about the quality of different experimental techniques or experimental groups.

6 Maximum likelihood estimate
$\mathrm{prob}(X|D,I) \propto \mathrm{prob}(D|X,I)$ (1)
posterior pdf $\propto$ likelihood function.
Maximum likelihood estimate: our best estimate $X_0$, given by the maximum of the posterior, is equivalent to the solution that yields the greatest value for the probability of the observed data.

7 Assume a Gaussian Process for Data Acquisition
We can then write
$\mathrm{prob}(D_k|X,I) = \frac{1}{\sigma_k\sqrt{2\pi}}\exp\left(-\frac{(F_k - D_k)^2}{2\sigma_k^2}\right)$. (2)
We can rewrite this equation as
$\mathrm{prob}(D|X,I) \propto \exp\left(-\frac{\chi^2}{2}\right)$ with $\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{F_k - D_k}{\sigma_k}\right)^2$.
Residuals: the $R_k = \frac{F_k - D_k}{\sigma_k}$ are referred to as residuals.

8 Least-squares estimate
Take the logarithm of the posterior pdf:
$L = \ln(\mathrm{prob}(X|D,I)) = \mathrm{const} - \frac{\chi^2}{2}$. (3)
Since the maximum of the posterior occurs where $\chi^2$ is smallest, the corresponding optimal solution $X_0$ is called the least-squares estimate.
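As a minimal illustration of this least-squares idea (not from the slides; the data values below are invented), the sketch computes $\chi^2$ for a constant model $F_k = X$ and shows that its minimum coincides with the inverse-variance-weighted mean:

```python
import numpy as np

# Hypothetical measurements D_k with error bars sigma_k (made-up numbers).
D = np.array([1.2, 0.9, 1.1, 1.4])
sigma = np.array([0.1, 0.2, 0.1, 0.3])

def chi2(X):
    """chi^2 = sum_k ((F_k - D_k)/sigma_k)^2 for the constant model F_k = X."""
    R = (X - D) / sigma                      # residuals R_k
    return np.sum(R**2)

# Least-squares estimate from a grid scan ...
X_grid = np.linspace(0.5, 2.0, 1001)
X0_numeric = X_grid[np.argmin([chi2(X) for X in X_grid])]

# ... which agrees with the analytic solution of d(chi^2)/dX = 0,
# the inverse-variance-weighted mean.
w = 1.0 / sigma**2
X0_analytic = np.sum(w * D) / np.sum(w)
print(X0_numeric, X0_analytic)
```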

9 A simple ODE Model Problem

10 Example: Fitting parameters in an ODE model
We now consider a simple ODE of the form
$\frac{dy}{dt} = f(y) = -py, \quad y(0) = 1$. (4)
We have time-dependent data $Y_k$ with standard deviation $\sigma_k$ at time points $t_k$. We seek to estimate the decay rate $p$. As before, we write $L = \mathrm{const} - \frac{\chi^2}{2}$ with
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$. (5)

11 Example: Fitting parameters in an ODE model
We now seek to maximise the likelihood
$L = \mathrm{const} - \frac{\chi^2}{2}$, (6)
which is equivalent to minimising
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$. (7)
As discussed in the previous lecture, we can use the Newton-Raphson algorithm: given an initial estimate $p = p_1$, we iterate according to
$p_{N+1} = p_N - [\nabla\nabla L]^{-1}\,\nabla L$, (8)
where the RHS is evaluated at $p_N$, the estimate after $N-1$ iterations.

12 Example: Fitting parameters in an ODE model
By differentiating
$\chi^2 = \sum_k R_k^2 = \sum_k \left(\frac{y(t_k) - Y_k}{\sigma_k}\right)^2$ (9)
with respect to $p$ we obtain (ignoring second-order derivatives)
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$ (10)
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$. (11)
$\nabla\nabla L < 0$, so that we indeed obtain the maximal likelihood.

13 Example: Fitting parameters in an ODE model
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
So how can we determine the sensitivities $S_k = \frac{\partial y(t_k)}{\partial p}$? We notice that
$\frac{d}{dt}\frac{dy}{dp} = \frac{d}{dp}\frac{dy}{dt} = \frac{df(y)}{dp} = \frac{\partial f(y)}{\partial y}\frac{dy}{dp} + \frac{\partial f(y)}{\partial p} = J\frac{dy}{dp} + \frac{\partial f(y)}{\partial p}$ (12)

14 Example: Fitting parameters in an ODE model
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
$\frac{d}{dt}\left(\frac{dy}{dp}\right) = J\frac{dy}{dp} + \frac{\partial f(y)}{\partial p}$
The Jacobian $J$ and $\frac{\partial f(y)}{\partial p}$ can be calculated ahead of time, and the entire equation can then be integrated alongside the integration of the model ODE.

15 Example: Fitting parameters in an ODE model
$\frac{dy}{dt} = f(y) = -py; \quad y(0) = 1$
$\frac{dS}{dt} = JS + \frac{\partial f(y)}{\partial p} = -pS - y; \quad S(0) = 0$
with Jacobian $J = \frac{\partial f(y)}{\partial y} = -p$, $\frac{\partial f(y)}{\partial p} = -y$, and sensitivity $S = \frac{\partial y}{\partial p}$.
OPTIMIZATION:
$p_{N+1} = p_N - [\nabla\nabla L]^{-1}\,\nabla L$
$\nabla L(p_N) = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\frac{\partial y(t_k)}{\partial p}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\left(\frac{\partial y(t_k)}{\partial p}\right)^2$.
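The following Python sketch (not from the lecture; the data, noise level, and iteration counts are invented, and the signs follow the reconstruction above) integrates the model and sensitivity equations together and applies the Newton-Raphson update for the single parameter $p$:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Synthetic data: hypothetical noisy observations of y(t) = exp(-p_true * t).
rng = np.random.default_rng(0)
p_true, sigma = 0.7, 0.05
t_k = np.linspace(0.5, 5.0, 10)
Y_k = np.exp(-p_true * t_k) + sigma * rng.normal(size=t_k.size)

def augmented_rhs(t, z, p):
    """z = [y, S]: dy/dt = -p*y and dS/dt = J*S + df/dp = -p*S - y."""
    y, S = z
    return [-p * y, -p * S - y]

def solve(p):
    sol = solve_ivp(augmented_rhs, (0.0, t_k[-1]), [1.0, 0.0],
                    t_eval=t_k, args=(p,), rtol=1e-8, atol=1e-10)
    return sol.y[0], sol.y[1]        # y(t_k) and S(t_k) = dy/dp at t_k

p = 0.2                              # initial estimate p_1
for _ in range(20):                  # Newton-Raphson with the approximate Hessian
    y, S = solve(p)
    grad_L = -np.sum((y - Y_k) * S) / sigma**2   # nabla L
    hess_L = -np.sum(S**2) / sigma**2            # nabla nabla L (< 0)
    step = -grad_L / hess_L                      # -[nabla nabla L]^{-1} nabla L
    p += step
    if abs(step) < 1e-10:
        break

print("estimated decay rate:", p)
```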

16 Example: Fitting 2 parameters in an ODE model
We now estimate two parameter values for a simple ODE of the form
$\frac{dy}{dt} = f(y) = \rho - dy; \quad y(0) = 10$. (13)
The parameter vector is then given by $\vec{p} = [\rho, d]^T$.
OPTIMIZATION:
$\vec{p}_{N+1} = \vec{p}_N - [\nabla\nabla L]^{-1}\,\nabla L$
$\nabla L = -\frac{1}{2}\nabla\chi^2 = -\sum_k \frac{(y(t_k) - Y_k)}{\sigma_k^2}\begin{pmatrix} \frac{\partial y(t_k)}{\partial \rho} \\ \frac{\partial y(t_k)}{\partial d} \end{pmatrix}$
$\nabla\nabla L = -\frac{1}{2}\nabla\nabla\chi^2 = -\sum_k \frac{1}{\sigma_k^2}\begin{pmatrix} \left(\frac{\partial y(t_k)}{\partial \rho}\right)^2 & \frac{\partial y(t_k)}{\partial \rho}\frac{\partial y(t_k)}{\partial d} \\ \frac{\partial y(t_k)}{\partial \rho}\frac{\partial y(t_k)}{\partial d} & \left(\frac{\partial y(t_k)}{\partial d}\right)^2 \end{pmatrix}$.
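To make the vector and matrix structure above concrete, here is a small numpy sketch (the sensitivity and data values are placeholders, not lecture numbers) that assembles $\nabla L$ and $\nabla\nabla L$ from precomputed sensitivities and takes one Newton step:

```python
import numpy as np

# Placeholder values at three time points t_k.
y_model = np.array([9.0, 7.5, 6.4])      # y(t_k) at the current (rho, d)
Y_data  = np.array([8.8, 7.9, 6.1])      # measurements Y_k
sigma   = np.array([0.2, 0.2, 0.2])
dy_drho = np.array([0.4, 1.1, 1.8])      # sensitivities dy(t_k)/drho
dy_dd   = np.array([-1.9, -4.8, -7.0])   # sensitivities dy(t_k)/dd

S = np.vstack([dy_drho, dy_dd])                    # 2 x K sensitivity matrix
grad_L = -S @ ((y_model - Y_data) / sigma**2)      # nabla L (2-vector)
hess_L = -(S / sigma**2) @ S.T                     # nabla nabla L (2 x 2)

p = np.array([2.0, 0.4])                           # current [rho, d]
p_next = p - np.linalg.solve(hess_L, grad_L)       # Newton-Raphson update
print(p_next)
```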

17 Example: Likelihood function dependent on parameter values
$\frac{dy}{dt} = f(y) = \rho - dy; \quad y(0) = 10$.
(Figure panels: likelihood surfaces when optimizing the IC and $d$, and when optimizing the correlated production and decay rates.)

18 Identifiability

19 Identifiability
Given the structure of a model, is it possible to uniquely estimate its unknown parameters? What experimental data are required to achieve unique parameter identification? Whether the inverse problem is solvable depends on the mathematical model, the significance of the data, and the experimental errors.

20 Global versus Local Identifiability
Global identifiability: a parameter is globally identifiable if it can be uniquely determined, given the input profile and assuming continuous and error-free data for the observables of the model.
Local identifiability: if there is a countable number of solutions, the parameter is locally identifiable.
Unidentifiability: a parameter is unidentifiable if there exist uncountably many solutions.

21 Regression instability
Unidentifiable parameters might lead to regression instability.
Regression instability: a measure of the variation of regression results in the presence of data noise. A regression is unstable if the estimates of the model parameters differ significantly when one additional data point is added to the set of input data.

22 Structural Identifiability
A model is structurally identifiable if its parameters can be uniquely estimated by fitting the model to experimental data. Structural identifiability needs to be tested BEFORE the regression. If some of the necessary data are not available, then structural identifiability analysis reveals which parameters must be eliminated a priori. Importantly, the structural identifiability of a model is independent of the regression scheme that is employed.

23 Structural Identifiability
Structural identifiability is related to the sensitivity of the process output to parameter variations.
Sensitivity analysis: a sensitivity analysis evaluates the dependence of a system's output on a certain set of model parameters. Let $m = \{m_1, m_2, \ldots, m_M\}$ be a set of $M$ measurable outputs and $p = \{p_1, p_2, \ldots, p_P\}$ the set of $P$ unknown model parameters. The sensitivity coefficients $S^{m_i}_{p_j}$ of the outputs with respect to the parameters are defined as
$S^{m_i}_{p_j} = \frac{\partial m_i}{\partial p_j}, \quad i = 1, \ldots, M, \quad j = 1, \ldots, P$. (14)

24 Structural Identifiability
Structural identifiability: a model is structurally identifiable if its sensitivity matrix
$S = \begin{pmatrix} \frac{\partial m_1}{\partial p_1} & \cdots & \frac{\partial m_1}{\partial p_P} \\ \vdots & \ddots & \vdots \\ \frac{\partial m_M}{\partial p_1} & \cdots & \frac{\partial m_M}{\partial p_P} \end{pmatrix}$ (15)
satisfies two conditions: each column has at least one large entry (i.e. each parameter has a large impact on at least one experimental measurement), and the matrix has full rank (i.e. all columns must be linearly independent, which means that the effects of the parameters on the measurements must be independent of each other).
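A minimal numpy sketch (the sensitivity values are placeholders) of how the two conditions can be checked numerically, using the column magnitudes for the first condition and the matrix rank (or, more robustly, the smallest singular value) for the second:

```python
import numpy as np

# Hypothetical M x P sensitivity matrix S_ij = dm_i/dp_j (placeholder values).
S = np.array([[ 2.1, 0.0, 1.0],
              [-0.3, 0.0, 0.5],
              [ 1.2, 0.0, 0.6]])

# Condition 1: every parameter (column) should have at least one large entry.
col_max = np.max(np.abs(S), axis=0)
print("largest entry per column:", col_max)

# Condition 2: the columns should be linearly independent (full column rank).
rank = np.linalg.matrix_rank(S)
sing_vals = np.linalg.svd(S, compute_uv=False)
print("rank:", rank, "of", S.shape[1], "| smallest singular value:", sing_vals[-1])
# Here the second column is all zeros, so the matrix is rank deficient:
# that parameter would be structurally unidentifiable from these observables.
```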

25 Comments on the Sensitivity Analysis
As the sensitivities are state- and time-dependent, the way that the system is observed has a strong impact on the calculated sensitivities. This analysis can be used to suggest new, informative experiments. However, since the sensitivities depend on the initial guess of the parameter values, and sensitivities usually depend non-linearly on parameters, the results are valid only for a specific parameter set. The sensitivity analysis may have to be carried out over a large set of biologically plausible parameter values.

26 Practical Identifiability
A parameter that is structurally identifiable may still be practically nonidentifiable. This can arise from an insufficient amount or quality of experimental data, or from a poor choice of measurement time points. It manifests in a confidence interval that is infinite, although the likelihood has a unique minimum for this parameter.
Practical nonidentifiability: a parameter is practically nonidentifiable if the likelihood-based confidence region is infinitely extended in the direction of parameter $p_i$, indicated by the likelihood staying below a desired threshold.

27 Identifiability
LHS: Structural nonidentifiability: if parameters are correlated, then only relative values can be determined for the correlated parameters, since their effects compensate for each other.
Center: Practical nonidentifiability: in a 2D plot this can be visualized as a relatively flat valley, which is infinitely extended.
RHS: Identifiable.

28 Identifiability: profile likelihood
In the case of practical nonidentifiability (panels c, d), the corresponding profile likelihood reveals that the set threshold change in $\chi^2$ is not exceeded in one direction.
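A sketch of how a profile likelihood can be computed in practice (an illustration, not code from the lecture): fix the parameter of interest on a grid, re-optimize the remaining parameters at each grid point, and compare the resulting $\chi^2$ profile against a threshold such as the 95% quantile of the $\chi^2$ distribution with one degree of freedom:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2 as chi2_dist

def profile_likelihood(chi2_fun, p_opt, index, grid):
    """chi2_fun(p) -> chi^2 for the full parameter vector p.
    p[index] is fixed at each grid value; the other parameters are re-optimized."""
    others0 = np.delete(p_opt, index)
    profile = []
    for value in grid:
        def reduced(others):
            return chi2_fun(np.insert(others, index, value))
        res = minimize(reduced, others0, method="Nelder-Mead")
        profile.append(res.fun)
        others0 = res.x                      # warm start for the next grid point
    return np.asarray(profile)

# Toy chi^2 with two strongly correlated parameters (made-up example).
def chi2_fun(p):
    a, b = p
    return (a * b - 2.0) ** 2 / 0.1 ** 2 + 0.01 * (a - 1.0) ** 2

p_opt = np.array([1.0, 2.0])
grid = np.linspace(0.5, 2.0, 31)
profile = profile_likelihood(chi2_fun, p_opt, 0, grid)
threshold = chi2_fun(p_opt) + chi2_dist.ppf(0.95, df=1)
print("grid points inside the confidence region:", grid[profile < threshold])
```

For a practically nonidentifiable parameter the profile stays below the threshold over a wide (or unbounded) range, as in this toy example.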

29 Addressing a practical nonidentifiability
By increasing the amount and quality of the measured data and/or improving the choice of measurement time points $t_i$, stricter conditions are imposed on the parameter estimation. This leads to a tightening of the confidence intervals that will ultimately remediate a practical nonidentifiability, yielding finite confidence intervals.

30 General ODE Systems

31 A general ODE Model with many parameters
Consider now a dynamical system with $N$ state variables, which we describe by a set of ordinary differential equations:
$\frac{d\vec{x}(t)}{dt} = \vec{f}(\vec{x}(t), t, \vec{k}), \quad \vec{x}(t_0) = \vec{x}_0$. (16)
$\vec{x}(t)$: vector of all state variables; $\vec{k}$: vector of all parameters; $\vec{x}_0$: vector of all initial values (e.g. initial expression levels).

32 Observables
Often, the state variables cannot be directly observed. Rather, it is combinations of state variables, or relative quantities (or possibly even more complicated functions of the state variables), that are measured in experiments.

33 Observables
We specify an observation function $g: \mathbb{R}^N \to \mathbb{R}^M$ which maps the state variables $\vec{x}$ to a set of $M$ observables,
$\vec{y}(t) = g(\vec{x}(t), \vec{s})$ (17)
We require both $f(\cdot)$ and $g(\cdot)$ to be continuously differentiable functions with respect to their parameters. The vector $\vec{s}$ comprises the parameters of the observation function. Note that we may be able to only partially observe the system, such that $M < N$.

34 Parameters
The set of problem-specific parameters $\vec{p}$ includes the initial conditions, the model parameters, and the parameters that are specific to the measurements, i.e.
$\vec{p} = \{\vec{x}_0, \vec{k}, \vec{s}\}$. (18)
The initial values of the dynamical system are also parameters, as they are usually unknown.

35 Measurement Data
We denote the measured data by $y_{ij}$. Generally, these measurements $y_{ij}$ are subject to error, i.e. they are the sum of the observables $y_j(t_i)$ and a measurement error $\epsilon_{ij}$:
$y_{ij} = y_j(t_i) + \epsilon_{ij}$
We assume that the measurement errors $\epsilon_{ij}$ are independent across all observations and all time points, and that they follow independent Gaussian distributions, i.e.
$\epsilon_{ij} \sim N(0, \sigma_{ij})$ (19)
where $\sigma_{ij}$ is the standard deviation of the measurement errors.

36 Type of Error
Due to the law of large numbers,
$\epsilon_{ij} \sim N(0, \sigma_{ij})$ (20)
applies in many practical settings. However, other distributions, in particular log-normal distributions, are also encountered, for instance when protein concentrations are low. Therefore, normality should be checked in the course of data pre-processing, and the data should be transformed appropriately in case of deviations from normality. The observation function $g(\cdot)$ can take account of this transformation.
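As an illustration of this pre-processing step (not part of the slides; the Shapiro-Wilk test is just one common choice, and the data are simulated), one might test for normality and log-transform the data if needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical low-concentration measurements with multiplicative (log-normal) noise.
true_level = 5.0
data = true_level * rng.lognormal(mean=0.0, sigma=0.4, size=50)

# Shapiro-Wilk test on the raw and on the log-transformed data.
_, p_raw = stats.shapiro(data)
_, p_log = stats.shapiro(np.log(data))
print(f"p-value raw: {p_raw:.3g}, p-value log-transformed: {p_log:.3g}")

# If the log-transformed data look Gaussian, one would fit
# log y_ij = log g_j(x(t_i)) + eps_ij, i.e. absorb the log into g(.).
```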

37 Maximum Likelihood
The optimal parameter set is the one with the highest probability of observing the data, and can be determined by maximizing the likelihood $L$ of the data $\vec{y}$ with respect to the parameter set $\vec{p}$:
$L(\vec{y}\,|\,\vec{p}) = \prod_{i=1}^{T}\prod_{j=1}^{M}\frac{1}{\sigma_{ij}\sqrt{2\pi}}\exp\left(-\frac{1}{2}\frac{(g_j(\vec{x}(t_i,\vec{p}),\vec{p}) - y_{ij})^2}{\sigma_{ij}^2}\right)$

38 Log-likelihood
In practical terms, to find the maximum of the likelihood function the negative log-likelihood is minimized:
$-\log[L(\vec{y}\,|\,\vec{p})] = \sum_{i=1}^{T}\sum_{j=1}^{M}\left(\frac{1}{2}R_{ij}(\vec{p})^2 + c_{ij}\right)$, (21)
$R_{ij}(\vec{p}) = \frac{g_j(\vec{x}(t_i,\vec{p}),\vec{p}) - y_{ij}}{\sigma_{ij}}, \quad c_{ij} = \log[\sigma_{ij}\sqrt{2\pi}]$.
$R_{ij}$ is called a residual. The term $c_{ij}$ is independent of $\vec{p}$ and can be left out of the minimization.

39 The maximum likelihood estimator
The maximum likelihood estimator for the model parameters is thus given by
$-\log[L(\vec{y}\,|\,\vec{p})] \simeq \sum_{i=1}^{T}\sum_{j=1}^{M}\frac{1}{2}R_{ij}(\vec{p})^2$. (22)
In the case of independent Gaussian measurement errors, the parameters $\vec{p}$ can therefore be determined by least-squares minimization.

40 Minimization
Common methods to minimize $-\log[L(\vec{y}\,|\,\vec{p})]$ are gradient-based, such as the classical Newton or Levenberg-Marquardt methods. They are based on an iterative parameter update procedure which computes the next update step by exploiting the gradient of the weighted residuals $R_{ij}$.

41 Gradient of the weighted residuals $R_{ij}$
$\frac{\partial R_{ij}(\vec{p})}{\partial p_l} = \frac{1}{\sigma_{ij}}\frac{d g_j(\vec{x}(t_i,\vec{p}),\vec{p})}{d p_l} = \frac{1}{\sigma_{ij}}\left(\sum_{n=1}^{N}\left.\frac{\partial g_j}{\partial x_n}\right|_{t_i}\left.\frac{d x_n}{d p_l}\right|_{t_i} + \left.\frac{\partial g_j}{\partial p_l}\right|_{t_i}\right)$ (23)
$\frac{\partial g_j}{\partial x_n}$ and $\frac{\partial g_j}{\partial p_l}$ are the Jacobians of the observation function with respect to the state variables and with respect to the parameters. $S^n_{p_l} = \frac{d x_n}{d p_l}$ are the sensitivities of the state variables to changes in the parameter values that we discussed above.

42 Calculation of the Sensitivities
In general there is no analytic solution for the trajectories $\vec{x}(t, \vec{p})$, and therefore the sensitivities $S^n_{p_l} = \frac{d x_n}{d p_l}$ have to be calculated numerically. Naively, one may approximate them by finite differences, which is also the default in many optimization software packages. However, this approximation becomes computationally very expensive in the case of a high-dimensional parameter space, as it demands many integrations of the differential equations.

43 Calculation of the Sensitivities
Alternatively, the sensitivities can be computed by integrating the sensitivity equations (as discussed above) in parallel with the ODE model:
$\frac{d S^n_{p_l}}{dt} = \frac{d}{dt}\frac{d x_n}{d p_l} = \frac{d}{d p_l}\frac{d x_n}{dt} = \frac{d f_n(t, \vec{x}(t), \vec{k})}{d p_l} = \sum_{q=1}^{N}\frac{\partial f_n}{\partial x_q}\frac{d x_q}{d p_l} + \frac{\partial f_n}{\partial p_l}$
$S^n_{p_l}(0) = \frac{d x_n}{d p_l}(0) = \begin{cases} 1 & : p_l \in \{\vec{x}_0\} \\ 0 & : p_l \in \{\vec{s}, \vec{k}\} \end{cases}$

44 Practical Considerations
Since $f(\cdot)$ and $g(\cdot)$ are specified beforehand, their respective derivatives with respect to the parameters $\vec{p}$ can also be computed beforehand, either manually or, for more complex systems, by applying symbolic computation. As a side note, the Jacobians should also be provided to the ODE solvers for the differential equation and the sensitivity equations, since these will speed up the calculations considerably.
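A sketch of this idea (an illustration under assumed model and package choices, not the lecture's own code): sympy derives $\partial f/\partial x$ and $\partial f/\partial k$ symbolically for the production-decay model used above, and the resulting functions are used to integrate the sensitivity equations alongside the ODE:

```python
import numpy as np
import sympy as sp
from scipy.integrate import solve_ivp

# Model dy/dt = rho - d*y with parameters (rho, d); derive the Jacobians symbolically.
y, rho, d = sp.symbols("y rho d")
f = sp.Matrix([rho - d * y])
J_x = f.jacobian([y])              # df/dy
J_k = f.jacobian([rho, d])         # df/drho, df/dd

f_num = sp.lambdify((y, rho, d), f, "numpy")
Jx_num = sp.lambdify((y, rho, d), J_x, "numpy")
Jk_num = sp.lambdify((y, rho, d), J_k, "numpy")

def augmented(t, z, rho, d):
    """z = [y, S_rho, S_d]; dS/dt = (df/dy) S + df/dk."""
    yv, S = z[0], z[1:].reshape(1, 2)          # 1 state x 2 parameters
    dy = f_num(yv, rho, d).ravel()
    dS = Jx_num(yv, rho, d) @ S + Jk_num(yv, rho, d)
    return np.concatenate([dy, dS.ravel()])

z0 = [10.0, 0.0, 0.0]                          # y(0) = 10, sensitivities start at 0
sol = solve_ivp(augmented, (0, 10), z0, args=(2.0, 0.4), t_eval=np.linspace(0, 10, 6))
print(sol.y)                                   # rows: y, dy/drho, dy/dd
```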

45 Practical Considerations
Parameter optimization in high-dimensional parameter spaces crucially relies on an efficient and reliable computation of the sensitivities and of the gradients of the residuals. Solving for the sensitivities by numerical integration has the advantage of obtaining the sensitivities of certain system properties in parallel with the solution of the ODE model. The limitation of the approach lies in the large number of additional ODEs that need to be solved!

46 Workflow of gradient-based minimization procedures
Initialize model system and parameters
LOOP
  Integrate ODE and sensitivity equations
  Calculate Jacobian of residuals
  Calculate residuals
  IF change in norm(residuals) < threshold
    BREAK
  ELSE
    Update parameter values using current parameter values and Jacobian
  ENDIF
ENDLOOP
Calculate fit statistics, parameter variances and confidence limits

47 Workflow of gradient-based minimization procedures
The procedure starts with an initial set of parameter values. In each cycle of the iteration, the ODE system is solved and the gradient of the weighted residuals is calculated. Based on the current parameter values and the gradient, new parameter values are computed that reduce the sum of squared residuals (the residual norm). This update step lies at the heart of gradient-based minimization functions such as the Levenberg-Marquardt procedure in MATLAB's lsqnonlin. The iteration continues until the change in the residual norm is smaller than a predefined value; usually, the (relative) change in the norm of the residuals is compared against a predefined threshold to check for convergence of the minimization procedure.
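A compact Python equivalent of this workflow (a sketch only: MATLAB's lsqnonlin is replaced by scipy.optimize.least_squares, and the model, data, and tolerances are invented) that passes the weighted residuals and a Jacobian built from the sensitivities:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
t_k = np.linspace(0.5, 10.0, 15)
sigma = 0.1
# Synthetic data from dy/dt = rho - d*y with rho = 2, d = 0.4, y(0) = 10.
Y_k = 2.0 / 0.4 + (10 - 2.0 / 0.4) * np.exp(-0.4 * t_k) + sigma * rng.normal(size=t_k.size)

def simulate(p):
    rho, d = p
    def rhs(t, z):
        y, S_rho, S_d = z
        return [rho - d * y, -d * S_rho + 1.0, -d * S_d - y]
    sol = solve_ivp(rhs, (0.0, t_k[-1]), [10.0, 0.0, 0.0], t_eval=t_k, rtol=1e-8)
    return sol.y

def residuals(p):
    y, _, _ = simulate(p)
    return (y - Y_k) / sigma                       # weighted residuals R_k

def jacobian(p):
    _, S_rho, S_d = simulate(p)
    return np.column_stack([S_rho, S_d]) / sigma   # dR_k/dp from the sensitivities

fit = least_squares(residuals, x0=[1.0, 1.0], jac=jacobian, method="lm")
print("estimate:", fit.x, "| final cost:", fit.cost)
```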

48 Optimization Algorithms
1 Local Methods
Gradient-based Methods: Newton-Raphson Iterative Algorithm; Levenberg-Marquardt
Direct, derivative-free Methods: Simplex Methods; Nelder-Mead Method; Conjugate Gradient Method
2 Global Methods
Simulated Annealing; Evolutionary Algorithms

49 Example: TGF-β Model

50 Example: Time Courses
Sensitivity analysis of the TGF-β signaling model. (A) Time course of the stimulation protocol and the system output, the nuclear Smad2/Smad4 complex. (B) Time-resolved sensitivities of the parameters controlling the dynamics of the nuclear Smad2/Smad4 complex. For details see main text. (C) Clustergram of the steady-state control coefficients.

51 Performance of different optimization procedures
The table compares the MATLAB algorithm lsqnonlin to alternative minimizers in the Systems Biology Toolbox. [Table: average computation time (cputime × 10³), % correct matches, best performance ($\chi^2$), and max. iterations for NonLinLS, tribes, Gen. Alg., Simplex, Chc, PSO, Hill, and Cmaes.]
Different optimizers were run on the same data set. The first column shows the average computation time of each method. Every algorithm was run 5 times with different initial parameter values. The second column measures the number of times the algorithm produces an acceptable result ($\chi^2 < 30$). The third column is the average $\chi^2$ value at the optimum. The last column is the maximum number of iterations we allowed for each algorithm before it was stopped. Abbreviations: NonLinLS: MATLAB lsqnonlin routine. simplex: Downhill Simplex Method in Multidimensions. cmaes: (15,50)-Evolution Strategy with Covariance Matrix Adaptation. chc: Clustering multi-start hill climbing. ga: Standard Genetic Algorithm with elitism. hill: Multi-start hill climbing. pso: Particle Swarm Optimization with constriction. tribes: adaptive Particle Swarm Optimization.

52 Performance
Depending on the complexity of the model, the minimization step itself may be only a minor contributor to the total computation time, as is the case for the TGF-β model considered here. The main computational burden is created by the need to solve the system of ODEs for the state variables and for the sensitivities many times. This is why it is important to provide the Jacobian to the ODE solver. We notice that the different algorithms do, however, differ in how well the optimal parameter set can be estimated. We recommend comparing results from different optimizers.

53 Quality of the Estimate

54 The concept of best estimate and error bars
Consider the probability density function (pdf) for the parameter $X$,
$P = \mathrm{prob}(X|D,I)$. (24)
Then the best estimate $X_0$ is given by
$\left.\frac{dP}{dX}\right|_{X_0} = 0$, (25)
assuming that $X$ is a continuous parameter. Strictly speaking, we also require
$\left.\frac{d^2P}{dX^2}\right|_{X_0} < 0$. (26)

55 Logarithm of the pdf
To evaluate the reliability of the estimate we explore the neighborhood of the best estimate $X_0$ with a Taylor series. Rather than dealing directly with the posterior pdf $P$, which is a peaky and positive function, it is better to work with its logarithm $L$, since this varies much more slowly with $X$:
$L = \ln(\mathrm{prob}(X|D,I))$. (27)

56 Reliability of the estimate
Expanding $L = \ln(\mathrm{prob}(X|D,I))$ about the point $X = X_0$, we have
$L = L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2 + \ldots$
where the best estimate of $X$ is given by the condition
$\left.\frac{dL}{dX}\right|_{X_0} = 0$, (28)
because $L$ is a monotonic function of $P$.

57 Reliability of the estimate
$L = L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2 + \ldots$
The first term $L(X_0)$ is constant and says nothing about the shape of $P(X)$. The linear term is missing because $\left.\frac{dL}{dX}\right|_{X_0} = 0$. The quadratic term is therefore the dominant factor determining the width of the posterior pdf and plays a central role in the reliability analysis.

58 Ignoring all the higher-order contributions, we have
$L \approx L(X_0) + \frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2$
$P = \mathrm{prob}(X|D,I) = \exp(L) \approx A\,\exp\left[\frac{1}{2}\left.\frac{d^2L}{dX^2}\right|_{X_0}(X - X_0)^2\right]$
So the posterior pdf can be approximated by a Gaussian distribution,
$\mathrm{prob}(X|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]$
with
$\sigma^2 = -\left(\left.\frac{d^2L}{dX^2}\right|_{X_0}\right)^{-1}, \quad \mu = X_0$.
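A small numerical illustration of this result (an invented example, not from the slides): approximate $d^2L/dX^2$ at the maximum by a central finite difference and read off the error bar $\sigma = (-d^2L/dX^2)^{-1/2}$:

```python
import numpy as np

# Log-likelihood of a toy example: Gaussian data with known sigma_true,
# so the exact answer is sigma = sigma_true / sqrt(n_data).
sigma_true, data = 0.5, np.array([1.1, 0.8, 1.3, 0.9, 1.0])

def logL(X):
    return -0.5 * np.sum((data - X) ** 2) / sigma_true**2

X0 = data.mean()                      # maximum of L (flat prior)
h = 1e-4
d2L = (logL(X0 + h) - 2 * logL(X0) + logL(X0 - h)) / h**2   # second derivative
sigma_est = np.sqrt(-1.0 / d2L)

print("best estimate:", X0, "error bar:", sigma_est,
      "exact:", sigma_true / np.sqrt(data.size))
```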

59 Multi-dimensional Problems

60 Linear Models
We now consider a vector $\vec{X}$ of parameters. Assume that we have a model $f$ that is linear in the parameter values, such that
$F_k = f(\vec{X}, k) = \sum_{j=1}^{M} T_{k,j} X_j + C_k$ (29)
where the sum is over all $M$ parameters. In matrix-vector notation we then have
$\vec{F} = T\vec{X} + \vec{C}$. (30)

61 Recall
$L = \mathrm{const} - \frac{\chi^2}{2}$ with $\chi^2 = \sum_k R_k^2 = \sum_k\left(\frac{F_k - D_k}{\sigma_k}\right)^2$. (31)
We can then determine the first derivative of the likelihood function $L$ with respect to parameter $X_j$ as
$\frac{\partial L}{\partial X_j} = -\frac{1}{2}\frac{\partial\chi^2}{\partial X_j} = -\sum_{k=1}^{N}\frac{F_k - D_k}{\sigma_k^2}\frac{\partial F_k}{\partial X_j}$. (32)
Since
$F_k = f(\vec{X}, k) = \sum_{j=1}^{M} T_{k,j} X_j + C_k$ (33)
we have
$\frac{\partial F_k}{\partial X_j} = T_{kj} = \mathrm{const}$. (34)

62
$\frac{\partial L}{\partial X_j} = -\frac{1}{2}\frac{\partial\chi^2}{\partial X_j} = -\sum_{k=1}^{N}\frac{F_k - D_k}{\sigma_k^2}\frac{\partial F_k}{\partial X_j}$, (35)
$\frac{\partial F_k}{\partial X_j} = T_{kj} = \mathrm{const}$. (36)
For the second derivative we then have
$\frac{\partial^2 L}{\partial X_i\,\partial X_j} = -\sum_{k=1}^{N}\frac{T_{ki}T_{kj}}{\sigma_k^2} = \mathrm{const}$. (37)
All higher derivatives are therefore zero. We can therefore expand the likelihood function around its maximum $L_0$ as
$L = L_0 + \underbrace{\sum_i (\nabla L)_i (X_i - X_{i0})}_{=0} + \underbrace{\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}(\nabla\nabla L)_{ij}(X_i - X_{i0})(X_j - X_{0j})}_{\frac{1}{2}Q}$ (38)

63 Hessian Matrix $\nabla\nabla L$
$L = L_0 + \underbrace{\sum_i (\nabla L)_i (X_i - X_{i0})}_{=0} + \underbrace{\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}(\nabla\nabla L)_{ij}(X_i - X_{i0})(X_j - X_{0j})}_{\frac{1}{2}Q}$
Since we expand around the maximum $L_0$, the first derivative is by definition zero. But what about the second term, $Q$? For $L$ to have a maximum at $\vec{X}_0$ we, of course, require
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) < 0$ (39)
for all $\vec{X} \neq \vec{X}_0$.

64 Hessian Matrix must be negative definite
For $L$ to have a maximum at $\vec{X}_0$,
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) < 0$ (40)
for all $\vec{X} \neq \vec{X}_0$. This means that $(\nabla\nabla L)_{ij}$, the Hessian matrix of $L$, must be negative definite.
Negative definite matrix: a negative definite matrix is a Hermitian matrix all of whose eigenvalues are negative.

65 Conditions for maximum Likelihood
To evaluate $Q$ we seek to diagonalize the Hessian matrix of $L$. To diagonalize, we need to determine the eigenvalues $\lambda$ and eigenvectors $[x, y]^T$ of $(\nabla\nabla L)$. In the following we will write
$(\nabla\nabla L) = \begin{pmatrix} \frac{\partial^2 L}{\partial X_i^2} & \frac{\partial^2 L}{\partial X_i\,\partial X_j} \\ \frac{\partial^2 L}{\partial X_j\,\partial X_i} & \frac{\partial^2 L}{\partial X_j^2} \end{pmatrix} = \begin{pmatrix} A & C \\ C & B \end{pmatrix}$. (41)
Thus,
$\begin{pmatrix} A & C \\ C & B \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \lambda\begin{pmatrix} x \\ y \end{pmatrix}$. (42)
$Q < 0$ for all $(\vec{X} - \vec{X}_0)$ if $\lambda_1, \lambda_2 < 0$, i.e. if the trace of $(\nabla\nabla L)$ is negative ($A + B < 0$) and the determinant positive ($AB > C^2$).
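A quick numpy check of these conditions (example values invented): the eigenvalue test for negative definiteness agrees with the trace and determinant criterion in the 2x2 case:

```python
import numpy as np

# Hypothetical 2x2 Hessian of L at the maximum, [[A, C], [C, B]].
H = np.array([[-4.0, 1.5],
              [ 1.5, -2.0]])

eigvals = np.linalg.eigvalsh(H)            # H is symmetric, so use eigvalsh
neg_definite = np.all(eigvals < 0)

A, B, C = H[0, 0], H[1, 1], H[0, 1]
trace_det_test = (A + B < 0) and (A * B > C**2)

print("eigenvalues:", eigvals)
print("negative definite (eigenvalues):", neg_definite)
print("negative definite (trace/determinant):", trace_det_test)
```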

66 The contour of Q
Recall that we can rewrite
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) = (\vec{X} - \vec{X}_0)^T E\Lambda E^{-1}(\vec{X} - \vec{X}_0) = \sum_{j=1}^{n}\lambda_j\left(E^{-1}(\vec{X} - \vec{X}_0)\right)_j^2$ (43)
where $E$ is a square matrix whose columns are the eigenvectors of $[\nabla\nabla L]$, in any order, and $\Lambda$ is the diagonal matrix such that $\Lambda_{ii}$ is the eigenvalue associated with column $i$ of $E$.

67 Derivation
We first transform to the new coordinate system spanned by the eigenvectors, such that
$\vec{x} = E\vec{y}$ (44)
where the columns of $E$ are the normalised eigenvectors of $[\nabla\nabla L]$. Since the eigenvectors are orthogonal to each other, this means that
$(E^T E)_{ij} = \delta_{ij}$ and $(E^T[\nabla\nabla L]E)_{ij} = \lambda_j\delta_{ij}$, (45)
where $\delta_{ij} = 1$ if $i = j$ and zero otherwise.

68 Derivation
We now write $\vec{x} = (\vec{X} - \vec{X}_0) = E\vec{y}$. With this change of variables, we have
$Q = (\vec{X} - \vec{X}_0)^T[\nabla\nabla L](\vec{X} - \vec{X}_0) = \vec{x}^T[\nabla\nabla L]\vec{x} = (E\vec{y})^T[\nabla\nabla L](E\vec{y}) = \vec{y}^T(E^T[\nabla\nabla L]E)\vec{y} = \sum_{j=1}^{2}\lambda_j y_j^2$, with $\vec{y} = E^{-1}\vec{x}$. (46)
We now see that the contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$, with axes in the direction of the eigenvectors.

69 The contour of Q
Suppose we want to determine all points $\vec{x} = (\vec{X} - \vec{X}_0) = E\vec{y}$ for which $Q = k$. The contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$. The principal axes of the ellipse correspond to the eigenvectors of $[\nabla\nabla L]$. Within our quadratic approximation the posterior pdf is constant on each level line. For a given contour level ($Q = k$) the size of the ellipse is determined by the eigenvalues.

70 The contour of Q
The contour of $Q$ is an ellipse, centered at $(X_0, Y_0)$. The principal axes of the ellipse correspond to the eigenvectors of $[\nabla\nabla L]$. Within our quadratic approximation the posterior pdf is constant on each level line. For a given contour level ($Q = k$) the size of the ellipse is determined by the eigenvalues.

71 Variance & Co-Variance
We are now interested in the variances and co-variances to determine the quality of the parameter estimate.
Variance & co-variance: the covariance between two jointly distributed real-valued random variables $X$ and $Y$ with finite second moments is defined as
$\sigma(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$, (47)
where $E[X]$ is the expected value of $X$, also known as the mean of $X$.

72 Variance & Co-Variance
By using the linearity property of expectations, this can be simplified to
$\sigma(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY - X E[Y] - E[X] Y + E[X]E[Y]] = E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] = E[XY] - E[X]E[Y]$. (48)

73 Variance & Co-Variance
For random vectors $\mathbf{X}$ and $\mathbf{Y}$ (both of dimension $m$) the $m \times m$ cross-covariance matrix (also known as dispersion matrix or variance-covariance matrix, or simply called covariance matrix) is equal to
$\sigma(\mathbf{X}, \mathbf{Y}) = E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{Y} - E[\mathbf{Y}])^T\right]$ (49)
$= E\left[\mathbf{X}\mathbf{Y}^T\right] - E[\mathbf{X}]\,E[\mathbf{Y}]^T$, (50)
where $\mathbf{m}^T$ is the transpose of the vector (or matrix) $\mathbf{m}$.

74 Variance & Co-Variance
The $(i,j)$-th element of the matrix $\sigma(\mathbf{X}, \mathbf{Y})$ is equal to the covariance $\mathrm{Cov}(X_i, Y_j)$ between the $i$-th scalar component of $\mathbf{X}$ and the $j$-th scalar component of $\mathbf{Y}$. Note that $\mathrm{Cov}(\mathbf{Y}, \mathbf{X})$ is the transpose of $\mathrm{Cov}(\mathbf{X}, \mathbf{Y})$. For a vector $\mathbf{X} = [X_1\ X_2\ \ldots\ X_m]^T$ of $m$ jointly distributed random variables with finite second moments, its covariance matrix is defined as
$\Sigma(\mathbf{X}) = \sigma(\mathbf{X}, \mathbf{X})$. (51)
Random variables whose covariance is zero are called uncorrelated.

75 Variance & Co-Variance
We first estimate the variance of $X$, $\sigma_X^2$. If we are interested only in $X$, then we can integrate out $Y$ as a nuisance parameter (writing $x = X - X_0$ and $y = Y - Y_0$):
$\mathrm{prob}(X|D,I) = \int \mathrm{prob}(X,Y|D,I)\,dy = \int \exp(L)\,dy = \int \exp\left(L_0 + \frac{Q}{2}\right)dy$
$= \int \exp\left(L_0 + \frac{1}{2}(Ax^2 + By^2 + 2Cxy)\right)dy = \exp\left(L_0 + \frac{1}{2}Ax^2\right)\int \exp\left(\frac{1}{2}(By^2 + 2Cxy)\right)dy$

76 To complete the square, we substitute
$By^2 + 2Cxy = B\left(y + \frac{Cx}{B}\right)^2 - \frac{(Cx)^2}{B}$, (52)
such that
$\mathrm{prob}(X|D,I) \propto \exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right)\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy$.
$\phi = \frac{Cx}{B}$ is an irrelevant offset, as the integral extends from minus infinity to plus infinity.

77 Recall that
$\int \exp\left(-\frac{y^2}{2\sigma^2}\right)dy = \sigma\sqrt{2\pi}$. (53)
Thus,
$\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy = \sqrt{\frac{2\pi}{-B}}$ with $\sigma^2 = -\frac{1}{B}$ (54)
(recall that $B < 0$).

78 We then have
$\mathrm{prob}(X|D,I) \propto \exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right)\int \exp\left(\frac{1}{2}B(y + \phi)^2\right)dy$
$= \sqrt{\frac{2\pi}{-B}}\,\exp\left(\frac{1}{2}\left(A - \frac{C^2}{B}\right)x^2\right) = \sqrt{\frac{2\pi}{-B}}\,\exp\left(-\frac{x^2}{2\sigma_X^2}\right)$ (55)
with
$\sigma_X^2 = -\frac{B}{AB - C^2}$. (56)

79 Variance & Co-Variance
We thus have for the variance in $X$:
$\sigma_X^2 = -\frac{B}{AB - C^2}$. (57)
$\sigma_Y^2$ can be obtained in a similar way by integrating out $X$:
$\sigma_Y^2 = -\frac{A}{AB - C^2}$. (58)
Moreover,
$\sigma_{X,Y}^2 = \frac{C}{AB - C^2}$. (59)

80 Co-variance Matrix
$\begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix} = \frac{1}{AB - C^2}\begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -\begin{pmatrix} A & C \\ C & B \end{pmatrix}^{-1} = [-\nabla\nabla L]^{-1}$
$AB - C^2$ is the determinant of $[\nabla\nabla L]$. For such a real symmetric matrix, this is equal to the product of its eigenvalues. Consequently, if either $\lambda_1$ or $\lambda_2$ becomes very small, so that the ellipse is extremely elongated in one of its principal directions, then $AB - C^2 \to 0$; consequently, apart from the special case when $C = 0$ (no co-variance), $\sigma_x$ and $\sigma_y$ will be huge.
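A short numpy illustration of this relation (the Hessian values are made up): the parameter variances and covariance follow directly from inverting $-\nabla\nabla L$, and the explicit 2x2 formulas give the same numbers:

```python
import numpy as np

# Hypothetical Hessian of L at the maximum: [[A, C], [C, B]] (negative definite).
A, B, C = -4.0, -2.0, 1.5
hess_L = np.array([[A, C],
                   [C, B]])

cov = np.linalg.inv(-hess_L)              # covariance matrix = [-nabla nabla L]^{-1}
sigma_x, sigma_y = np.sqrt(np.diag(cov))  # marginal error bars
sigma_xy = cov[0, 1]

# Same result from the explicit 2x2 formulas above.
det = A * B - C**2
print(cov)
print(sigma_x**2, -B / det, "|", sigma_y**2, -A / det, "|", sigma_xy, C / det)
```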

81 Co-variance Matrix
Consequently, if either $\lambda_1$ or $\lambda_2$ becomes very small, so that the ellipse is extremely elongated in one of its principal directions, then $AB - C^2 \to 0$; consequently, apart from the special case when $C = 0$ (no co-variance), $\sigma_x$ and $\sigma_y$ will be huge.

82 Correlation
The correlation coefficient $\rho(X, Y)$ between two random variables $X$ and $Y$ with expected values $E[X] = \mu_X$ and $E[Y] = \mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as
$\rho_{X,Y} = \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y}$. (60)

83 Weak Correlation
When the correlation is negligible, $\sigma_{XY}^2 \ll \sigma_x^2\sigma_y^2$.

84 Strong Correlation
$\begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix} = \frac{1}{AB - C^2}\begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -\begin{pmatrix} A & C \\ C & B \end{pmatrix}^{-1}$
An extreme situation is obtained for $C^2 = AB$. In that case the elliptical contours will be infinitely wide in one direction (with only the information of the prior preventing the catastrophe!) and oriented at an angle $\tan^{-1}\sqrt{A/B}$ with respect to the $X$-axis. Although the error bars $\sigma_X$ and $\sigma_Y$ will be huge, i.e. the individual estimates of $X$ and $Y$ will be unreliable, the large off-diagonal elements of the co-variance matrix tell us that we can still infer a linear combination of these parameters very well.

85 Co-variance & Correlation
In the first image the co-variance (correlation) is zero. In the second image, the co-variance is large and negative ($Y + mX = \mathrm{const}$). In the third image, the co-variance is large and positive ($Y - mX = \mathrm{const}$).

86 Error Bars
An illustration of the difference between the (correct) marginal error bar $\sigma_{ii}$ and its misleading best-fit counterpart, indicated by the bold line in the middle. The former, correct one is given by the $i$-th diagonal element of $[-\nabla\nabla L]^{-1}$, while the latter is related to the inverse of $\frac{\partial^2 L}{\partial X_i^2}$.

87 Asymmetric and multimodal posterior pdfs
A schematic illustration of two posterior pdfs for which the multivariate Gaussian is NOT a good approximation.

88 Asymmetric and multimodal posterior pdfs
Here the concepts of error bar and linear correlation associated with the covariance matrix are not suitable. We might think of generalizing the idea of confidence intervals to a multidimensional space, but it will usually be hard to describe the surface of the (smallest) hypervolume containing, say, 90% of the probability in just a few numbers. The situation may be even worse when the pdf has several maxima. This relates to the challenge of finding global maxima in a large multidimensional parameter space when there are many small local hillocks.

89 Co-Variance Matrix & Hessian Matrix
We notice that
$[\sigma^2]_{ij} = [(-\nabla\nabla L)^{-1}]_{ij} = 2\,[\nabla\nabla(\chi^2)]^{-1}_{ij} = [H^{-1}]_{ij}$ (61)
where $H$ refers to the Hessian matrix. $[\sigma^2]_{ij}$ is referred to as the covariance matrix $C$:
$C = \langle(X_i - X_{i0})(X_j - X_{0j})\rangle = [\sigma^2]_{ij} = [(-\nabla\nabla L)^{-1}]_{ij} = 2\,[\nabla\nabla(\chi^2)]^{-1}_{ij} = H^{-1}$ (62)
In conclusion, the likelihood function (and thus the posterior pdf) is defined completely by the optimal solution $\vec{X}_0$ and the second derivative of $L$ at the maximum $L_0$, which corresponds to the covariance matrix $C$.
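In a least-squares setting this is easy to evaluate numerically; the sketch below (an illustration with made-up numbers, using the Gauss-Newton approximation $\nabla\nabla\chi^2 \approx 2J^TJ$, where $J$ is the Jacobian of the weighted residuals) obtains the covariance matrix and standard errors of the fitted parameters:

```python
import numpy as np

# Hypothetical Jacobian of the weighted residuals R_k with respect to the
# two parameters, evaluated at the optimum (K = 5 data points, P = 2 parameters).
J = np.array([[ 0.5, -1.0],
              [ 0.9, -1.8],
              [ 1.2, -2.1],
              [ 1.4, -2.2],
              [ 1.5, -2.2]])

# Gauss-Newton approximation: nabla nabla chi^2 ~ 2 J^T J,
# so the covariance matrix is C = 2 [nabla nabla chi^2]^{-1} ~ (J^T J)^{-1}.
cov = np.linalg.inv(J.T @ J)
std_errors = np.sqrt(np.diag(cov))
correlation = cov[0, 1] / (std_errors[0] * std_errors[1])

print("covariance matrix:\n", cov)
print("standard errors:", std_errors, "| correlation:", correlation)
```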

90 Example 1: Fitting two parameters of a straight line
In the next step we want to estimate both $m$ and $c$. We use the vector-matrix formulation and write
$\begin{pmatrix} m_{N+1} \\ c_{N+1} \end{pmatrix} = \begin{pmatrix} m_N \\ c_N \end{pmatrix} - [\nabla\nabla L]^{-1}\,\nabla L$, (63)
where
$\nabla L = -\begin{pmatrix} \sum_k \frac{(m_N x_k + c_N - Y_k)x_k}{\sigma_k^2} \\ \sum_k \frac{m_N x_k + c_N - Y_k}{\sigma_k^2} \end{pmatrix}, \quad \nabla\nabla L = -\begin{pmatrix} \sum_k \frac{x_k^2}{\sigma_k^2} & \sum_k \frac{x_k}{\sigma_k^2} \\ \sum_k \frac{x_k}{\sigma_k^2} & \sum_k \frac{1}{\sigma_k^2} \end{pmatrix}$. (64)

91 As seen above, the variance of the estimate can be determined as
$\sigma^2 = [-\nabla\nabla L]^{-1} = \frac{1}{\sum_k \frac{x_k^2}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2} - \left(\sum_k \frac{x_k}{\sigma_k^2}\right)^2}\begin{pmatrix} \sum_k \frac{1}{\sigma_k^2} & -\sum_k \frac{x_k}{\sigma_k^2} \\ -\sum_k \frac{x_k}{\sigma_k^2} & \sum_k \frac{x_k^2}{\sigma_k^2} \end{pmatrix}$ (65)
If we do not know the error bars for the data, then we might do the analysis by assuming that they are all of the same size, i.e. $\sigma_k = \sigma$, and replace $\sigma$ by the estimate $S$:
$S^2 = \frac{1}{N-1}\sum_k (m_0 x_k + c_0 - Y_k)^2$ (66)
where $m_0$, $c_0$ denote the best estimates of the parameters.
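A compact numpy version of this straight-line example (with synthetic data invented for illustration): solve the weighted normal equations and read the parameter variances from the inverse of $-\nabla\nabla L$:

```python
import numpy as np

rng = np.random.default_rng(4)
m_true, c_true = 1.5, 0.3
x = np.linspace(0, 5, 20)
sigma = np.full_like(x, 0.2)
Y = m_true * x + c_true + sigma * rng.normal(size=x.size)

w = 1.0 / sigma**2
# -nabla nabla L for the straight-line model F_k = m*x_k + c.
neg_hess = np.array([[np.sum(w * x**2), np.sum(w * x)],
                     [np.sum(w * x),    np.sum(w)]])
rhs = np.array([np.sum(w * x * Y), np.sum(w * Y)])

m0, c0 = np.linalg.solve(neg_hess, rhs)        # least-squares estimates
cov = np.linalg.inv(neg_hess)                  # covariance matrix of (m, c)
sigma_m, sigma_c = np.sqrt(np.diag(cov))

print(f"m = {m0:.3f} +/- {sigma_m:.3f}, c = {c0:.3f} +/- {sigma_c:.3f}")
```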

92 Summary & Outlook
1 Pre-regression Diagnostics: Identifiability
2 Bayesian Inference
3 Optimization Algorithms
Next Step: Post-regression Diagnostics

93 Thanks!!
Thanks for your attention! Slides for this talk will be available at:


More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An

More information

Lecture 11. Probability Theory: an Overveiw

Lecture 11. Probability Theory: an Overveiw Math 408 - Mathematical Statistics Lecture 11. Probability Theory: an Overveiw February 11, 2013 Konstantin Zuev (USC) Math 408, Lecture 11 February 11, 2013 1 / 24 The starting point in developing the

More information

A Bayesian Treatment of Linear Gaussian Regression

A Bayesian Treatment of Linear Gaussian Regression A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Signal Processing - Lecture 7

Signal Processing - Lecture 7 1 Introduction Signal Processing - Lecture 7 Fitting a function to a set of data gathered in time sequence can be viewed as signal processing or learning, and is an important topic in information theory.

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University Lecture 19 Modeling Topics plan: Modeling (linear/non- linear least squares) Bayesian inference Bayesian approaches to spectral esbmabon;

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

Computational Methods. Least Squares Approximation/Optimization

Computational Methods. Least Squares Approximation/Optimization Computational Methods Least Squares Approximation/Optimization Manfred Huber 2011 1 Least Squares Least squares methods are aimed at finding approximate solutions when no precise solution exists Find the

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

CS229 Lecture notes. Andrew Ng

CS229 Lecture notes. Andrew Ng CS229 Lecture notes Andrew Ng Part X Factor analysis When we have data x (i) R n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting,

More information

Lecture 6: Selection on Multiple Traits

Lecture 6: Selection on Multiple Traits Lecture 6: Selection on Multiple Traits Bruce Walsh lecture notes Introduction to Quantitative Genetics SISG, Seattle 16 18 July 2018 1 Genetic vs. Phenotypic correlations Within an individual, trait values

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 Petros Koumoutsakos Gerardo Tauriello (Last update: July 27, 2015) IMPORTANT DISCLAIMERS 1. REFERENCES: Much of the material

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

ECE 636: Systems identification

ECE 636: Systems identification ECE 636: Systems identification Lectures 3 4 Random variables/signals (continued) Random/stochastic vectors Random signals and linear systems Random signals in the frequency domain υ ε x S z + y Experimental

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 10 December 17, 01 Silvia Masciocchi, GSI Darmstadt Winter Semester 01 / 13 Method of least squares The method of least squares is a standard approach to

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information