Data Assimilation for Weather Forecasting: Reducing the Curse of Dimensionality
Philippe Toint (with S. Gratton, S. Gürol, M. Rincon-Camacho, E. Simon and J. Tshimanga)
University of Namur, Belgium
Leverhulme Lecture IV, Oxford, December 2015
Thanks
Leverhulme Trust, UK
Balliol College, Oxford
Belgian Fund for Scientific Research (FNRS)
University of Namur, Belgium
Philippe Toint (naxys) Oxford, December 2015 2 / 45
Talk outline
1. The 4DVAR approach to data assimilation: the context; the formulation and algorithmic approach; using the background covariance; range-space iterative solvers; impact of nonlinearity; conclusions for Section 1
2. Data thinning: observation hierarchy; computational examples; conclusions for Section 2
3. Final comments and some references
Data assimilation in earth sciences
The state of the atmosphere or the ocean (the system) is characterized by state variables, classically designated as fields: velocity components, pressure, density, temperature, salinity. A dynamical model predicts the state of the system at a given time from the state at an earlier time. We address here this estimation problem. Applications are found in climate, meteorology, ocean, neutronics, hydrology and seismic (forecasting) problems, involving large computers and nearly real-time computations.
Data assimilation in weather forecasting (2)
Data: temperature, wind, pressure,... everywhere and at all times! May involve more than 1,000,000,000 variables!
Data assimilation in weather forecasting (3)
The principle: minimize the error between model and past observations.
Situation 2.5 days ago and background prediction
Temperature for the last 2.5 days
Run the model to minimize the gap between model and observations
Predict tomorrow's temperature
(Figure: temperature over the past 2.5 days, observations, background and fitted model trajectory.)
$$\min_{x_0}\ \tfrac{1}{2}\|x_0 - x_b\|^2_{B^{-1}} + \tfrac{1}{2}\sum_{i=0}^{N}\|H M(t_i, x_0) - b_i\|^2_{R_i^{-1}}$$
Four-Dimensional Variational (4D-Var) formulation
Very large-scale nonlinear weighted least-squares problem:
$$\min_{x\in\mathbb{R}^n} f(x) = \tfrac{1}{2}\|x - x_b\|^2_{B^{-1}} + \tfrac{1}{2}\sum_{j=0}^{N}\|H_j(M_j(x)) - y_j\|^2_{R_j^{-1}}$$
where:
size of real (operational) problems: $x, x_b \in \mathbb{R}^{10^9}$, $y_j \in \mathbb{R}^{10^5}$
the observations $y_j$ and the background $x_b$ are noisy
$M_j$ are (nonlinear) model operators
$H_j$ are (nonlinear) observation operators
$B$ is the background error covariance matrix
$R_j$ are the observation error covariance matrices
Incremental 4D-Var
Rewrite the problem as
$$\min_{x\in\mathbb{R}^n} f(x) = \tfrac{1}{2}\|\rho(x)\|_2^2.$$
Incremental 4D-Var = inexact/truncated Gauss-Newton algorithm: linearize $\rho$ around the current iterate $\bar x$ and solve
$$\min_{x\in\mathbb{R}^n} \tfrac{1}{2}\|\rho(\bar x) + J(\bar x)(x - \bar x)\|_2^2$$
where $J(\bar x)$ is the Jacobian of $\rho(x)$ at $\bar x$. This means solving a sequence of linear systems (normal equations)
$$J^T(\bar x)\,J(\bar x)\,(x - \bar x) = -J^T(\bar x)\,\rho(\bar x)$$
whose matrix is symmetric positive definite and varies along the iterations.
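The truncated Gauss-Newton loop above can be sketched as follows; a direct solve stands in for the inner CG iteration, and the toy residual is a hypothetical stand-in for the 4D-Var $\rho$:

```python
import numpy as np

def gauss_newton(rho, jac, x0, n_iter=10):
    """Full-step Gauss-Newton: linearize rho at the current iterate and
    solve the normal equations J^T J dx = -J^T rho for the step."""
    x = x0.copy()
    for _ in range(n_iter):
        r, J = rho(x), jac(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)  # in 4D-Var: the inner CG loop
        x = x + dx
    return x

# Hypothetical toy residual (Rosenbrock-type), not the 4D-Var rho:
rho = lambda x: np.array([x[0] - 1.0, 10.0 * (x[1] - x[0] ** 2)])
jac = lambda x: np.array([[1.0, 0.0], [-20.0 * x[0], 10.0]])
x_star = gauss_newton(rho, jac, np.array([-1.2, 1.0]))
```

In an operational system the normal equations are never formed explicitly; only products with $J$ and $J^T$ are available, which is why Krylov solvers appear below.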
Inner minimization
Ignoring superscripts, minimizing
$$J(\delta x_0) = \tfrac{1}{2}\|\delta x_0 - [x_b - x_0]\|^2_{B^{-1}} + \tfrac{1}{2}\|H\delta x_0 - d\|^2_{R^{-1}}$$
amounts to iteratively solving
$$(B^{-1} + H^T R^{-1} H)\,\delta x_0 = B^{-1}(x_b - x_0) + H^T R^{-1} d,$$
whose exact solution is
$$\delta x_0 = x_b - x_0 + (B^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}\big(d - H(x_b - x_0)\big),$$
or equivalently (using the Sherman-Morrison-Woodbury formula)
$$\delta x_0 = x_b - x_0 + B H^T (R + H B H^T)^{-1}\big(d - H(x_b - x_0)\big).$$
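The equivalence of the model-space and observation-space formulas can be checked numerically; the sizes and the random SPD $B$ below are illustrative assumptions, not operational values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                                   # toy state and observation sizes
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)                   # SPD background error covariance
H = rng.standard_normal((m, n))               # linearized observation operator
R = np.eye(m)                                 # observation error covariance
d = rng.standard_normal(m)                    # innovation, taking x_0 = x_b

Bi = np.linalg.inv(B)
# model-space (primal) solution of (B^-1 + H^T R^-1 H) dx = H^T R^-1 d
dx_primal = np.linalg.solve(Bi + H.T @ np.linalg.solve(R, H),
                            H.T @ np.linalg.solve(R, d))
# observation-space (dual) solution via Sherman-Morrison-Woodbury
dx_dual = B @ H.T @ np.linalg.solve(R + H @ B @ H.T, d)
```

The dual form only needs solves of size $m$ (number of observations), which is the point exploited by the range-space solvers below.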
Solving the 4D-VAR subproblem
That is: $(I + B H^T R^{-1} H)\,x = B H^T R^{-1} d$
In practice:
use conjugate gradients (or another Krylov-space solver, more later on this) for a (very) limited number of level-2 iterations (with preconditioning, more later on this)
need products of the type $(I + B H^T R^{-1} H)v$ for a number of vectors $v$
Focus now on how to compute $Bv$ ($B$ large)
Modelling covariance
A widely used approach (Derber and Rosati, 1989; Weaver and Courtier, 2001):
spatial background correlation ≈ diffusion process, i.e. computing $Bv$ ≈ integrating a diffusion equation starting from the state $v$:
use $p$ steps of an implicit integration scheme (a level-3 iteration, each step involving a linear solve!!!)
The integration iteration
Define $\Theta_h = I - \frac{L^2}{2p}\Delta_h$ ($\Delta_h$ is the discrete Laplacian, $L$ is the correlation length). For each integration ($z$ and $p$ given):
1. $u_0 = \mathrm{diag}(\Theta_h^{-p})^{-1/2}\, z$ (diagonal scaling)
2. $u_l = \Theta_h^{-1} u_{l-1}$ $(l = 1,\ldots,p)$
3. $Bz = \mathrm{diag}(\Theta_h^{-p})^{-1/2}\, u_p$ (diagonal scaling)
Our question: how to solve $\Theta_h u_l = u_{l-1}$?
The integration iteration
A (practically important) question: how to solve $\Theta_h u_l = u_{l-1}$?
Carrier and Ngodock (2010): implicit integration + CG is 5 times faster than explicit integration!
But:
What about multigrid?
Does an approximate solution of the system (CG or MG) alter the spatial properties of the correlation?
Inexact solves?
Approximately diffusing a Dirac pulse
Compare the diffusion of a Dirac pulse using approximate linear solvers and an exact factorization, as a function of the correlation length.
(Figures: correlation vs. x for L = 4% (λmax = 1.7669), L = 8% (λmax = 4.0678) and L = 11% (λmax = 7.1356); curves for mg, cg2, cg3, cg4 and direct.)
Note: cost(1 MG V-cycle) ≈ cost(4 CG iterations)
Comparing the computational costs of CG vs MG
Consider a complete data-assimilation exercise on a 2D shallow-water system (3 level-1 iterations, 15 level-2 iterations, p = 6, tol = 10^-4).
(Figure: number of normalized matrix-vector products as a function of problem size.)
MG is an interesting alternative to CG in the integration loop
Solving the Gauss-Newton model: PSAS
System matrix (after level-1 preconditioning) = low (???) rank perturbation of the identity
1. Very popular when there are few observations compared to model variables. Stimulated a lot of discussion, e.g. in the ocean and atmosphere communities (cf. P. Gauthier)
2. Relies on $\delta x_0 = x_b - x_0 + B H^T (R + H B H^T)^{-1}\big(d - H(x_b - x_0)\big)$
3. Iteratively solve $(I + R^{-1} H B H^T)\, w = R^{-1}\big(d - H(x_b - x_0)\big)$ for $w$
4. Set $\delta x_0 = x_b - x_0 + B H^T w$
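Steps 3-4 can be sketched with a plain CG solve; for simplicity we solve the symmetric form $(R + HBH^T)w = d - H(x_b - x_0)$, which has the same solution $w$ as the $R$-preconditioned system above. All sizes and matrices are toy assumptions:

```python
import numpy as np

def cg(matvec, b, tol=1e-12, maxit=500):
    """Plain conjugate gradients for an SPD operator given as a mat-vec."""
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy(); rs = r @ r
    for _ in range(maxit):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
n, m = 12, 4                                  # toy sizes
A = rng.standard_normal((n, n)); B = A @ A.T + n * np.eye(n)
H = rng.standard_normal((m, n)); R = 0.5 * np.eye(m)
xb_minus_x0 = rng.standard_normal(n); d = rng.standard_normal(m)

# iterate in observation space (size m), then map back to state space (size n)
rhs = d - H @ xb_minus_x0
w = cg(lambda v: R @ v + H @ (B @ (H.T @ v)), rhs)   # (R + HBH^T) w = rhs
dx0 = xb_minus_x0 + B @ (H.T @ w)
```

Only vectors of length $m$ are carried through the iteration, which is the attraction of PSAS when $m \ll n$.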
Experiments
(Figure: cost function (10^2 to 10^5) versus iterations (0 to 80) for RPCG and PSAS; size(B) = 1024, size(R) = 256.)
Motivation: PSAS and CG-like algorithms
1. CG minimizes the incremental 4D-Var function during its iterations: it minimizes a quadratic approximation of the nonquadratic function using Gauss-Newton in the model space.
2. PSAS does not minimize the incremental 4D-Var function during its iterations, but works in the observation space.
Our goal: combine the advantages of both approaches:
the variational property: enforce sufficient descent even when iterations are truncated;
computational efficiency: work in the (dual) observation space whenever the number of observations is significantly smaller than the size of the state vector.
Preserve global convergence in the observation space!
Range-space reformulation
Use conjugate gradients (CG) to solve the step computation
$$\min_s J(s) \stackrel{\rm def}{=} \tfrac{1}{2}(x_s + s - x_b)^T B^{-1}(x_s + s - x_b) + \tfrac{1}{2}(Hs - d)^T R^{-1}(Hs - d).$$
Reformulate as a constrained problem:
$$\min_{s,a}\ \tfrac{1}{2}(x_s + s - x_b)^T B^{-1}(x_s + s - x_b) + \tfrac{1}{2}\, a^T R^{-1} a \quad\text{such that}\quad a = Hs - d.$$
Write the KKT conditions of this problem (for $x_s = x_b$, wlog):
$$(R + H B H^T)\lambda = d, \qquad s = B H^T \lambda, \qquad a = -R\lambda.$$
Precondition (first level) by $R$, forget $a$:
$$(I + R^{-1} H B H^T)\lambda = R^{-1} d, \qquad s = B H^T \lambda.$$
Solve the system using (preconditioned) CG in the $H B H^T$ inner product: RPCG (Gratton, Tshimanga, 2009)
RPCG and preconditioning
Features of RPCG:
algorithmic form comparable to PCG (with additional products)
only uses vectors of size = number of observations
suitable for reorthogonalization (if useful)
sequence of iterates identical to that generated by PCG on the primal
good descent properties on J(s)!
numerically stable for range-space perturbations
any (2nd-level) preconditioner F for the primal PCG translates to a preconditioner G for the range-space formulation iff $F H^T = B H^T G$
Note: works for limited-memory preconditioners
Numerical performance on ocean data assimilation?
1. NEMOVAR: global 3D-Var system, seasonal forecasting, ocean reanalysis
2. ROMS California Current Ocean Modeling: regional 4D-Var system, coastal applications
Implementation details: m ≈ 10^5 and n ≈ 10^6. The model variables: temperature, height, salinity and velocity. 1 outer Gauss-Newton loop (k = 1), 40 inner loops.
Gürol, Weaver, Moore, Piacentini, Arango, Gratton, 2013. Quarterly Journal of the Royal Meteorological Society, in press.
good numerical performance
reorthogonalization sometimes necessary
In the nonlinear setting
When solving the general 4DVAR problem: several (outer) Gauss-Newton iterations using a (range-space) trust-region framework.
Question: can one reuse the range-space preconditioner from the previous iteration, i.e. does $F_{k-1} H_k^T = B H_k^T G_{k-1}$ hold?
In general: no!
Some fixes
The old preconditioner is no longer symmetric in the new metric! How to get around this problem:
1. avoid (2nd-level) preconditioning? (last-resort decision)
2. recompute the preconditioner in the new metric? (possible with limited-memory preconditioners, but costly)
3. ignore the problem and take one's chances? (only reasonable if convergence can still be guaranteed)
Taking measured risks...
A simple proposal for computing the step:
1. compute the Cauchy step (first step of RPCG)
2. if negative curvature, recompute a complete step without the 2nd-level preconditioner
3. otherwise, use the equivalence with the primal to check simple decrease of f at the Cauchy point
4. if no decrease, then unsuccessful TR outer iteration
5. otherwise, continue RPCG with the old preconditioner
6. if unsuccessful, go back to the Cauchy point
Can be interpreted as a TR "magical step".
S. Gratton, S. Gürol and Ph. L. Toint, 2013. Preconditioning and globalizing conjugate gradients in dual space for quadratically penalized nonlinear least-squares problems. Computational Optimization and Applications 54: 1-25.
S. Gürol, PhD thesis, 2013.
Does it work? (1)
Numerical experiment with a nonlinear heat equation.
(Figures: convergence for f(x) = exp[1x] and f(x) = exp[2x].)
Does it work? (2)
(Figures: convergence for f(x) = exp[3x], two panels.)
The global convergence is ensured by using the risky trust-region algorithm. This algorithm requires one additional function evaluation at the Cauchy point per outer loop.
Conclusions for Section 1
Multigrid approach useful for handling the background covariance matrix
Range-space approach efficient in some data assimilation problems
Suitable 2nd-level preconditioners can be built
Potential symmetry problem solved without compromising convergence
Already in use in real operational systems
Observation thinning
In many applications, there are too many observations in some parts of the domain! The observations can be organized into a nested hierarchy $\{O_i\}_{i=0}^r$ with $O_i \subset O_{i+1}$, $i = 0,\ldots,r-1$ (coarse vs fine). Can we exploit this to reduce computations?
The coarse and fine subproblems
The fine (sub)problem:
$$\min_s\ \tfrac{1}{2}\|x + s - x_b\|^2_{B^{-1}} + \tfrac{1}{2}\|H_f s - d_f\|^2_{R_f^{-1}}$$
The coarse (sub)problem:
$$\min_s\ \tfrac{1}{2}\|x + s - x_b\|^2_{B^{-1}} + \tfrac{1}{2}\|\Gamma_f(H_f s - d_f)\|^2_{R_c^{-1}}$$
where $\Gamma_f$ is the restriction from fine to coarse observations. Moreover:
fine problem formulation ⇒ fine multiplier $\lambda_f$
coarse problem formulation ⇒ coarse multiplier $\lambda_c$
A useful error bound
Question: what is the difference between the fine and coarse multipliers?
If $\Pi_c = \sigma\Gamma_f^T$ is the prolongation from the coarse observations to the fine ones, then
$$\|\lambda_f - \Pi_c\lambda_c\|_{R_f + H_f B H_f^T} \;\le\; \|d_f - H_f s_c - R_f \Pi_c\lambda_c\|_{R_f^{-1}}$$
(proof somewhat technical...)
Uses $d_f$ but no computed quantity at the fine level. Observation $i$ is useful if the $i$-th component of $\lambda_f - \Pi_c\lambda_c$ is large.
How to exploit this?
Idea: starting from the coarsest observation set, and until the finest observation set is used:
1. solve the coarse problem for $(s_c, \lambda_c)$
2. define a finer auxiliary problem by moving up in the hierarchy of observation sets (i.e. consider finer auxiliary observations)
3. use the theorem to estimate the distance from $\Pi_c\lambda_c$ to the auxiliary multiplier $\lambda_{\rm aux}$
4. using this, select a subset of the auxiliary observations whose impact represents the impact of all these observations well enough (thinning)
5. redefine this selection as the next coarse observation set and loop
(needs: a more formal definition of the observation hierarchy + a selection procedure)
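Steps 1-4 can be sketched on a toy linear problem with a hypothetical nested pair of observation sets; the per-observation score is the componentwise residual appearing in the error bound of the previous slide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, mf = 20, 15                                # toy state and fine-obs sizes
A = rng.standard_normal((n, n)); B = A @ A.T + n * np.eye(n)
Hf = rng.standard_normal((mf, n)); Rf = np.eye(mf)
df = rng.standard_normal(mf)

idx_c = np.arange(0, mf, 3)                   # coarse set = every third obs
Gf = np.eye(mf)[idx_c]                        # restriction Gamma_f
Hc, Rc, dc = Gf @ Hf, Gf @ Rf @ Gf.T, Gf @ df

# step 1: solve the coarse problem in dual form for (s_c, lambda_c)
lam_c = np.linalg.solve(Rc + Hc @ B @ Hc.T, dc)
s_c = B @ Hc.T @ lam_c
# steps 2-3: prolong lambda_c to the fine set and score each fine
# observation by its component of the residual from the error bound
Pi_c = Gf.T                                   # prolongation (sigma = 1)
score = np.abs(df - Hf @ s_c - Rf @ (Pi_c @ lam_c))
# step 4: next coarse set = current set plus the highest-scoring fine obs
keep = np.union1d(idx_c, np.argsort(score)[-4:])
```

With $\Pi_c = \Gamma_f^T$ and $R_c = \Gamma_f R_f \Gamma_f^T$, the bound of the previous slide holds exactly in this setting, which the accompanying check verifies.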
An example of observation sets
(Figures: the coarse set, the auxiliary set and the selected fine set.)
Example 1: The Lorenz96 chaotic system (1)
Find $u(0)$, where $u$ has $N$ equally spaced entries around a circle, obeying
$$\frac{du_{j+\theta}}{dt} = \frac{1}{\kappa}\big(-u_{j+\theta-2}\,u_{j+\theta-1} + u_{j+\theta-1}\,u_{j+\theta+1} - u_{j+\theta} + F\big), \quad (j = 1,\ldots,400,\ \theta = 1,\ldots,120)$$
(Figures: the coordinate system; the initial state $(u_1(0), u_2(0),\ldots,u_N(0))$.)
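For reference, the standard (unscaled) Lorenz96 system can be integrated as follows; the slide's variant adds the time-scaling $1/\kappa$ and the index offset $\theta$, which we omit, and $F = 8$, $N = 40$ and the RK4 step are our illustrative choices:

```python
import numpy as np

def lorenz96_rhs(u, F=8.0):
    """Standard Lorenz96 right-hand side on a periodic ring:
    du_j/dt = (u_{j+1} - u_{j-2}) u_{j-1} - u_j + F."""
    return (np.roll(u, -1) - np.roll(u, 2)) * np.roll(u, 1) - u + F

def integrate(u0, dt=0.01, steps=500):
    """Fourth-order Runge-Kutta time stepping."""
    u = u0.copy()
    for _ in range(steps):
        k1 = lorenz96_rhs(u)
        k2 = lorenz96_rhs(u + 0.5 * dt * k1)
        k3 = lorenz96_rhs(u + 0.5 * dt * k2)
        k4 = lorenz96_rhs(u + dt * k3)
        u = u + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return u

N = 40
u0 = 8.0 * np.ones(N)
u0[0] += 0.01                      # small perturbation of the F-equilibrium
u = integrate(u0)
```

The rapid departure from the perturbed equilibrium illustrates the chaotic sensitivity to $u(0)$ that makes this a standard data-assimilation test case.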
Example 1: The Lorenz96 chaotic system (2)
(Figures: the system over space and time; the window of assimilation.)
Example 1: The Lorenz96 chaotic system (3)
(Figures: background and true initial u(0); true and computed u(0).)
Example 1: The Lorenz96 chaotic system (4)
An example of the transition from coarse to fine observation sets: $O_i \to O_{i+1}$
Example 1: The Lorenz96 chaotic system (5)
(Figures: cost vs observations; cost vs flops; RMS error versus time (last iteration).)
Example 2: 1D wave system with a shock (1)
Find $u_0(z)$ in
$$\frac{\partial^2 u(z,t)}{\partial t^2} - \frac{\partial^2 u(z,t)}{\partial z^2} + f(u) = 0, \quad 0 \le t \le T,\ 0 \le z \le 1,$$
$$u(0,t) = u(1,t) = 0, \qquad u(z,0) = u_0(z), \qquad \frac{\partial u}{\partial t}(z,0) = 0,$$
where $f(u) = \mu e^{\eta u}$
(360 grid points, $\Delta x \approx 2.8\times 10^{-3}$, $T = 1$ and $\Delta t = \tfrac{1}{64}$).
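A forward integration of this wave equation can be sketched with an explicit leapfrog scheme; note that the slide's $\Delta t = 1/64$ exceeds the CFL limit of an explicit scheme at this resolution (the actual experiments presumably use a different scheme), so we use a smaller stable step here, and $\mu$, $\eta$ and the initial pulse are hypothetical choices:

```python
import numpy as np

mu, eta = 0.01, 1.0                       # hypothetical nonlinearity parameters
nz = 360
z = np.linspace(0.0, 1.0, nz)
dz = z[1] - z[0]
dt = 0.5 * dz                             # CFL-stable step for this explicit scheme
nt = int(round(1.0 / dt))                 # integrate up to T = 1
f = lambda u: mu * np.exp(eta * u)

u0 = np.sin(np.pi * z)                    # hypothetical initial condition u_0(z)
u_prev = u0.copy()
# first step from u_t(z,0) = 0 via a Taylor expansion
lap = np.zeros(nz)
lap[1:-1] = (u0[2:] - 2 * u0[1:-1] + u0[:-2]) / dz**2
u = u0 + 0.5 * dt**2 * (lap - f(u0))
u[0] = u[-1] = 0.0
# leapfrog: u_tt = u_zz - f(u)
for _ in range(nt - 1):
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dz**2
    u_next = 2 * u - u_prev + dt**2 * (lap - f(u))
    u_next[0] = u_next[-1] = 0.0          # Dirichlet boundary conditions
    u_prev, u = u, u_next
```

For small $\mu$ the solution stays close to the linear standing wave $\sin(\pi z)\cos(\pi t)$; larger forcing steepens the profile toward the shock behaviour of the example.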
Example 2: 1D wave system with a shock (2)
(Figures: the initial $x_0 = u_0(z)$; the system over space and time.)
Example 2: 1D wave system with a shock (3)
(Figures: background and true initial u(0); true and computed u(0).)
Example 2: 1D wave system with a shock (4)
An example of the transition from coarse to fine observation sets: $O_i \to O_{i+1}$
Example 2: 1D wave system with a shock (5)
(Figures: cost vs observations; cost vs flops; RMS error versus time.)
Conclusions for Section 2
Combining the advantages of the dual approach with adaptive observation thinning is possible
Observation thinning can produce faster solutions
Observation thinning can produce more accurate solutions
Reuse of the selected data sets along the nonlinear optimization?
Use this idea for the design of observation campaigns?
Final comments
large number of algorithmic challenges in 4DVar data assimilation (also true for other approaches such as ensemble methods)
working in the (possibly thinned) dual space can be very advantageous
some numerical-analysis expertise is truly useful
dialog with practitioners not always easy (requires use of real models)
from-scratch re-examination of the computational chain necessary?
Thank you for your attention!
Reading
S. Gratton, Ph. L. Toint and J. Tshimanga, Conjugate-gradients versus multigrid solvers for diffusion-based correlation models in data assimilation, Quarterly Journal of the Royal Meteorological Society, vol. 139, pp. 1481-1487, 2013.
S. Gratton, S. Gürol and Ph. L. Toint, Preconditioning and globalizing conjugate gradients in dual space for quadratically penalized nonlinear least-squares problems, Computational Optimization and Applications, vol. 54(1), pp. 1-25, 2013.
S. Gratton, Ph. L. Toint and J. Tshimanga, Range-space variants and inexact matrix-vector products in Krylov solvers for linear systems arising from inverse problems, SIAM Journal on Matrix Analysis and Applications, vol. 32(3), pp. 969-986, 2011.
S. Gratton, M. Rincon-Camacho, E. Simon and Ph. L. Toint, Observation thinning in data assimilation, EURO Journal on Computational Optimization, vol. 3, pp. 31-51, 2015.