Optimal solution error covariance in highly nonlinear problems of variational data assimilation

The problem of variational data assimilation (DA) for a nonlinear evolution model is formulated as an optimal control problem to find the initial condition, boundary conditions and/or model parameters. The input data contain observation and background errors, hence there is an error in the optimal solution. For mildly nonlinear dynamics, the covariance matrix of the optimal solution error can be approximated by the inverse Hessian of the cost function. For problems with strongly nonlinear dynamics, a new statistical method based on the computation of a sample of inverse Hessians is suggested. This method relies on the efficient computation of the inverse Hessian by means of iterative methods (Lanczos and quasi-Newton BFGS) with preconditioning. Numerical examples are presented for the model governed by the Burgers equation with a nonlinear viscous term.


Introduction
State and/or parameter estimation for dynamical geophysical flow models is an important problem in meteorology and oceanography.Among the few methods feasible for solving these non-stationary large-scale problems, the variational data assimilation (DA) method, called "4D-Var", is the preferred method implemented at some major operational centers (e.g.Courtier et al., 1994;Fisher et al., 2009).From the mathematical point of view, these problems can be formulated as optimal control problems (e.g.Lions, 1986;Le Dimet and Talagrand, 1986) to find unknown control variables in such a way that a cost function related to the observation and a priori data takes its minimum value.A necessary optimality condition leads to the so-called optimality system, which contains all the available information and involves the original and adjoint models.Due to the input errors (background and observation errors), there is an error in the op-timal solution.Its statistical properties are very important for quantifying the accuracy of the optimal solution (which is necessary to evaluate the quality of the forecast), for sequential variational state estimation and optimal design of observation schemes.Assuming that the probability density function (p.d.f.) of the optimal solution error can be reasonably approximated by the normal (Gaussian) distribution, the optimal solution error covariance matrix (referred to below simply as "covariance") is its most important statistic to be estimated.If the errors of the input data are random and normally distributed, then for a linearized finite-dimensional error evolution model, the covariance is given by the inverse Hessian of the cost function (e.g.Thacker, 1989;Rabier and Courtier, 1992).This is an extension of a well-known result from nonlinear regression (Draper and Smith, 1981) to the case of nonlinear dynamical systems.A similar result in the continuous case was presented by Gejadze et al. (2008).In terms of continuous representation, it is said that the covariance operator can be approximated by the inverse Hessian of the auxiliary control problem based on the tangent linear model (TLM) constraints, if the so-called tangent linear hypothesis (TLH) is valid.The TLH implies that the error dynamics can be satisfactorily described by the TLM.It was demonstrated by Gejadze et al. (2010Gejadze et al. ( , 2011) ) that approximation of the covariance by the inverse Hessian could be sometimes sufficiently accurate even though the TLH is not valid.However, in the case of highly nonlinear dynamics such an approximation may not be valid at all (see, for example, Pires et al., 1996).In the present paper, for the case under consideration, we do the following: (a) present an argument that even in this case the p.d.f. of the optimal solution error may still be represented by a normal distribution defined by the covariance matrix; (b) outline a new method for estimation of the covariance; (c) discuss implementation potentially feasible for large-scale dynamical models.One of the objectives of this paper is to highlight the concept of Published by Copernicus Publications on behalf of the European Geosciences Union & the American Geophysical Union.

V. Shutyaev et al.: Optimal solution error covariance
the Effective Inverse Hessian (EIH), first introduced by Gejadze et al. (2011), to the geophysical research community.The closest concept to this is probably the Expected Fisher Information Matrix used in Bayesian estimation theory.

Statement of the problem
Consider the mathematical model of a physical process that is described by the evolution problem ∂ϕ ∂t = F (ϕ), t ∈ (0,T ) where ϕ = ϕ(t) is the unknown function belonging for any t to a Hilbert space X, u ∈ X, F is a nonlinear operator mapping X into X.Let Y = L 2 (0,T ;X) be a space of abstract functions ϕ(t) with values in X, with the norm . Suppose that for a given u ∈ X there exists a unique solution ϕ ∈ Y to Eq. ( 1).
Let ū be the "exact" initial state and φ -the solution to the problem Eq. ( 1) with u = ū, i.e. the "exact" state evolution.We define the input data as follows: the background function In particular, Y o may be finite-dimensional (both in space and in time).The random variables ξ b and ξ o may be regarded as the background and the observation error, respectively.Assuming that these errors are normally distributed, unbiased and mutually uncorrelated, we define the covariance operators where "•" denotes an argument of the respective operator, and E is the expectation.We suppose that V b and V o are positive definite, hence invertible.
Let us introduce a cost function J (u) and formulate the following DA problem (optimal control problem) with the aim to identify the initial condition: find u ∈ X and ϕ ∈ Y such that they satisfy Eq. ( 1) and the cost function J (u) takes its minimum value.Further we assume that the optimal solution error δu = u − ū is unbiased, i.e.E[δu] = 0, with the covariance operator Let us introduce the operator R : X → Y o as follows where ψ ∈ Y is the solution of the tangent linear problem For a given v we solve the problem Eq. ( 4), and then find Rv by Eq. ( 3).The definition of R involves ϕ = φ + δϕ dependent on u = ū + δu via Eq.( 1), thus we can write as follows: R = R( ū,δu).It has been shown in (Gejadze et al., 2008) that the optimal solution error δu = u − ū and data errors ξ b and ξ o are related via the following exact operator equation where R * is the adjoint to R and τ ∈ [0,1] is a parameter chosen to make the truncated Taylor series exact. Let ) be the Hessian of the linearized (auxiliary) control problem (Gejadze et al., 2008).Under the hypothesis that F is twice continuously Fréchet differentiable, the error Eq. ( 5) is approximated by: From Eq. ( 6) it is easy to see that This is a well-established result (Courtier et al., 1994;Rabier and Courtier, 1992;Thacker, 1989), which is usually deduced (without considering Eq. 5) by straightforwardly linearizing the original nonlinear DA problem Eqs. ( 1)-( 2) under the assumption that which is called the "tangent linear hypothesis".It is said that V δu can be approximated by [H ( ū)] −1 if the TLH Eq. ( 8) is valid.That usually happens if the nonlinearity is mild and/or the error δu and, subsequently, δϕ are small.We derive Eq. ( 7) via Eq.( 5).From this derivation one can see that the accuracy of Eq. ( 7) depends on the accuracy of the approximations R( ū,τ δu) ≈ R( ū,0) and R * ( ū,δu) ≈ R * ( ū,0) in Eq. ( 5).Clearly, the transition from Eq. (5) to Eq. ( 6) could still be valid even though Eq. ( 8) is not satisfied.
As already mentioned, we can use formula Eq. ( 7) if the TLH is valid and, in some cases beyond the range of its validity.In the general case, however, one may not expect H −1 ( ū) always to be a satisfactory approximation to V δu .In Fig. 1 we present a specially designed example for the evolution model governed by the 1-D Burgers equation (for details see Sect. 4).The difference between the reference value of the variance (circles) and the inverse Hessian based value (bold solid line) can be clearly seen within the ellipse.The reference variance is obtained by a direct Monte Carlo simulation.
Since R * ( ū,0) and H ( ū) in Eq. ( 6) are linear operators and we assume that errors ξ b and ξ o are unbiased and normally distributed, then δu ∼ N (0,V δu ).Clearly, this result is valid as far as the TLH and consequently Eq. ( 6) itself are satisfied.However, for highly nonlinear dynamical models the TLH often breaks down (e.g.Pires et al., 1996); thus, we have to Since R * (ū,0) and H(ū) in ( 6) are linear operators and 160 we assume that errors ξ b and ξ o are unbiased and normally distributed, then δu ∼ N (0,V δu ).Clearly, this result is valid as far as the TLH and consequently (6) itself are satisfied.However, for highly nonlinear dynamical models the TLH often breaks down (e.g., Pires et al., 1996); thus, we have to answer the following question: can the p.d.f. of δu still be approximated by the normal distribution?If the answer is positive, one should look for a better approximation of the covariance than that given by (7).
Let us consider the cost function (2), but without the 170 background term.The corresponding error equation ( 5) is then as follows: For a univariate case, the classical result (see (Jennrich, 1969)) is that δu is asymptotically normal if ξ o is an 175 independent identically distributed (i.i.d.) random variable with E[ξ o ] = 0 and E[ξ 2 o ] = σ 2 < ∞ ('asymptotically' means that T → ∞ given the finite observation time step dt, or dt → 0 given the finite observation window [0,T ]).Let us stress that for the asymptotic normality of δu the error 180 ξ o is not required to be normal.This original result has been generalized to the multivariate case and to the case of dependent, yet identically distributed observations (White and Domowitz, 1984), whereas an even more general case is considered in (Yuan and Jennrich, 1998).Here we 185 consider the complete cost function (2) and, correspondingly, the error equation ( 5), which contains terms related to the background term.To analyze a possible impact of these terms let us follow the reasoning in (Amemiya, 1983), pp. 337-345, where the error equation equivalent to (9) is 190 derived in a slightly different form.It is concluded that the error δu is asymptotically normal when: a) the righthand side of the error equation is normal; b) the left-hand side matrix converges in probability to a non-random value.
These conditions are met under certain general regularity 195 requirements to the operator R, which are incomparably weaker than the TLH and do not depend on the magnitude of the input errors.Clearly, as applied to (5), the first condition holds if ξ b is normally distributed.Since V −1 b is a constant matrix, the second condition always holds as long 200 as it holds for R * (ū,δu)V −1 o R(ū,τ δu).Therefore, one may conclude that δu from ( 5) is bound to remain asymptotically normal.In practice the observation window [0,T ] and time step dt are always finite implying the finite number of i.i.d.observations.Moreover, it is not easy to assess how 205 large the number of observations must be for the desired asymptotic properties to be reasonably approximated.Some nonlinear least-square problems in which the normality of the estimation error holds for 'practically relevant' sample sizes are said to exhibit a 'close-to-linear' statistical behavior 210 (Ratkowsky, 1983).The method suggested in (Ratkowsky, 1983) to verify this behavior is, essentially, a normality test applied to a generated sample of optimal solutions, which is hardly feasible for large-scale applications.Nevertheless, for certain highly nonlinear evolution models it is reasonable to 215 expect that the distribution of δu might be reasonably close to normal if the number of i.i.d.observations is significant in time and the observation network is sufficiently dense in space.This may happen in assimilation of long time series of satellite observations of ocean surface elevation and 220 temperature, for example.

General consideration
Here we present a new method for estimating the covariance V δu to be used in the case of highly nonlinear dynamics,

225
when [H(ū)] −1 is not expected to be a good approximation of V δu .Let us consider the discretized nonlinear error equation ( 5) and denote by H the left-hand side operator in (5).Then we can write down the expression for δu: whereas for the covariance V δu we obtain as follows: As a result of a series of simplifications described in (Gejadze et al., 2011) the above equation can be reduced to the form 235 where is the Hessian of the linearized (auxiliary) control problem.The right-hand side of (11) may be called the effective inverse Hessian (EIH), hence the name of the suggested method.answer the following question: can the p.d.f. of δu still be approximated by the normal distribution?If the answer is positive, one should look for a better approximation of the covariance than that given by Eq. ( 7).
Let us consider the cost function Eq. ( 2), but without the background term.The corresponding error equation Eq. ( 5) is then as follows: For a univariate case, the classical result (see Jennrich, 1969) is that δu is asymptotically normal if ξ o is an independent identically distributed (i.i.d.) random variable with E[ξ o ] = 0 and E[ξ 2 o ] = σ 2 < ∞ ("asymptotically" means that T → ∞ given the finite observation time step dt, or dt → 0 given the finite observation window [0,T ]).Let us stress that for the asymptotic normality of δu, the error ξ o is not required to be normal.This original result has been generalized to the multivariate case and to the case of dependent, yet identically distributed observations (White and Domowitz, 1984), whereas an even more general case is considered in (Yuan and Jennrich, 1998).Here we consider the complete cost function Eq. ( 2) and, correspondingly, the error Eq. ( 5), which contains terms related to the background term.To analyze a possible impact of these terms let us follow the reasoning in (Amemiya, 1983), pp.337-345, where the error equation equivalent to Eq. ( 9) is derived in a slightly different form.It is concluded that the error δu is asymptotically normal when: (a) the right-hand side of the error equation is normal; (b) the left-hand side matrix converges in probability to a non-random value.These conditions are met under certain general regularity requirements to the operator R, which are incomparably weaker than the TLH and do not depend on the magnitude of the input errors.Clearly, as applied to Eq. ( 5), the first condition holds if ξ b is normally distributed.Since V −1 b is a constant matrix, the second condition always holds as long as it holds for R * ( ū,δu)V −1 o R( ū,τ δu).Therefore, one may conclude that δu from Eq. ( 5) is bound to remain asymptotically normal.In practice the observation window [0,T ] and time step dt are always finite implying the finite number of i.i.d.observations.Moreover, it is not easy to assess how large the number of observations must be for the desired asymptotic properties to be reasonably approximated.Some nonlinear least-square problems, in which the normality of the estimation error holds for "practically relevant" sample sizes, are said to exhibit a "close-to-linear" statistical behavior (Ratkowsky, 1983).The method suggested in (Ratkowsky, 1983) to verify this behavior is, essentially, a normality test applied to a generated sample of optimal solutions, which is hardly feasible for large-scale applications.
Nevertheless, for certain highly nonlinear evolution models, it is reasonable to expect that the distribution of δu might be reasonably close to normal if the number of i.i.d.observations is significant in time and the observation network is sufficiently dense in space.This may happen in assimilation of long time series of satellite observations of ocean surface elevation and temperature, for example.

General consideration
Here we present a new method for estimating the covariance V δu to be used in the case of highly nonlinear dynamics, when [H ( ū)] −1 is not expected to be a good approximation of V δu .Let us consider the discretized nonlinear error equation Eq. ( 5) and denote by H the left-hand side operator in Eq. ( 5).Then we can write down the expression for δu whereas for the covariance V δu we obtain as follows: As a result of a series of simplifications described in (Gejadze et al., 2011) the above equation can be reduced to the form where is the Hessian of the linearized (auxiliary) control problem.The righthand side of Eq. ( 11) may be called the effective inverse Hessian (EIH), hence the name of the suggested method.In order to compute V directly using this equation, the expectation is substituted by the sample mean: The main difficulty with the implementation is a need to compute a sample of optimal solutions u l = ū + δu l .However, formula Eq. ( 11) does not necessarily require u l to be an optimal solution.If we denote by q δu the p.d.f. of δu, then equation Eq. ( 11) can be rewritten in the form: If we assume that in our nonlinear case the covariance matrix V describes meaningfully the p.d.f. of the optimal solution error, then, with the same level of validity, we should also accept the pdf q δu to be approximately normal with zero expectation and the covariance V , in which case we obtain where |V | 1/2 .Formula Eq. ( 12) gives V explicitly, but requires a sample of optimal solutions u l , l = 1,...,L to be computed.In contrast, the latest expression is a nonlinear matrix integral equation with respect to V , while v is a dummy variable.This equation is actually solved using the iterative process Eq. ( 19), as explained in the following section.It is also interesting to notice that Eq. ( 14) is a deterministic equation.

Implementation remarks
Remark 1. Preconditioning is used in variational DA to accelerate the convergence of the conjugate gradient algorithm at the stage of inner iterations of the Gauss-Newton (GN) method, but it also can be used to accelerate formation of the inverse Hessian by the Lanczos algorithm (Fisher et al., 2009) or by the BFGS (Gejadze et al., 2010).Since H is selfadjoint, we must consider a projected Hessian in a symmetric form with some operator B : X → X, defined in such a way that the eigenspectrum of the projected Hessian H is clustered around 1, i.e. the majority of the eigenvalues of H are equal or close to 1. Since the condition number of H is supposed to be much smaller than the condition number of H , a sensible approximation of H −1 can usually be obtained (either by Lanczos or BFGS) with a relatively small number of iterations.After that, having H −1 , one can easily recover H −1 using the formula: Assuming that B −1 does not depend on δu l , we substitute Eq. ( 15) into Eq.( 12) and obtain the version of Eq. ( 12) with preconditioning: Similarly, assuming that B −1 does not depend on the variable of integration, we substitute Eq. ( 15) into Eq.( 14) and obtain the version of Eq. ( 14) with preconditioning: Formulas Eq. ( 16) and Eq. ( 17) instead of H −1 involve H −1 which is much less expensive to compute and store in memory.Let us mention here that the EIH method would hardly be feasible for large-scale problems without appropriate preconditioning.
Remark 2. The nonlinear Eq. ( 17) can be solved, for example, by the fixed point iterative process as follows for p = 0,1,..., starting with V 0 = [H ( ū)] −1 .The iterative processes of this type are expected to converge if V 0 is a good initial approximation of V , which is the case in the considered examples.The convergence of Eq. ( 18) and other methods for solving equation Eq. ( 17) are subjects for future research.
Remark 3. Different methods can be used for evaluation of the multidimensional integral in Eq. ( 18) such as quasi-Monte Carlo (Neiderreiter, 1992).Here, for simplicity, we use the standard Monte Carlo method.This actually implies a return to the formula Eq. ( 16).Taking into account Eq. ( 15), the iterative process takes the form where δu p l ∼ N (0,V p ).For each l, we compute δu p l as follows δu p l = (V p ) 1/2 ξ l , where ξ ∼ N (0,I ) is an independent random series, I is the identity matrix and (V p ) 1/2 is the square root of V p .One can see that for each p the last formula looks similar to Eq. ( 16) with one key difference: δu p l in Eq. ( 19) is not an optimal solution, but a vector having the statistical properties of the optimal solution.
Remark 4. Let us notice that a few tens of outer iterations by the GN method may be required to obtain one optimal solution, while an approximate evaluation of H −1 is equivalent (in terms of computational costs) to just one outer iteration of the GN method.One has to repeat these computations p times, however, only a few iterations on index p are required in practice.Therefore, one should expect an order of the magnitude reduction of computational costs by the method Eq. ( 19) as compared to Eq. ( 16) for the same sample size.Clearly, for realistic large-scale models, the sample size L is going to be limited.Probably, the minimum ensemble size for this method to work is 2L * + 1, where L * is the accepted number of leading eigenvectors of V p in Eq. ( 19).
Remark 5.In order to implement the process Eq. ( 19) a sample of vectors ϕ l (x,0) = δu p l must be propagated from t = 0 to t = T using the nonlinear model Eq.(1).Therefore, for each p one gets a sample of final states ϕ l (x,T ) consistent with the current approximation of V p , which can be used to evaluate the forecast and forecast covariance.Since V p is a better approximation of the analysis error covariance than simply [H ( ū)] −1 , one should expect a better quality of the forecast and covariance (as being consistent with V p , rather than with [H ( ū)] −1 ).

Numerical model
As a model we use the 1D Burgers equation with a nonlinear viscous term: The nonlinear diffusion term with µ(ϕ) dependent on ∂ϕ/∂x is introduced to mimic the eddy viscosity (turbulence), which depends on the field gradients (pressure, temperature), rather than on the field value itself.This type of µ(ϕ) also allows us to formally qualify the problem Eqs. ( 20)-( 22) as strongly nonlinear (Fučik and Kufner, 1980).Let us mention that the Burgers equations are sometimes considered in DA context as a simple model describing the atmospheric flow motion.We use the implicit time discretization as follows where i = 1,...,N is the time integration index, h t = T /N is the time step.The spatial operator is discretized on a uniform grid (h x is the spatial discretization step, j = 1,...,M is the node number, M is the total number of grid nodes), using the "power law" first-order scheme as described in (Patankar, 1980), which yields quite a stable discretization scheme (this Remark 4. Let us notice that a few tens of outer iterations by the GN method may be required to obtain one optimal solution, while an approximate evaluation of H−1 is equivalent (in terms of computational costs) to just one outer iteration of the GN method.One has to repeat these computations p times, however only a few iterations on index p are required in practice.Therefore, one should expect an order of the magnitude reduction of computational costs by the method (19) as compared to ( 16) for the same sample size.Clearly, for realistic large-scale models the sample size L is going to be limited.Probably, the minimum ensemble size for this method to work is 2L * + 1 , where L * is the accepted number of leading eigenvectors of V p in (19).
Remark 5.In order to implement the process (19) a sample of vectors ϕ l (x,0) = δu p l must be propagated from t = 0 to t = T using the nonlinear model (1).Therefore, for each p one gets a sample of final states ϕ l (x,T ) consistent with the current approximation of V p , which can be used to evaluate the forecast and forecast covariance.Since V p is a better approximation of the analysis error covariance than simply [H(ū)] −1 , one should expect a better quality of the forecast and covariance (as being consistent with V p , rather 345 than with [H(ū)] −1 ).

Numerical model
As a model we use the 1D Burgers equation with a nonlinear viscous term: and the viscosity coefficient The nonlinear diffusion term with µ(ϕ) dependent on ∂ϕ/∂x is introduced to mimic the eddy viscosity (turbulence), which depends on the field gradients (pressure, temperature), rather than on the field value itself.This type of µ(ϕ) also allows us to formally qualify the problem ( 20)-( 22) as strongly nonlinear (Fučik and Kufner, 1980).Let us mention that the Burgers equations are sometimes considered in DA context as a simple model describing the atmospheric flow motion.We use the implicit time discretization as follows: where i = 1,...,N is the time integration index, h t = T /N is the time step.The spatial operator is discretized on a uniform grid (h x is the spatial discretization step, j = 1,...,M is the node number, M is the total number of grid nodes) using 370 the 'power law' first-order scheme as described in (Patankar, 1980), which yields quite a stable discretization scheme (this scheme allows µ(ϕ) to be as small as 0.5 × 10 −4 for M = 200 without noticeable oscillations).For each time step we perform nonlinear iterations on the coefficients w(ϕ) = ϕ 375 and µ(ϕ) in the form for n = 1,2,..., assuming initially that µ(ϕ i 0 ) = µ(ϕ i−1 ) and w(ϕ i 0 ) = ϕ i−1 , and keep iterating until ( 23) is satisfied (i.e. the norm of the left-hand side in (23) becomes smaller than 380 the threshold ǫ 1 = 10 −12 √ M ).In all the computations presented in this paper we use the following parameters: the observation period T = 0.312, the discretization steps h t = 0.004, h x = 0.005, the state vector dimension M = 200, and the parameters in ( 22) µ 0 = 10 −4 , µ 1 = 10 −6 .

385
A general property of the Burgers solutions is that a smooth initial state evolves into a state characterized by the areas of severe gradients (or even shocks in the inviscid case).These are precisely the areas of a strong nonlinearity where one might expect violations of the TLH and, subsequently, 390 the invalidity of (7).For numerical experiments we choose a certain initial condition which stimulates the highly nonlinear behavior of the system; this is given by the formula scheme allows µ(ϕ) to be as small as 0.5 × 10 −4 for M = 200 without noticeable oscillations).For each time step we perform nonlinear iterations on the coefficients w(ϕ) = ϕ and µ(ϕ) in the form for n = 1,2,..., assuming initially that µ(ϕ i 0 ) = µ(ϕ i−1 ) and w(ϕ i 0 ) = ϕ i−1 , and keep iterating until Eq. ( 23) is satisfied (i.e. the norm of the left-hand side in Eq. ( 23) becomes smaller than the threshold 1 = 10 −12 √ M).In all the computations presented in this paper we use the following parameters: the observation period T = 0.312, the discretization steps h t = 0.004, h x = 0.005, the state vector dimension M = 200, and the parameters in Eq. ( 22) µ 0 = 10 −4 , µ 1 = 10 −6 .
A general property of the Burgers solutions is that a smooth initial state evolves into a state characterized by the areas of severe gradients (or even shocks in the inviscid case).These are precisely the areas of a strong nonlinearity where one might expect violations of the TLH and, subsequently, the invalidity of Eq. ( 7).For numerical experiments we choose a certain initial condition that stimulates the highly nonlinear behavior of the system; this is given by the formula: The resulting field evolution ϕ(x,t) is presented in Fig. 2.

BFGS for computing the inverse Hessian and other details
The projected inverse Hessian H ( ū + δu) is computed as a collateral result of the BFGS iterations while solving the members and into twenty five subsets including L = 100 members.Let us denote by VL the sample covariance matrix obtained for a subset including L members.Then, the relative error in the sample variance (which is the relative 440 sampling error) can be defined as the vector εL with the components ..,M.The relative error in a certain approximation of V is defined as a vector ε with the components We compute this error with V in ( 27) being estimated by one of the following methods: 1) by the inverse Hessian method, i.e. simply using lines 2 and 3 -to the methods 2a and 2b (variants of the EIH method).following auxiliary DA problem: where The preconditioner used in our method is In order to compute [ H ( ū)] −1/2 we apply the Cholesky factorization of the explicitly formed matrix H −1 .However, it is important to note that the square-root-vector product H −1/2 w can be computed using a recursive procedure based on the accumulated secant pairs (BFGS) or eigenvalues/eigenvectors (Lanczos) as described in (Tshimanga et al., 2008), without the need to form H −1 and to factorize it.Consistent tangent linear and adjoint models have been generated from the original forward model by the Automatic Differentiation tool TAPENADE (Hascoët and Pascual, 2004) and checked using the standard gradient test.The background error covariance V b is computed assuming that the background error belongs to the Sobolev space W 2 2 [0,1] (see Gejadze et al., 2010, for details).The correlation function used in the numerical examples is as presented in Fig. 3, the background error variance is σ 2 b = 0.2, the observation error variance is σ 2 o = 10 −3 .The observation scheme consists of 4 sensors located at the points xk = 0.4, 0.45, 0.55, 0.6, and the observations are available at each time instant.

Numerical results
First we computed a large sample (L = 2500) of optimal solutions u l by solving L times the data assimilation problem Eqs. ( 1)-( 2) with perturbed data u b = ū+ξ b and y = C φ +ξ o , where ξ b ∼ N (0,V b ) and ξ o ∼ N 0,σ 2 o I .This large sample was used to evaluate the sample covariance matrix, which was further processed to filter the sampling error (as described in Gejadze et al., 2011); the outcome was considered as a reference value V • .Then, the original large sample was partitioned into one hundred subsets including L = 25 members and into twenty five subsets including L = 100 members.Let us denote by VL the sample covariance matrix obtained for a subset including L members.Then, the relative error in the sample variance (which is the relative sampling error) can be defined as the vector εL with the components: The relative error in a certain approximation of V is defined as a vector ε with the components: We compute this error with V in Eq. ( 27) being estimated by one of the following methods: 1. by the inverse Hessian method, i.e. simply using 2a. by the EIH method implemented in the form Eq. ( 16), which requires a sample of optimal solutions δu l to be computed; 2b. by the EIH method implemented as the iterative process Eq. ( 19), which requires a sample of δu l , but does not require that δu l are optimal solutions.
For the computation of V by the methods 2a or 2b a sample of δu l is required, hence, the result depends on the sample size L. The results (obtained by the methods 2a and 2b) presented in this paper are computed with L = 100.In the method 2b we currently allow enough iterations on the index p for the iterative process Eq. ( 19) to converge in terms of the distance between the successive iterates.In practice, this requires just a few iterations, typically 2 − 3.
In the upper panel in Fig. 4, a set of one hundred vectors ε25 is presented in dark lines, and a set of twenty five vectors ε100 -in the overlaying white lines.These plots reveal the envelopes for the relative error in the sample variance obtained with L = 25 and L = 100, respectively.The graphs of ε are presented in the lower panel: line 1 corresponds to the method 1 (the inverse Hessian method, see also Fig. 1), lines 2 and 3 -to the methods 2a and 2b (variants of the EIH method).Looking at Fig. 4 we observe that the relative error in the sample variance ε25 (dark envelope) exceeds 50% almost everywhere, which is certainly beyond reasonable margins, 475 and ε100 (white envelope) is around 25% (that is still fairly large).In order to reduce the white envelope two times, one would need to use the sample size L = 400, etc.One should also keep in mind that the relative error in the diagonal elements of the sample covariance matrix is the smallest 480 as compared to its sub-diagonals, i.e. the envelopes for any sub-diagonal would be wider than those presented in Fig. 4(up).Thus, the development of methods for estimating the covariance (alternative to the direct sampling method) is an important task.

485
Whereas the method 1 (the inverse Hessian method) gives an estimate of V δu with a small relative error (as compared to the sample covariance) in the areas of mild nonlinearity, this error can be much larger in the areas of high nonlinearity.For example, if we imagine that the lower panel in Fig. 4 is 490 superposed over its upper panel, then one could observe line 1 jumping outside the dark envelope in the area surrounding x = 0.5, i.e. the relative error by the inverse Hessian is significantly larger here than the sampling error for L = 25.At the same time, the relative error obtained by the 495 methods 2a and 2b is much smaller as compared to the error in line 1 and it would largely remain within the white envelope.The difference between the estimates by the methods 2a and 2b does not look significant.The best improvement can be achieved for the diagonal elements of 500 V δu (the variance).Thus, the covariance estimate by the EIH method is noticeably better than the sample covariance obtained with the equivalent sample size.The suggested algorithm is computationally efficient (in terms of the CPU time) if the cost of computing the inverse Hessian is much 505 less than the cost of computing one optimal solution.In the example presented in this paper one limited-memory inverse Hessian is about 20-30 times less expensive than one optimal solution.Thus, on average, the algorithm 2b works about 10 times faster than the algorithm 2a, whereas the results by 510 both the algorithms are similar in terms of accuracy.

Conclusions
Error propagation is a key point in modeling the large-scale geophysical flows, with the main difficulty being linked to the nonlinearity of the governing equations.In this paper 515 we consider the hind-cast (initialization) DA problem.From the mathematical point of view, this is the initial-value control problem for a nonlinear evolution model governed by partial differential equations.Assuming the so-called tangent linear hypothesis (TLH) holds, the covariance is 520 often approximated by the inverse Hessian of the objective function.In practice, the same approximation could be valid even though the TLH is clearly violated.However, here we deal with such a highly nonlinear dynamics that the inverse Hessian approach is no longer valid.In this case 525 a new method for computing the covariance matrix named the 'effective inverse Hessian' method can be used.This method yields a significant improvement in the covariance estimate as compared to the inverse Hessian.The method is potentially feasible for large-scale applications because it 530 can be used in a multiprocessor environment and operates in terms of the Hessian-vector products.The software blocks needed for its implementation are the standard blocks of any existing 4D Var system.All the results of this paper are consistent with the assumption of a 'close-to-normal' 535 nature of the optimal solution error.This should be expected taking into account the consistency and asymptotic normality of the estimator and the fact that the observation window in variational DA is usually quite large.In this case the covariance matrix is a meaningful representative of the 540 p.d.f.The method suggested may become a valuable option for uncertainty analysis in the framework of the classical 4D-VAR approach when applied to highly nonlinear DA problems.Looking at Fig. 4, we observe that the relative error in the sample variance ε25 (dark envelope) exceeds 50% almost everywhere, which is certainly beyond reasonable margins, and ε100 (white envelope) is around 25% (that is still fairly large).In order to reduce the white envelope two times, one would need to use the sample size L = 400, etc.One should also keep in mind that the relative error in the diagonal elements of the sample covariance matrix is the smallest as compared to its sub-diagonals, i.e. the envelopes for any subdiagonal would be wider than those presented in Fig. 4(up).Thus, the development of methods for estimating the covariance (alternative to the direct sampling method) is an important task.

Acknowledgements. The first author acknowledges the Russian
Whereas the method 1 (the inverse Hessian method) gives an estimate of V δu with a small relative error (as compared to the sample covariance) in the areas of mild nonlinearity, this error can be much larger in the areas of high nonlinearity.For example, if we imagine that the lower panel in Fig. 4 is superposed over its upper panel, then one could observe line 1 jumping outside the dark envelope in the area surrounding x = 0.5, i.e. the relative error by the inverse Hessian is significantly larger here than the sampling error for L = 25.
At the same time, the relative error obtained by the methods 2a and 2b is much smaller as compared to the error in line 1 and it would largely remain within the white envelope.The difference between the estimates by the methods 2a and 2b does not look significant.The best improvement can be achieved for the diagonal elements of V δu (the variance).Thus, the covariance estimate by the EIH method is noticeably better than the sample covariance obtained with the equivalent sample size.The suggested algorithm is computationally efficient (in terms of the CPU time) if the cost of computing the inverse Hessian is much less than the cost of computing one optimal solution.In the example presented in this paper one limited-memory inverse Hessian is about 20-30 times less expensive than one optimal solution.Thus, on average, the algorithm 2b works about 10 times faster than the algorithm 2a, whereas the results by both the algorithms are similar in terms of accuracy.

Conclusions
Error propagation is a key point in modeling the large-scale geophysical flows, with the main difficulty being linked to the nonlinearity of the governing equations.In this paper we consider the hind-cast (initialization) DA problem.From the mathematical point of view, this is the initial-value control problem for a nonlinear evolution model governed by partial differential equations.Assuming the so-called tangent linear hypothesis (TLH) holds, the covariance is often approximated by the inverse Hessian of the objective function.In practice, the same approximation could be valid even though the TLH is clearly violated.However, here we deal with such a highly nonlinear dynamics that the inverse Hessian approach is no longer valid.In this case, a new method for computing the covariance matrix, named the "effective inverse Hessian" method, can be used.This method yields a significant improvement in the covariance estimate as compared to the inverse Hessian.The method is potentially feasible for large-scale applications because it can be used in a multiprocessor environment and operates in terms of the Hessian-vector products.The software blocks needed for its implementation are the standard blocks of any existing 4-D Var system.All the results of this paper are consistent with the assumption of a "close-to-normal" nature of the optimal solution error.This should be expected, taking into account the consistency and asymptotic normality of the estimator and the fact that the observation window in variational DA is usually quite large.In this case the covariance matrix is a meaningful representative of the p.d.f.The method suggested may become a valuable option for uncertainty analysis in the framework of the classical 4D-VAR approach when applied to highly nonlinear DA problems.

Fig. 4 .
Fig. 4. Up: the sample relative error ε.Set of ε for L = 25dark envelope and set of ε for L = 100 -white envelope.Down: the relative error ε by the inverse Hessian -line 1, and by the EIH methods with L = 100: method 2a -line 2; method 2b -line 3.

545
Foundation for Basic Research and Russian Federal Research

Fig. 4 .
Fig. 4. Up: the sample relative error ε.Set of ε for L = 25 -dark envelope and set of ε for L = 100 -white envelope.Down: the relative error ε by the inverse Hessian -line 1, and by the EIH methods with L = 100: method 2a -line 2; method 2b -line 3.