Quasi static ensemble variational data assimilation

The analysis in nonlinear variational data assimilation is the solution of a non-quadratic minimization. Thus, the analysis efficiency relies on its ability to locate a global minimum of the cost function. If this minimization uses a GaussNewton (GN) method, it is critical for the starting point to be in the attraction basin of a global minimum. Otherwise the method may converge to a local extremum, which degrades the analysis. With chaotic models, the number of local extrema often increases with the temporal extent of the data assimilation window, making the former condition harder to satisfy. This 5 is unfortunate because the assimilation performance also increases with this temporal extent. However, a quasi-static (QS) minimization may overcome these local extrema. It consists in gradually injecting the observations in the cost function. This method was introduced by Pires et al. (1996) in a 4D-Var context. We generalize this approach to four-dimensional nonlinear EnVar methods, which are based on both a nonlinear variational analysis and the propagation of dynamical error statistics via an ensemble. This forces to consider the cost function mini10 mizations in the broader context of cycled data assimilation algorithms. We adapt this QS approach to the iterative ensemble Kalman smoother (IEnKS), an exemplar of nonlinear deterministic 4D EnVar methods. Using low-order models, we quantify the positive impact of the QS approach on the IEnKS, especially for long data assimilation windows. We also examine the computational cost of QS implementations and suggest cheaper algorithms.


Introduction
Data assimilation (DA) aims at gathering knowledge about the state of a system from acquired observations.In the Bayesian framework, this knowledge can be represented by the posterior probability density function (pdf) of the system state given the observations.A specificity of sequential DA is that observations are not directly available; they become available as time goes by.Thus, the posterior pdf should be regularly updated.
In order to do so, one usually proceeds in two steps: the analysis and the propagation (or forecast).During the analysis step, a background pdf is used as a prior together with the observation likelihood to construct the (often approximate) posterior pdf, following Bayes' theorem.During the propagation step, this posterior pdf is propagated in time with the model to yield the prior pdf of the next assimilation cycle.
In general these posterior and prior pdfs are not easily computable.In the Kalman filter, assumptions are notably made on the linearity of operators, to keep these pdfs Gaussian.This way, they are characterized by their mean and covariance matrix.Linear algebra is then sufficient to enforce both Bayes' theorem and the propagation step into operations on means and covariances.
However, with nonlinear models, the Kalman filter assumptions do not hold.The posterior and prior pdfs are not Gaussian anymore.A possibility in this case is to enforce Gaussianity with approximations.This requires the selection of mean and covariances intended for the Gaussian surrogate pdfs.With the Kullback-Leibler divergence, the best Gaussian approximation of a pdf is achieved by equating the mean and covariances (see, e.g., Bishop, 2006).However, the integrations necessary to evaluate these moments are also prohibitive.
In the 4D-Var algorithm (see, e.g., Lorenc, 2014, and references therein), Laplace's approximation gives us a way to work around the problem by replacing the posterior mean with the presumed unique global minimizer of the cost function.A model propagation is then sufficient to estimate the prior pdf mean.This approach calls for efficient global optimization routines.
However, in practice, solving a global optimization problem is challenging when the number of unknowns is large, and local methods like Gauss-Newton are often preferred in practice (see, e.g., Björck, 1996).
Unfortunately, Gauss-Newton methods' ability to locate the global minimum depends on the minimization starting point and on the cost function properties.Furthermore, missing this global minimum is likely to cause a quick divergence (from the truth) of the sequential DA method.Thus, it is critical for the assimilation algorithm to keep the minimization starting point in a global minimum basin of attraction.
This requirement is constraining because, with a chaotic model, the number of local minima may increase exponentially with the data assimilation window (DAW) time extent L (Miller et al., 1994;Pires et al., 1996).Unfortunately, assuming a perfect, chaotic -and hence unstable -model, this is also for the longest time extents that the assimilation performs best.Several strategies have been investigated to go beyond this restriction.Pires et al. (1996) propose the quasi-static (QS) minimization in a 4D-Var context: as the observations are progressively added to the cost function, the starting point (or first guess) of the 4D-Var minimization is also gradually updated.Ye et al. (2015) propose to gradually increase the model error covariances in the weak-constraint 4D-Var cost function in a minimization over an entire trajectory; this way the model nonlinearity is gradually introduced into the cost function (see also Judd et al., 2004).They also propose to parallelize this minimization over multiple starting points to increase the chance to locate the global minimum.
On the one hand, the 4D-Var benefits from the QS approach to approximate the posterior and prior means.On the other hand, with traditional 4D-Var, the prior covariance matrix is taken as static.This is appropriate when, as in Pires et al. (1996), Swanson et al. (1998) or Ye et al. (2015), only one cycle of assimilation is considered.But this limits the dynamical transfer of error statistics from one cycle to the next.
In contrast, 4D ensemble variational (EnVar) schemes allow to both perform a nonlinear variational analysis and a propagation of dynamical errors via the ensemble (see Chapter 7 of Asch et al., 2016).The improvement brought by QS minimizations on these schemes has been suggested and numerically evaluated in Bocquet andSakov (2013, 2014); Goodliff et al. (2015).This motivates a more complete analytical and numerical investigation.
The iterative ensemble Kalman smoother (IEnKS) (Bocquet and Sakov, 2014;Bocquet, 2016) is the archetype of such 4D nonlinear EnVar scheme, where the ensemble parts of the algorithm are deterministic.Using low-order models (usually toymodels), it was shown to significantly outperform 4D-Var, the EnKF or the ensemble Kalman smoother in terms of accuracy.
The IEnKS improves the DA cycling by keeping track of the pdfs mean and covariance matrix.To do this, Laplace's approximation is used to replace the posterior mean and covariance matrix with the minimizer of the cost function and an approximation of the inverse Hessian at the minimizer, respectively.These moments are then used to update the ensemble statistics.The updated ensemble is then propagated to estimate the prior mean and covariance matrix.Hence, it is also critical for the IEnKS to locate the global minimum of the cost function.
Here, we are interested in the application of the QS minimization to the IEnKS.It will be shown that the bigger the DAW, the better the performance as long as the global minimum is found.Because the QS minimization improves this detection, the IEnKS will benefit from it.
The rest of the paper is organized as follows.In section 2, the performance dependency of 4D-Var and IEnKS algorithms on the DAW parameters is investigated.In order to do so, a brief presentation of 4D-Var and the IEnKS algorithms is given.Then we define a measure of performance for assimilation algorithms.This definition is used to give analytic expressions for the accuracy of both algorithms with a linear, diagonal, autonomous model.This quantifies the impact of cycling on the algorithms.
After these preliminaries, the nonlinear, chaotic case is studied.In section 3, we provide and describe the algorithms of the quasi-static IEnKS (IEnKS-QS).Section 4 is dedicated to numerical experiments with two Lorenz low-order models and to the improvement of the IEnKS-QS numerical efficiency.Conclusions are given in section 5.
We emphasize that the algorithmic developments of this study are not meant to improve high-dimensional, imperfect model, data assimilation techniques.Even if Miller et al. (1994) show some similarities between the perfect and imperfect settings of a model of intermediate complexity, model error would generally forbid the use of very long DAWs as sometimes considered in this study.Instead, the objective of this paper is to better understand the interplay between chaotic dynamics, ensemble variational data assimilation schemes and their cycling, irrespective of whether they could be useful in high-dimensional systems.

The data assimilation window and the assimilation performance
After reviewing 4D-Var (in a constant background matrix version) and the IEnKS algorithms, we will investigate the dependency of assimilation performance on the DAW key parameters.This will illustrate the cycling improvement brought in by the IEnKS compared to 4D-Var.We shall see that, with chaotic models, the longer the DAW is, the better the accuracy of these algorithms.
The evolution and observation equations of the system are assumed of the form: where the unknown state x l at time t l is propagated to t l+1 with the model resolvent M : R m → R m .The model is assumed to be perfect so that there are no errors in Eq. (1b) and autonomous (M does not depend on time).The observation operator H : R m → R d relates the state x l to the observation vector y l .The observation errors (ε l ) l≥0 are assumed Gaussian with mean 0 ∈ R d and covariance matrix R ∈ R d×d ; they are uncorrelated in time.

4D-Var and IEnKS algorithms
Both 4D-Var and the IEnKS use a variational minimization in their analysis step.The objective of this minimization is to locate the global maximum of the posterior pdf p (x 0 |y L:K ) of the system past state x 0 given the observations The system state and observations are seen as random vectors with values in R m and R d , respectively.The posterior pdf quantifies how our knowledge on the state x 0 changes with realizations of y L:K .Thus, its maximum is the most probable state after assimilating the observations.The DAW is displayed in Fig. 1.The parameters K and L are the time index of the DAW first and last assimilated observation batch, respectively.The number of observations is S and satisfies S = L − K + 1 in our case (i.e., no overlap between the DAWs).To specify this posterior pdf, we have to make Figure 1.Schematic of a DAW.The state variable at t0 is x0, the observation vector yK at time tK is the first of the DAW to be assimilated and the observation yL at present time tL is the last one, S = L − K + 1 is the number of observations to assimilate in the current cycle.
These observations, possibly observation vectors, are represented by black dots.
further assumptions on x 0 .
The initial state x 0 is assumed to be Gaussian with mean x b 0 ∈ R m and covariance matrix B ∈ R m×m : With these assumptions, an analytic expression can be obtained for the posterior pdf p (x 0 |y L:K ) at the first cycle, or for the cost function associated with this pdf.The latter is defined as The notation G is used rather than the traditional J to refer to an exact cost function, i.e., a cost function defined from an exact posterior pdf.Bayes' theorem yields: and with the Gaussian assumption on the background and observation errors we have where x 2 A = x T Ax is, the norm of x associated with a symmetric positive definite matrix A; c 0 is a normalization constant ensuring e −G(x0|y L:K ) dx 0 = 1; M l stands for l compositions of M.
The propagation corresponds to a time shift of S time steps.Thus, at the k-th assimilation cycle, the posterior pdf is p (x kS |y kS+L:K ).Using Bayes' theorem the k-th cycle cost function is: where − ln p x kS |y (k−1)S+L:K is the background term.If the model and observation operators are nonlinear, an analytical expression for this prior is not accessible and one needs to approximate it.The 4D-Var and the IEnKS algorithms are solutions based on distinct approximation strategies.
The 4D-Var cost function at the k-th cycle, based on the static error covariance matrix B, is defined by The analysis of 4D-Var consists in minimizing Eq. ( 7), yielding x a kS at t kS .Because this cost function depends on realizations of the random observations, x a kS is also a random variable.This analysis is then propagated at time t (k+1)S with the resolvent of the model to produce the next cycle background state: In general, G (x 0 |y L:K ) and J x 0 ; y L:K , x b 0 only coincide at the first cycle of an assimilation because of the assumption Eq. (2).Subsequently, the background term of the 4D-Var cost function where E and C are the expectation and covariance operators, respectively; xb kS and X b kS are the empirical mean and normalized anomaly of E b kS , respectively: with T ∈ R n a vector of ones and I n is the identity of R n .Note that Eqs (9, 10) are not exact because of sampling errors.If the state vector is of the form: with w kS ∈ R n the control variable in the ensemble space, the IEnKS cost function is defined in the ensemble space by where • = • I with I the identity matrix.The analysis of the IEnKS also consists in minimizing this cost function.It yields an analyzed ensemble E a kS at time t kS verifying: with xa kS and X a kS respectively the empirical mean and normalized anomalies of the analyzed ensemble, w a kS the cost function minimizer, ∇ 2 J w a kS ; y kS+L:kS+K , E b kS its Hessian at this minimum (usually approximated to avoid computations of the model second derivatives) and U ∈ R n×n an orthogonal matrix such that U1 n = 1 n .This analyzed ensemble is also propagated to time t (k+1)S to produce the next cycle background ensemble The cycle is completed by using in the next analysis the cost function Eq. ( 14) with time indexes incremented by S.

Performance of assimilation
In order to evaluate the efficiency of 4D-Var and the IEnKS with DAW parameters S and L, two measures of accuracy are investigated here: the usual empirical RMSE, and a theoretical counterpart.
At the k-th cycle, the algorithm generates at time t kS an analysis x a kS from the observations.This analysis is propagated with the model l steps forward in time to yield the analysis x a kS+l meant to approximate the system true state x kS+l .A traditional measure of an assimilation performance is the root mean square error (RMSE).It is defined by The RMSE takes different names depending on the time it is computed at.
it is the smoothing RMSE with lag L − l.In the following, the smoothing RMSE will correspond to the one with (maximum) lag L.
The RMSE rigorously depends on the random variable realizations, and thus it is also a random variable.In our numerical experiments, as is usually done, the RMSE is averaged over the cycles to mitigate this variability: Let us assume there is a random couple x ∞S+l , x a ∞S+l whose distribution is invariant and ergodic with respect to the shift transformation: Then by Birkhoff's ergodic theorem (see Walters, 1982) the sequence (aRMSE N ) N converges when N → ∞ and its limit aRMSE verifies: where the expectation E is taken over p x ∞S+l , x a ∞S+l .In this case, the aRMSE measures the long time impact of the cycling on the assimilation accuracy.This limit is difficult to exploit algebraically.That is why, in the theoretical developments, we will prefer the expected MSE (eMSE), denoted by P : where the expectation is taken over p x kS+l , x a kS+l .In the following subsection, we will focus on the long term impact of the cycling on P .Simplifying assumptions will be made to express P ∞S+l = lim k→∞ P kS+l as a function of S and L.

Performance in the linear, diagonal, autonomous case
In order to obtain analytical expressions of the eMSE for 4D-Var and the IEnKS, we make drastic simplifying assumptions.
First, the model is assumed to be the resolvent of a linear, diagonal, autonomous ordinary differential equation.Thus, it can be expressed as where M = diag (α i ) i=1..m is diagonal and does not depends on x.We further assume H = hI m , B = bI m , R = rI m , where I m is the identity matrix of R m and h, r, b > 0. With these assumptions, appendix A provides an expression for the 4D-Var asymptotic eMSE in the univariate case.The generalization to the diagonal multivariate case is obtained by summing up the eMSEs of each direction: The case ∆ i ≥ 1 means that too much credit is given to the background variance, which is approximated for all cycles by the constant b in our 4D-Var scheme.Therefore, the information carried by the observations is not sufficient to mitigate the exponential growth of errors in the propagation.Concerning the IEnKS, the anomalies are assumed to be full rank to avoid any complication due to singular covariance matrices.Moreover, the linearity of the model is employed to express the background statistics: This way, sampling errors are avoided as they are not the focus of this study.The background ensemble E b kS becomes a notational shortcut for the pair xb kS , X b kS .This simplified IEnKS is a Kalman smoother (Cosme et al., 2012;Bocquet and Carrassi, 2017).With these assumptions, appendix B gives an expression for the IEnKS asymptotic eMSE in the univariate case.The optimality of the IEnKS eMSE is also proven.The generalization to the diagonal multivariate case is also obtained by summing up the eMSEs of each direction: This expression shows that the eMSE components on the stable directions are null.Indeed one expects the IEnKS to be at least more efficient than a free-run, whose errors in the stable directions tend to zero1 .This is not the case for 4D-Var since the static background covariance matrix introduces spurious variance in the stable directions as seen in Eq. ( 23).In Trevisan et al.
(2010), 4D-Var error variances in the stable directions are forced to zero to improve the accuracy of the assimilation.
In the following, we study the eMSE dependency on the DAW parameters, S and L, for both algorithms.We focus on a bivariate case with α 1 = 1.2, α 2 = 0.8 in order to have one stable and one unstable direction; h = b = r = 1.Using Eqs (23,25), the asymptotic smoothing and filtering eMSEs are displayed as a function of L, S in Fig. 3.The eMSE components on the stable and unstable directions are also shown.Those graphs are interpreted in the following.
Specifically, the eMSE expression for the IEnKS is of the form where c 1 does not depend on S, L, l.The contribution to P IEnKS ∞S+l on the stable direction is zero.It depends only on the lag L − l; it does not depend on S.Moreover, the filtering eMSE (l = L) is constant: the propagation compensates for the analysis.
Concerning 4D-Var, we assume K fixed and S → ∞ to give an asymptotic eMSE expression:  DAW.This improvement dominates the detrimental impact of the Gaussian background approximations.In the filtering eMSE, this improvement is balanced by the propagation of the analysis at the end of the DAW.In Fig. 3, the filtering eMSE stable component is damped such that it has little effect.However, on the unstable component, the parameter S improves the filtering eMSE.The bigger S , the closer P 4D−Var ∞S+l is to P IEnKS ∞S+l , which is optimal (cf.appendix B).A qualitative explanation is that, when S is big, the 4D-Var uses less often the inexact background error variance, making the analysis more trustworthy.to those of Trevisan et al. (2010).Concerning 4D-Var, the eMSE can be written as a function of the lag L − l: Thus, the unstable component of the eMSE is an exponentially decreasing function of the lag and the stable component is an exponentially increasing function of the lag.The sum is therefore decreasing when the unstable component is dominant; it is increasing when the stable component is dominant.
In this section, we have studied the accuracy of both the IEnKS and 4D-Var eMSEs as a function of the DAW parameters under linear, autonomous, diagonal assumptions.We found that the DAW parameter L improves the smoothing eMSE and S improves the filtering eMSE.These properties will be discussed and numerically investigated in a nonlinear context in the next section.

Performance in the nonlinear, chaotic case
The results of section 2.3 foster the use of the largest possible L to improve the 4D-Var and IEnKS smoothing eMSEs.For filtering, the error propagation at the end of the DAW balances the gain in eMSE due to the assimilation of future observations.Thus, the filtering eMSE is not improved by the assimilation of observations distant in time; it is rather affected by the background pdf approximation.To assimilate the same number of observations, algorithms using high values of S need fewer cycles.Therefore, they rely less often on the background approximation.That is why the 4D-Var filtering performance is improved with S. For the IEnKS as in section 2.3, the prior pdf approximation is exact so that there is no dependence on S.This is no longer true in the nonlinear case.Hence, with a similar reasoning, one actually expects an improvement on the IEnKS filtering performance with S. As a matter of fact, it has been shown that with a nonlinear chaotic model, the filtering accuracy increases with L in most cases (see Bocquet and Sakov, 2014, and section 4 of the present paper).
However, with a chaotic model, Pires et al. (1996) showed that the 4D-Var cost function number of local extrema increases with L, making minimization problematic.We show in this section that the IEnKS cost function suffers from the same problem.
This behavior will be illustrated with the Lorenz 95 (L95) model (Lorenz and Emanuel, 1998).It represents a mid-latitude zonal circle of the atmosphere and is described by a set of m nonlinear differential equations: where x j is the j-th modulo m component of x, m = 40 and F = 8.This equation is integrated using a fourth-order Runge-Kutta scheme with a time-step of δt = 0.05.The dynamics of L95 are chaotic; the L95 largest Lyapunov exponent is λ 1.7.This hilly shape causes minimization problems.A possible minimization procedure for the IEnKS is the Gauss-Newton (GN) algorithm (e.g., Björck, 1996).GN is not a global procedure meaning that, depending on the starting point and the cost function properties, the algorithm can converge towards any local extremum, take many iterations or even diverge.However, if the cost function is quadratic, the global minimum is reached in one iteration.The cost function non-quadraticity is induced by the model nonlinearity.In the following, we will give a heuristic argument yielding a bound on the S parameter beyond which the GN method probably misses the global minimum in the configuration First, the GN convergence properties are drastically simplified.We assume the method converges to the global minimum if and only if the minimization starting point is in a neighborhood of the global minimizer where the IEnKS cost function is almost quadratic.
Unfortunately, this minimizer is unknown because the cost function depends on realizations of many random variables.In order to eliminate this variability, Pires et al. (1996) introduced a so-called error-free cost function.We will rather use an Relying on an ergodicity assumption, appendix C proves that this averaged cost function verifies where the ergodic random variables x∞S , X b ∞S and x ∞S have been defined in section 2.2 and Appendix C. As seen in Eq. ( 31), a sufficient condition for the starting point w = 0 to be in a neighborhood of the global minimizer where the cost function is assumed almost quadratic is to require that xb ∞S be in a neighborhood of x ∞S where all the M l K≤l≤L are almost linear.In the univariate case, if the model behavior is almost linear and unstable, we can use Eq. ( 25) to estimate the terms in the sum in Eq. ( 31) at the starting point w = 0: where α is the model linear part and the extra S accounts for the propagation.But in a necessarily bounded physical system, the right-hand side of Eq. ( 32) cannot grow indefinitely with l + S. Such model saturation imposes with B a bound.Hence, Eqs (32,33) yield the following inequality on S: We choose l = L because it corresponds to the most constraining case.When Eq. ( 34a) is violated, xb ∞S departs from x ∞S such that the nonlinearities of M L are significant.
To apply this inequality to the L95 model we choose: σ being the mean of the singular values greater than 1.This corresponds to an average of the error amplification by dM dx (x ∞S ) in the local unstable subspace.We also choose for the bound B the average squared norm between two long trajectories: This quantity is greater than E δx b ∞S+l 2 because the IEnKS asymptotic performance is at least better than a free run.
From the values of α and B we find S max = 14.
Figure 6 shows the filtering and smoothing aRMSEs of an IEnKS L = S with L95 as a function of S, the performance strongly deteriorates for S > 16, which is remarkably consistent with our estimation.with the linear case: the IEnKS filtering aRMSE depends on S. The former discussion on the local extrema explains this dependency for large values of S.
However, for small values of S, the decreasing aRMSE has not been explained.This is a consequence of the Gaussian background approximation.At each cycle, the IEnKS uses the background ensemble E b kS to estimate the first two moments of the background pdf and makes the approximation: Because the model is nonlinear, this pdf is unlikely to be Gaussian.Therefore, this approximation results in a loss of information and the more it is used, the farther from G the IEnKS cost function is.This is exactly what happens when S is small: to assimilate the same number of observations, the IEnKS uses more cycles so that it relies more on the Gaussian background approximation. the first minimization is performed using the cost function with S = 1 then S is increased and another minimization can be performed with the former minimizer as the new starting point.The process is then repeated until S reaches its final value.
This procedure is directly applicable to the IEnKS cost function minimization.The left panel in Fig. 7 is a schematic of a QS minimization and Algorithm 1 gives the pseudo-code of an IEnKS with a QS minimization.The new parameters (L q ) q<N Q control the number of observations added at each minimization.The three first lines initialize the minimization starting point, the ensemble mean and anomaly matrix.The for loop in lines 4-23 repeats the QS minimization.The while loop in lines 6-22 is the Gauss-Newton minimization.Lines 7 and 8 center the ensemble on the current minimizer.Lines 9 and 10 initialize the cost function gradient and the approximate Hessian.The for loop in lines 12-18 construct observation terms of the cost function gradient and approximate Hessian.Lines 13 and 14 use a finite difference formula to compute the tangent linear and adjoint of w → H • M l xb 0 + X b 0 w .Lines 15 and 16 use this adjoint to update the gradient and approximate Hessian.Lines 19 and 20 solve the linear system of the Guauss-Newton algorithm to update the current minimizer.When GN convergence is reached, this minimizer will be used as a starting point for the next QS minimization.Line 24 updates the ensemble.Line 25 propagates the updated ensemble to the next assimilation cycle.
In Fig. 5, the QS scheme corresponds to minimizing a cost function in the bottom row then using its minimum as a starting point for the minimization of the top right cost function and so on.
Note that Bocquet and Sakov (2014) already qualified the SDA IEnKS with S = 1 as QS.The reason is that, at each propagation, only one vector of observations enters and only one leaves the DAW, slightly deforming the cost function.It is also the easiest way to ensure inequality (34a).However, this method has been shown to be suboptimal because of the Gaussian background approximations.
An alternative to keep S = 1 with many observations in the DAW is to relax the condition K = L − S + 1.This way, observations are assimilated several times.This is done in the MDA IEnKS (Bocquet and Sakov, 2014).But to keep the statistics consistent at least in the linear/Gaussian case, the observations error covariances are adequately altered.Because S = 1, the cost function is slightly modified between each assimilation.That is why the MDA IEnKS is qualified as QS.
However, the multiple assimilation of the observations introduces spurious correlations in the nonlinear/non-Gaussian case, which entail sub-optimality.A scheme similar to the IEnKS-QS has also been successfully used in Carrassi et al. (2017) to compute model evidence.Indeed, the efficient computation of model evidence as an integral over the state space depended on the proper identification of a global maximum of the integrand.However, its implementation was based on the update of EK := M K (E0) 12: for l = K, ..., Lq do 13: ȳl := H (E l ) 1/n 14: end for until ||δw|| ≤ δ or j ≥ jmax 23: end for the ensemble whenever an observation batch is added to the cost function, which is not as numerically efficient as the scheme presented here.
The success of the QS minimization lies in the fact that, when an observation is successfully assimilated, the eMSE is reduced.Thus, the analysis probability mass concentrates around the true state.The analysis is then more likely to be in a neighborhood of the true state where the model is linear.The cost function non-quadraticity can then be increased by adding a 5 new term in it.This is confirmed by the following argument.Let P (q) be the IEnKS asymptotic eMSE at the q-th step of a QS 17 minimization.With the notations and assumptions of section 2.3, i.e. in a linear context, we have the recurrence relation: Thus, P (q+1) < P (q) and we can increase the cost function non-quadraticity by adding new terms in it as long as the propagation of errors does not exceed the bound This implies which yields S ≤ N Q S max .Therefore, the QS minimization allows for a N Q times longer DAW.
Unfortunately, these QS minimizations are very expensive.Indeed, they add a third outer loop repeating N Q GN minimizations.The GN iterations used to compute the intermediate starting points give unnecessary precision; all that is required for these starting points is to be in a neighborhood of x ∞S where the model is almost linear.Thus, one can restrain the number of intermediate GN loops and save the full convergence to the last minimization.This is done in the quasi-convergent IEnKS (IEnKS-QC) with the parameters (j q ) q<N Q .They correspond to the numbers of GN loops in the intermediate QS minimizations.They are typically equal to 1 except for the last one.Algorithm 2 gives the pseudo code of the IEnKS-QC and the right panel in Fig. 7 is a schematic for it.
4 Numerical experiments with low-order models In the following, we perform numerical experiments with the Lorenz 1963 (L63) and Lorenz 1995 (L95) models.L95 has already been presented in section 2.4.L63 (Lorenz, 1963) is a simplified model for atmospheric convection.It is defined by the ordinary differential equations: These equations are integrated using a fourth-order Runge-Kutta scheme with a time-step of δt = 0.01 and (σ, ρ, β) = (10, 28, 8/3).
The dynamics of the L63 model are chaotic, with a largest Lyapunov exponent given by λ 0.91.
Both models are assumed perfect.The truth run is generated from a random state space point.The initial ensemble is until ||δw|| ≤ δ or j ≥ jq 23: end for are generated from the truth with H = R = I m every ∆t = 0.05 for L95 and every ∆t = 0.02 for L63.A burn-in period of 5 × 10 3 × ∆t is enforced in both cases.
Finally the aRMSE is averaged over a number of cycles which is determined by the number of observations assimilated.We use 5 × 10 5 observation vectors for L95 and 5 × 10 6 observation vectors for L63.Unlike Goodliff et al. (2015), our numerical minimizer:  cost.However, for L > 40 with the L95 model, the quasi-static approach cannot sustain the non-linearity anymore and the IEnKS-QS aRMSE degrades.Hence, the IEnKS-MDA S = 1 has still the best performance.
Figure 11 compares the smoothing aRMSE, the filtering aRMSE and the number of ensemble propagations of the IEnKS-QS and IEnKS-QC as a function of the N Q parameter.The IEnKS-QS aRMSE decreases quickly after a point for both algorithms and for both models.For the IEnKS-QS with the L95 model, this point can be estimated using results of section 3 by S/S max 3.6, in remarkable agreement with the experiments.This point comes later for the IEnKS-QC but it demands less ensemble propagations making this algorithm numerically more efficient than the IEnKS-QS.

Conclusions
In this paper, we have extended the study of Pires et al. (1996) on quasi-static variational data assimilation, focused on 4D-Var technique, to cycled data assimilation schemes and specifically four-dimensional nonlinear ensemble variationnal techniques, 10 an exemplar of which being the iterative ensemble Kalman smoother (IEnKS).
The long term impact of cycling has been first investigated theoretically in a linear context for 4D-Var and the IEnKS, then numerically for the IEnKS in a nonlinear context.The way information is propagated between data assimilation cycles indeed makes up for the difference between 4D-Var and the IEnKS.Both reveal performance improvements with the DAW parameter S, which counts the number of observation batches within the DAW.This is a consequence of the Gaussian background approximation: the larger S is, the less the assimilation relies on it.
However, it is observed that this improvement has a limit in the chaotic, perfect model case.The cost function global minimum basin of attraction appears to shrinks with L. This causes the Gauss-Newton procedure to miss the cost function global minimum, which deteriorates the assimilation performance.
Quasi-static minimizations lead slowly but surely to the global minimum by repeated cost function minimizations.As the DAW length L is gradually increased, the starting point of the minimization remains in the global minimum basin of attraction.
For most S, L couples, the quasi-static IEnKS turns out as a more accurate substitute for the multiple data assimilation IEnKS (IEnKS-MDA).
Unfortunately, this method adds an outer loop which could significantly increase the numerical cost.Precision on intermediate minima being superfluous, one can limit the intermediate Gauss-Newton number of loops.The unavoidable space increments required to minimize the non-quadratic cost function are thus reported in time in the quasi-convergent IEnKS.
We did not focus on the applicability of the methods to high-dimensional imperfect models.In particular, we consider very long DAWs, which, even if of high mathematical interest or for low-order reliable models, is less relevant for significantly noisy models.An extension to this work would therefore consists in investigating the same ideas but using weak constraint 4D-Var and IEnKS (Trémolet, 2006;Sakov et al., 2017).Then, from Eq. ( 16) we get the recurrence relation: Thus, X a kS is not random and P IEnKS kS = (X a kS ) 2 .This last equation with Eq. (B8a) tell that the sequence of inverse IEnKS eMSEs is arithmetico-geometric: where the notation Eq. (A7) has been used.Properties of arithmetico-geometric sequences allow to obtain the IEnKS asymptotic eMSE: and the generalization to asymptotic eMSEs with lag L − l is straightforward: Let us now show that the IEnKS eMSE is optimal.Let x a kS (y kS+L:K ) be the 4D-Var analysis or any other function of y kS+L:K .A bias-variance decomposition (e.g., Bishop, 2006)  In the multivariate, diagonal case the algebra can be conducted on each direction independently.Thus, the eMSE in this case is the sum of the univariate eMSEs of each direction.

1 2 x b kS − x kS 2 B
−1 is a Gaussian approximation of the exact background term − ln p x kS |y (k−1)S+L:K .By definition, the background error covariance matrix B of the traditional 4D-Var cost function is the same for each cycle.This is not the case for the IEnKS.Nonlin.Processes Geophys.Discuss., https://doi.org/10.5194/npg-2017-65Manuscript under review for journal Nonlin.Processes Geophys.Discussion started: 16 November 2017 c Author(s) 2017.CC BY 4.0 License.The IEnKS (Bocquet and Sakov, 2014) is an ensemble method with a variational analysis.Two versions of the algorithm exist: the singular data assimilation (SDA) version where observations are assimilated only once; the multiple data assimilation (MDA) version where they are assimilated several times.We focus on the SDA version in the theoretical development.The MDA version is mentioned in the numerical experiments.At the k-th cycle of the IEnKS, the background ensemble at t kS is obtained by a propagation from the previous cycle.The ensemble members are the columns of the matrix E b kS , which is seen as a random matrix with values in R m×n .It is used to estimate the prior mean and covariance matrix:

Figure 2 .
Figure 2. Chaining of the 4D-Var and IEnKS two first cycles with S = 3, L = 4.The first 4D-Var analysis use the background x b 0 at t0 and the observations yL:K to produce the analysis x a 0 at t0.It is propagated S steps forward in time to produce the new background at time tS where another analysis can be performed.The IEnKS does the same but with an ensemble.The dashed rectangle symbolizes the current DAW, black dots represent the observations assimilated in the current cycle, gray dots represent already assimilated observations and white dots represent observations not assimilated.

where c 2
is constant with S, l and o (1) → 0 when S → ∞.The unstable component is close to the IEnKS overall eMSE.The biggest difference with it concerns the eMSE on the stable component.The inexact background variance modeling adds to the eMSE a detrimental term.

Figure 3 .
Figure 3.The asymptotic smoothing (on the left column) and filtering (on the right column) eMSEs of 4D-Var and the IEnKS.They are displayed as functions of L, S with their components on the stable, unstable subspaces.The L parameter is on the abscissa axis for the 4D-Var and on the ordinate axis for the IEnKS.The S parameter is on the abscissa axis for the IEnKS and on the ordinate axis for the 4D-Var.

Figure 4
Figure 4 displays the asymptotic eMSEs of both algorithms as a function of the lag for S = L = 5.These curves are similar

Figure 4 .
Figure 4.The asymptotic eMSEs as a function of the lag (lag = 0 is the filtering performance, lag = L is the smoothing performance).The superposing curves have been slightly translated for better readability, S = L = 5, h = b = r = 1.

Figure 5
Figure5shows a typical IEnKS cost function profile in one direction of the analyzed ensemble space for multiple values of the DAW parameters.The system is observed at every time step and H = B = R = I m .The curves have more and more local extrema when L increases.The curves with the highest amplitudes of the ripples are found for small values of S. Indeed, an averaging effect may settle in as the number of observations increases.

Nonlin.Figure 5 .
Figure 5. Cost functions of the IEnKS projected in one direction of the analyzed ensemble (hence centered and normalized) with various DAW parameters S, L. A quasi-static minimization can be visualized from these panels.It begins with the bottom-left cost function; the orange dot is the starting point and the green one is at the minimum.From the bottom-left cost function up to the top-right cost function, batches of 9 then 10 observation vectors are progressively added to the DAW, and the minimizer (green dot) is updated accordingly.

Figure 6 .
Figure 6.Smoothing aRMSE (top) and filtering aRMSE (bottom) of a Gauss-Newton IEnKS L = S as a function of S with L95 model over 5 × 10 5 cycles (logarithmic scale).We used the finite-size version (Bocquet, 2011) to account for sampling errors and avoid the need for inflation, H = R = B = Im, n = 20.
Nonlin.Processes Geophys.Discuss., https://doi.org/10.5194/npg-2017-65Manuscript under review for journal Nonlin.Processes Geophys.Discussion started: 16 November 2017 c Author(s) 2017.CC BY 4.0 License.3Quasi-static algorithmsWe have seen in the previous section that the effective DAW length should be limited by the cost function non-quadraticity.In this section we review and propose algorithms able to overcome these minimization issues and reach longer DAWs.Quasi static algorithms have been introduced byPires et al. (1996) in a 4D-Var context.The idea behind quasi-staticism is to control the way observations are added to the cost function in order to keep the minimization starting point in the basin of attraction containing the global minimizer.The method consists in repeating the minimization with increasing parameter S: Nonlin.Processes Geophys.Discuss., https://doi.org/10.5194/npg-2017-65Manuscript under review for journal Nonlin.Processes Geophys.Discussion started: 16 November 2017 c Author(s) 2017.CC BY 4.0 License.Algorithm 1 One cycle of the IEnKS-QS Require: E b 0 the background ensemble at t0; λ the inflation; (Lq)q<N Q a list of DAW time indexes; ε the finite differences step; δ, jmax GN end of loop parameters; 1 = (1, . . ., 1) T ∈ R n and I is the identity of R n .

Figure 7 .
Figure7.Schematics of an IEnKS-QS minimization (on the left) and a IEnKS-QC minimization (on the right).The rectangle contains the observations to be assimilated, the black dots represent the observations used in the current minimization.The "Gauss-Newton" surrounded by arrows represents the iterations of the Gauss-Newton procedure.The number of quasi-static steps is NQ = 3.The flow of observations is controlled by the parameters (L0, L1, L2) =(3,5, 6).For the QC IEnKS, the number of GN iterations is controlled by the (j0, j1, j2) parameters.

Figure 10 Figure 8 .
Figure10compares the smoothing aRMSE, the filtering aRMSE and the number of ensemble propagations of the IEnKS-QS (N Q = S, S = L) and the IEnKS-MDA (S = 1), for both L63 and L95 as a function of L. For L < 40, the IEnKS-QS has smaller aRMSEs than the IEnKS-MDA.The IEnKS-QS is SDA so it does not suffer from suboptimality related to multiple assimilations.Moreover, the IEnKS-QS always requires less propagations of the ensemble which improves the computational 10

Figure 9 .
Figure 9. IEnKS (lower triangles) and IEnKS-QS (upper triangles, NQ = S) smoothing and filtering aRMSEs as a function of L and S with the L63 model.The L parameter is on the abscissa axis for the IEnKS and on the ordinate axis for the IEnKS-QS.The S parameter is on the abscissa axis for the IEnKS-QS and on the ordinate axis for the IEnKS.

Figure 10 .
Figure 10.IEnKS-MDA (S = 1), IEnKS-QS and IEnKS (S = L, NQ = S) filtering and smoothing aRMSEs and number of ensemble propagations as a function of L for the L95 and L63 models (logarithmic scale).The ensemble propagations are in units of ∆t.For instance, with L = 50, a single ensemble propagation through the DAW counts for 50.
of this estimator yields:E x kS ,y kS+L:K (x kS − x a kS ) =E y kS+L:K [V x kS [x kS |y kS+L:K ]] + E y kS+L:K (E x kS [x kS |y kS+L:K ] − x a kS ) 2