Joint state and parameter estimation with an iterative ensemble Kalman smoother

. Both ensemble ﬁltering and variational data assimilation methods have proven useful in the joint estimation of state variables and parameters of geophysical models. Yet, their respective beneﬁts and drawbacks in this task are distinct. An ensemble variational method, known as the iterative ensemble Kalman smoother (IEnKS) has recently been introduced. It is based on an adjoint model-free variational, but ﬂow-dependent, scheme. As such, the IEnKS is a candidate tool for joint state and parameter estimation that may inherit the beneﬁts from both the ensemble ﬁltering and variational approaches. In this study, an augmented state IEnKS is tested on its estimation of the forcing parameter of the Lorenz-95 model. Since joint state and parameter estimation is especially useful in applications where the forcings are uncertain but nevertheless determining, typically in atmospheric chemistry, the augmented state IEnKS is tested on a new low-order model that takes its meteorological part from the Lorenz-95 model, and its chemical part from the advection diffusion of a tracer. In these experiments, the IEnKS is compared to the ensemble Kalman ﬁlter, the ensemble Kalman


Introduction
Data assimilation in geophysics is often concerned with the estimation of the state of the system (e.g. atmosphere, ocean). Yet, non-observed parameters of the model can also be seen as control variables. They can indirectly be estimated through the assimilation of observations. In such context, data assimilation can be a powerful inverse modeling tool.
With the progress in techniques as well as the rise in popularity of data assimilation in geosciences, this topic has become of increasing interest. Parameter estimation is useful because it can account for model error through a parametric representation of the uncertain processes, and could serve as a tool to enhance the system state estimation. For instance, it is now accepted that air quality forecasting can benefit considerably from the online estimation of forcing parameters. Parameter estimation is also a fundamental tool per se in the estimation of the parameters which are often of physical or societal interests. For instance, again regarding air quality, data assimilation can help assess effective kinetic rates of interest to chemists, or it can help assess regulated pollutant emissions of interest to policy makers.

Data assimilation techniques for parameter estimation
As is the case with data assimilation for state estimation, two types of approach have been used for parameter estimation: filtering methods and variational methods. The estimation of parameters by the filtering approaches is based on the augmentation of the state vector with the parameter variables. If the state space has dimension M and if the number of parameters is P , then the augmented control vector has dimension M + P . Through the assimilation of observations, the joint analysis of the state variables and the parameters aims at building covariances (or higher-order dependencies for non-Gaussian filters) between them; these are crucially needed because of the non-observability of most parameters. The augmented state principle is likely to be used with any type of filter: extended Kalman filters (e.g., Kondrashov et al., 2008), ensemble Kalman filters (e.g., Aksoy et al., 2006;Wirth and Verron, 2008;Barbu et al., 2009), particle filters (e.g., Vossepoel and van Leeuwen, 2007;Weir et al., 2013), and stochastic sampling and genetic algorithms (e.g., Jackson et al., 2004;Liu et al., 2005;Bocquet, 2012;Posselt and Bishop, 2012). In an enlightening review, Ruiz et al. (2013) have discussed the use of ensemble Kalman filters (EnKFs) for parameter estimation. When the filtering method accounts for asynchronous observations by building covariances between parameter errors defined at distinct times, the method is usually referred to as a smoother (Evensen, 2003;Hunt et al., 2004;Sakov et al., 2010;Cosme et al., 2010).
The estimation of parameters with the variational approach is based on the explicit dependence of the cost function in not only the state variables, but also the parameters. If the dependence is not explicit, one should at least be able to compute the gradient of the cost function with respect to the parameters. The four-dimensional variational method, or 4D-Var (Le Dimet and Talagrand, 1986;Talagrand and Courtier, 1987;Rabier et al., 2000), has the distinct advantage of being a natural smoother since it works within a temporal window to assimilate asynchronous observations. However, it requires the use of the adjoint evolution model to compute gradients of the cost function. Computing the gradient with respect to the model parameters requires the same adjoint model, and also the extra effort of computing the explicit derivative of the cost function with respect to the parameters, in terms of the adjoint variables. It has been used for parameter estimation by, e.g., Pulido and Thuburn (2006), Bocquet (2012), and Kazantsev (2012).
This list of contributions to the field is far from being exhaustive and merely illustrates some of the methodologies used in atmospheric, ocean and climate sciences. In particular, there is a vast literature in atmospheric chemistry dedicated to the inversion of sources of pollutants and tracers. The extended and ensemble Kalman filters and variational methods (3D-Var and 4D-Var) have been employed in this field for over two decades (Zhang et al., 2012, and references within). Owing to the (quasi-)linearity of some chemical species, simpler four-dimensional smoothing analysis merely using a Best Linear Unbiased Estimator (BLUE) have also been extensively used to estimate sources.

The iterative ensemble Kalman smoother
The iterative ensemble Kalman smoother (IEnKS) has been recently proposed (Bocquet and Sakov, 2013) as an extension of the iterative ensemble Kalman filter Bocquet and Sakov, 2012). It is meant to solve the variational problem of 4D-Var with the help of a 4D ensemble. As such, it is a 4D ensemble variational method of the type used in the work by Buehner et al. (2010), Chen and Oliver (2012) and Fairbairn et al. (2013), and has (more remote) connections with ensemble of variational methods (Raynaud et al., 2009;Bowler et al., 2013). It does not require the use of the adjoint observation and evolution models since the sensitivities are estimated with the ensemble (Gu and Oliver, 2007;Liu et al., 2008). Moreover, the IEnKS generates the posterior ensemble using Gaussian assumptions and forecasts the ensemble to the next update step in the same way an ensemble square root Kalman filter does. Note that the scheme is not a hybrid method since it does not combine two distinct methods.
Because the IEnKS fundamentally solves a variational problem, it may require iterations for the cost function minimization. The number of iterations depends on the nonlinearity of the system. This number is expected to be small (1 or 2) for weak nonlinearity (typical of synoptic scale meteorology).
Using perfect model assumptions, Bocquet and Sakov (2013) have tested the IEnKS on two low-order models in different regimes representing different nonlinearities and lengths of the data assimilation window (DAW). The IEnKS (often significantly) outperforms EnKF and the standard ensemble Kalman smoother (EnKS) in all these regimes, not only regarding the smoothing performance (retrospective state estimation) but also regarding the filtering performance (state estimation at present and future time). Here, we will also show that the IEnKS also outperforms 4D-Var in this context.
In addition, the IEnKS has been shown on these models to be able to handle long DAWs, especially when assimilating observations several times (in a mathematically consistent manner).
Because the IEnKS offers the advantages of both filtering and variational methods, and because it is capable of operating on long DAWs, it has considerable potential as an efficient parameter estimation method.

Objective and outline
The objective of this article is to introduce a straightforward extension of the IEnKS to joint state and parameter estimation, and to test the potential of the approach on low-order models. The physical context is that of chaotic geophysical models, and of atmospheric chemical/tracer models, in which a joint state and parameter estimation is, in our opinion, a key to successful forecasts.
The algorithm of the IEnKS will be described in Sect. 2, in a compact but comprehensive manner. The method will then be generalized to joint state and parameter estimation. In Sect. 3, the capabilities of the IEnKS on the Lorenz-95 model (Lorenz and Emmanuel, 1998) will be reported. Additional tests will be performed: a comparison with the state-of-theart EnKF and standard EnKS, as well as with a 4D-Var, and with a new cycling of the IEnKS DAWs. Then the IEnKS will be tested for joint state and parameter estimation on the Lorenz-95 model (Lorenz and Emmanuel, 1998). In Sect. 4, an original extension of the Lorenz-95 with the advection of a tracer will be introduced. It is meant to represent the dynamics of an online atmospheric chemistry model, or meteorological models with a constituent such as moisture, with two unobserved parameters: the Lorenz-95 forcing parameter, and the emission flux. The IEnKS, the EnKF/EnKS, and a 4D-Var will be tested and compared in this context. The results will be discussed in Sect. 5. Conclusions will be drawn in Sect. 6.
2 The iterative ensemble Kalman smoother for joint state and parameter estimation

The algorithm
A Bayesian derivation of the IEnKS can be found in Bocquet and Sakov (2013). However, we would like to introduce the IEnKS comprehensively in this article: reference to Bocquet and Sakov (2013) will only be made regarding details that are not directly relevant to this study. Here, we describe the algorithm with its main justifications, and then provide its pseudo-code.

The core algorithm
Observation vectors y ∈ R d are assumed to be collected every time step t. Time is discretized into the times t k when the observations are collected. The number d of scalar observations within y can be time-dependent. The observations are related to the state vector through a possibly nonlinear, possibly time-dependent observation operator H k . The observation errors are assumed to be Gaussian-distributed, unbiased, and uncorrelated in time, and to have an observation error covariance matrix R k . The analysis step of the assimilation scheme is performed over a window of length L t in time units. Unless otherwise stated, time index k is relative to present time. With this convention, present time is set to be always t L , so that the initial condition of the DAW is conveniently always t 0 .
Let us first describe the update step. At t 0 (i.e. L t in the past), the background is obtained from an ensemble of N state vectors of R M : x 0,[1] , . . . , x 0,[n] , . . . , x 0, [N] . Index 0 refers to time while [n] refers to the ensemble member index. They can be stored in a matrix E 0 = [x 0,[1] , . . . , x 0, [N ] ] ∈ R M×N . One can equivalently represent the ensemble with its mean x 0 = 1 N N n=1 x 0,[n] and its anomaly matrix As in the ensemble Kalman filter, this background is approximated as a Gaussian distribution of mean x 0 , and covariance matrix A 0 A T 0 /(N − 1), the first-and second-order empirical moments of the ensemble. The background is rarely full rank since the anomalies of the ensemble span a vector space of dimension smaller than or equal to N − 1 and in a realistic context N ≪ M. Therefore, one solves for the analysis state vector x 0 in the ensemble space x 0 + Vec x [1] − x 0 , . . . , x [N] − x 0 , which can be written x 0 = x 0 + A 0 w, where w ∈ R N is a vector of coefficients in ensemble space. The analysis of IEnKS over [t 0 , t L ] is obtained from a cost function. The restriction of this cost function in state space to the ensemble space yields: The tilde symbol signifies that J is a mathematical object defined in ensemble space. M k←0 is the possibly nonlinear transition operator from t 0 to t k . {β k } 1≤k≤L are scalars in [0, 1] that weight the observations within the DAW. The choice of the β k can be made mathematically consistent and can have dramatic consequences on the performance of the data assimilation system. We refer to Bocquet and Sakov (2013) for a justification and numerical tests. Nonetheless, the rational for the choice of the {β k } 1≤k≤L will be discussed later.
This cost function is iteratively minimized in the ensemble space following the Gauss-Newton algorithm: using the gradient ∇ J (j ) and an approximate Hessian H (j ) of the cost function: x (j ) H (j ) is an approximation of the full Hessian because it disregards the contribution of the second-order derivatives of the innovation vectors δ k (w) in the cost function. The notation (j ) refers to the iteration index of the minimization. At the first iteration one sets A 0 is the tangent linear of the operator from ensemble space to the observation space. The estimation of this sensitivity using the ensemble is what allows one to avoid the use of the model adjoint. Two implementations, referred to as the transform and the bundle variants, have been put forward Bocquet and Sakov, 2012). With the bundle scheme, for instance, the ensemble is rescaled closer to the mean trajectory by a factor ε. It is then propagated through the model and the observation operators, after which it is rescaled back by the inverse factor ε −1 . The operation reads: where 1 = (1, . . . , 1) T ∈ R N . Note that each iterative update Eq.
(2) solves the inner quadratic variational problem: where z 2 G = z T G −1 z. The iteration is stopped when w (j ) − w (j −1) becomes smaller than a predetermined threshold e. Let us denote w ⋆ the solution of the cost function minimization. The symbol ⋆ will be used with any quantity obtained at the minimum. Subsequently, a posterior ensemble can be generated at t 0 : where U is an orthogonal matrix that is arbitrary but satisfies U1 = 1 -meant to keep the posterior ensemble centered on the analysis -and x ⋆ 0 = x 0 + A 0 w ⋆ . The Gauss-Newton minimization scheme shown in Eq.
(2) can easily be replaced by a quasi-Newton scheme that avoids the computation of the Hessian, or by a Levenberg-Marquardt algorithm that guarantees convergence of the minimization. These alternatives have been suggested and successfully tested in Bocquet and Sakov (2012). In the context of the standard models tested in Sects. 3 and 4, the nonlinearity is mild enough that a Levenberg-Marquardt scheme is unnecessary, and the Gauss-Newton scheme is very efficient.
This ends the part of the analysis step that is required to cycle the data assimilation scheme. An optional analysis step is required when a state estimation is desired at times t 1 , . . . , t L , i.e. up to present time, or when a forecast to future times is desired. This additional step depends on the choice of the β k and whether the DAWs are overlapping. In the simplest case, when observations are assimilated once and only once, this subsequent analysis takes the form of a forecast of the mean state x ⋆ k = M k←0 (x ⋆ 0 ), or a forecast of the full ensemble if one is additionally interested in estimating forecast uncertainty E ⋆ k = M k←0 (E ⋆ 0 ). During the forecast step of the scheme cycle, not to be confused with the forecast of the analysis step we just mentioned, the ensemble is propagated for S t, with S an integer:  Fig. 1. Chaining of the SDA IEnKS cycles. The schematic illustrates the case L = 5 and a shift of S = 2 time intervals t is applied between two updates. The method performs a smoothing update throughout the window but only assimilates the newest observations vectors (that have not been already assimilated) marked by black dots. Note that the time index of the dates and the observations are absolute, not relative, for this schematic.
If the optional analysis step implied forecasting the ensemble to or beyond t S , then there is no need to forecast it again. This ensemble at t S will form the background for the next analysis.
A typical chaining of the analysis and forecast steps is schematically displayed in Fig. 1.
A pseudo-code of the IEnKS is displayed in Algorithm 1. It does not show the optional analysis step, since the cycling of data assimilation does not depend on it. It is the same as the one presented in Bocquet and Sakov (2013), except that here it is given in the general case, 1 ≤ S ≤ L, rather than the specific case S = 1. The pseudo-code accounts for the possible use of inflation (lines 20, 21).
In summary, the IEnKS solves the variational problem of 4D-Var in the ensemble range. Because the variational problem is solved in a reduced space, there is no need for the adjoint evolution and observation models. The IEnKS generates and propagates the posterior perturbations following the scheme of the ensemble Kalman filter. As such, it uses sampled errors of the day.

Single and multiple assimilation of observations
There are some degrees of freedom in the choice of L, S and the {β k } 1≤k≤L . Let us just mention a few legitimate choices.
Firstly, for any L and S, such that 1 ≤ S ≤ L, the most natural choice for the {β k } 1≤k≤L is β k = 1 for k = L − S + 1, . . . L, and β k = 0 otherwise. That way, the observations are assimilated once and only once. We call this the single data assimilation scheme (SDA IEnKS). It is simple, and the optional analysis of the update step is merely a forecast of the analyzed state at t 0 , or possibly a forecast of the full ensemble from t 0 . When S = L, the DAWs do not overlap, but they do so when S < L. The chaining of the data assimilation cycles in the SDA case is displayed in Fig. 1.
Require: t L is present time. Transition model M k+1←k , observation operators H k at t k . Algorithm parameters: ǫ, e, j max . E 0 , the ensemble at t 0 , y k the observation at t k . λ is the inflation factor. U is an orthogonal matrix in R N×N satisfying U1 = 1. β k , 1 ≤ k ≤ L, are the observation weights within the DAW. 1: For very long data assimilation windows, the use of multiple assimilation (or splitting) of observations, denoted MDA in the following, can prove numerically efficient (Bocquet and Sakov, 2013). An observation vector y is said to be assimilated with weight β (0 ≤ β ≤ 1) if the following Gaussian observation likelihood is used in the analysis: where |R| is the determinant of R. The upper index of y β refers to the partial assimilation of y with weight β. The prior errors attached to the several occurrences of one observation are chosen to be independent. In that light, the {β k } 1≤k≤L are merely the weights of the observation vectors {y k } 1≤k≤L within the DAW. Statistical consistency necessitates that a unique observation vector is assimilated in such a way that the sum of all its weights in the data assimilation experiment is 1. For instance, if 1 = S ≤ L, consistency requires that L k=1 β k = 1. In a more general case in which the observation vectors have the same number of non-zero weights, L is a multiple of S : L = QS, where Q is an integer. As a result, consistency requires Q−1 q=0 β Sq+l = 1 with l = 1, . . . , S. In the MDA case (except the SDA subcase) the optional analysis step is more complex since it requires re-weighting Chaining of the MDA IEnKS cycles. The schematic illustrates the case L = 5, and S = 2. The method performs a smoothing update throughout the window potentially using all observations within the window (marked by black dots), except for the first observation vector assumed to be already entirely assimilated. Note that the time index for the dates and the observations are absolute for this schematic, not relative.
the observations within the DAW to obtain the correct analyses for states t 1 to t L and beyond. More details that are not directly relevant to this study can be found in Bocquet and Sakov (2013). Note that when the constraint L k=1 β k = 1 is not satisfied, the underlying smoothing probability density function (pdf) will not be the one targeted, but, with well chosen {β k } 1≤k≤L , could be a power of it (Bocquet and Sakov, 2013).
These MDA approaches are mathematically consistent in the sense that they are demonstrated to be correct in the linear model, Gaussian statistics case. An heuristic argument based on Bayesian ideas justifies the use of the method in the nonlinear case (Bocquet and Sakov, 2013).
The chaining of the data assimilation cycles in the MDA case is displayed in Fig. 2.
In the experimental Sects. 3 and 4, both SDA and MDA schemes will be used.

Augmented state formalism
We wish to estimate a set of model parameters θ ∈ R P along with the state variables. To do so, the state space is aug- of the joint state and parameter space. From the mathematical point of view, the analysis step of the IEnKS is unchanged.
As is usual in a parameter estimation context, a forward model needs to be introduced for the parameters. This model could be, for instance, the persistence model (θ k+1 = θ k ), or some jittering such as a Brownian motion, could be assumed (θ k+1 = θ k + ǫ k ). Depending on the constraints on the parameters, this jittering could also be constrained.
Technically, there is nothing more in the joint state and parameter IEnKS than in the state IEnKS. As opposed to the EnKF and EnKS, the objective of the joint state and parameter IEnKS is not to build covariances to help estimate hidden parameters, but instead to minimize a cost function that depends on the full augmented state. In a strongly nonlinear context, this approach could prove superior to the standard EnKF and EnKS.
As mentioned in the introduction, the estimation of model parameters within 4D-Var requires the adjoint model. Besides, the computation of the derivative of the cost function with respect to the parameters in terms of the adjoint field can be tedious. Parameter estimation with the IEnKS avoids this time-consuming task.
A potential advantage of the IEnKS over 4D-Var is that the errors of the day are by construction estimated within the IEnKS for all types of variables or parameters, whereas the 4D-Var modeling of background statistics of heterogeneous variables and parameters can be complex (see, for instance, Elbern et al. (2007), relating the modeling of inter-species correlation in a 4D-Var applied to air quality, or Montmerle and Berre (2010) in a meteorological convective scale context).
Similarly to state estimation, joint state and parameter estimation with the IEnKS in theory combines appealing features of both variational and ensemble Kalman filtering techniques. The purpose of the following numerical exploration is to investigate whether this holds true in experiments with low-order models.

Numerical experiments with the Lorenz-95 model
The Lorenz-95 one-dimensional model (Lorenz and Emmanuel, 1998) represents a mid-latitude zonal circle of the global atmosphere. It has M = 40 variables {x m } m=1,...,M . Its dynamics is given by the following set of ordinary differential equations: for m = 1, . . . , M, and the domain is periodic (circle-like). F is chosen to be 8 so that the dynamics is chaotic and has 13 positive Lyapunov exponents. A time step of t = 0.05 is meant to represent a time interval of 6 h in the real atmosphere. Unless otherwise stated, the time interval between each observational update will be t = 0.05, meant to be representative of a data assimilation cycle of global meteorological models. With such a value for t, the data assimilation system is considered weakly nonlinear, leading to statistics of errors weakly diverging from Gaussianity. This model is integrated using the fourth-order Runge-Kutta scheme with a time step of 0.05.

Setup
Twin experiments are conducted. The truth is represented by a free model run (nature run), meant to be tracked by the data assimilation system. The system is assumed to be fully observed (d = 40) every t, so that H k = I d , with the observation error covariance matrix R k = I d . The related synthetic observations are generated from the truth, and perturbed according to the same observation error prior. The performance of a scheme is measured by the temporal mean of a root mean square difference between a state estimate (x a ) and the truth (x t ). Typically, one averages the following analysis root mean square error (RMSE): over the data assimilation cycles. When this RMSE concerns the system state at present time, i.e., the state at the end of the DAW, we call it the filtering RMSE. When this RMSE concerns the state defined L t in the past, i.e., at the beginning of the DAW, we call it the smoothing RMSE. All data assimilation runs will extend over 10 5 cycles after a burn-in period of 5×10 3 cycles. This guarantees a sufficient convergence of the error statistics. Unless otherwise stated, the size of the ensemble used with the ensemble methods will be N = 20, which is greater than the size of the unstable subspace, and, in the case of this model, makes localization unnecessary.
In this context, we have chosen to implement the inflation using the finite-size counterparts of the filters/smoothers (Bocquet et al., 2011). For this model, except in quasilinear conditions ( t ∼ 0.01), this inflation leads to performances that are quantitatively very close to the same filter/smoother with optimally tuned uniform inflation (Bocquet et al., 2011;Bocquet and Sakov, 2012). In the following methods like EnKF/IEnKS/EnKS should be understood as EnKF/IEnKS/EnKS with optimally tuned uniform inflation, and will actually be implemented with a single run of the finite-size variants, i.e. EnKF-N/IEnKS-N/EnKS-N, which is much more economical. Any reader not interested in implementing the finite-size IEnKS (whose pseudo-code is presented in Algorithm 2), or IEnKS-N, can alternatively optimally tune the uniform inflation of an EnKF/IEnKS/EnKS to attain very similar results.

New experiments with the IEnKS
This section is meant to recall and extend to 4D-Var and the case S = L several numerical tests of Bocquet and Sakov (2013), before considering joint state and parameter estimation. The following five systems are compared:  Require: Same requirements as algorithm 1. ε N = 1. 12: -The SDA IEnKS, S = 1.
-The MDA IEnKS, S = 1. The {β k } 1≤k≤L are chosen to be uniform in the DAW and constant in time.
-The SDA IEnKS, with S equal to the length of the DAW S = L, so that the DAWs do not overlap. This approach is meant to be computationally economical, and is much more economical than the quasi-static case S = 1, since there is no overlapping of DAWs.
-4D-Var with a shift S = 1, corresponding to overlapping DAWs and quasi-static conditions. The gradient is obtained by finite differences, which is affordable and precise enough in this small dimensional context. The performance of 4D-Var strongly depends on the background statistics. Since the correlations in the Lorenz-95 system are rather short-ranged, the B-matrix is chosen diagonal. The performance of 4D-Var does not vary much if we introduce some correlation and offdiagonal terms. However, the scaling of the B-matrix is crucial in this context (Kalnay et al., 2007). The longer the DAW is, the smaller the scaling factor should be, since the first guess becomes more accurate. For each experiment, we tuned this scaling so as to obtain the best filtering analysis RMSE.
To avoid tuning inflation, the finite-size variants of the filters and smoothers are employed (SDA IEnKS-N, MDA IEnKS-N, EnKS-N). All EnKF and EnKS, and their finitesize variants in this article are based on the ensemble transform square root Kalman filter (Bishop et al., 2001;Hunt et al., 2007;Bocquet et al., 2011). These five data assimilation systems are compared in weakly nonlinear conditions ( t = 0.05) chosen to roughly represent synoptic scale meteorology dynamics (Lorenz and Emmanuel, 1998), and more nonlinear conditions ( t = 0.20 between updates). The time-averaged analysis RMSE is plotted in Fig. 3 for the former case, and in Fig. 4 for the latter case, as a function of the length of the DAW.
Let us first notice that the filtering performance of the EnKS is, by construction, given by that of the EnKF, whatever the length of the DAWs. This explains why the filtering RMSE of EnKS is constant, modulo statistical noise. When comparing the filtering performances of the EnKF/EnKS and 4D-Var, the conclusions of Kalnay et al. (2007) are reinforced. 4D-Var does not perform as well for short DAWs and performs better for long DAWs. In addition, we note that the same conclusion applies to the smoothing performance, even though the crossover point might be different.
Considering filtering as well as smoothing, the MDA IEnKS S = 1 significantly outperforms 4D-Var and the EnKF/EnKS in all regimes. The SDA IEnKS S = 1, also performs very well, but its performance wanes with longer DAWs, which is why the MDA IEnKS was introduced by   Bocquet and Sakov (2013). For very short DAWs (L = 1, 2 in the case t = 0.05), the performances of the SDA IEnKS S = 1 and MDA IEnKS S = 1 are equal (L = 1) or very close (L = 2). For intermediate DAW lengths, the SDA IEnKS S = 1 can slightly outperform MDA IEnKS S = 1. This is not surprising, since the SDA IEnKS algorithm is meant to be optimal for sufficiently short DAWs, whereas the MDA IEnKS algorithm is only guaranteed to be optimal in linear/Gaussian conditions. Practically, in weakly nonlinear conditions ( t = 0.05), the IEnKS S = 1 only requires one to two propagations of the ensemble within the DAW. Consistently, it was shown in Bocquet and Sakov (2013) that a linearized variant of the algorithm, requiring one propagation of the ensemble within the DAW to compute the sensitivity, performed just as well in these conditions. It is nevertheless tempting to check whether this cost can be reduced by using non-overlapping windows S = L, and performing the analysis every L t. This would divide the cost of model runs by L, but this effect might nevertheless be offset by an higher number of iterations required for the analysis.
Quite surprisingly, the SDA IEnKS S = L performs very well for DAWs of length smaller than 0.80 (about twice the doubling time of the Lorenz-95 model). It is useless beyond that length, which was to be expected since the background at the beginning of the DAW results from a long forecast within the DAW, as opposed to a forecast of only t in the quasistatic S = 1 case.
In stronger nonlinear conditions, the variational methods (4D-Var and IEnKS) easily outperform the EnKF/EnKS. In particular, 4D-Var outperforms the EnKF/EnKS as soon as the the DAW reaches L = 2.

Joint state and forcing F estimation
A twin experiment is conducted in a situation where F is unknown. The true model (nature run) has forcing F = 8. The model used for assimilation and forecast has the initial value F = 7.
In addition to the state variables, the forcing parameter F will be estimated as well. Hence, the state vector x ∈ R M with M = 40 will be extended to the joint vector of size M + P = 41, with its 41st entry being the forcing parameter. The persistence model will be assumed for the evolution of the model parameter.
Because the filters and smoothers used here are all deterministic, the only source of stochasticity to generate the variability in F comes from the initialization of the ensemble. The forcing parameter of a member is initialized to 7 + ε, where ε is independently drawn from a normal distribution of standard deviation 0.1. The augmented state IEnKS will be compared to several augmented state alternatives. Specifically, we shall consider in this experiment: -The ensemble Kalman filter (EnKF). corresponds to the optimal performance for the smoothing estimation of F by the EnKS. The forcing parameter is plotted in Fig. 5, over a 5 × 10 3cycle-long segment of the experiment. In the case of the EnKS, the smoothing estimator for F (at the beginning of the DAW) is plotted because it is better than the filtering estimate of F (at the end of the DAW). Because the persistence model is assumed for F , the smoothing and the filtering estimates of F are the same for the IEnKS and 4D-Var. In addition, because the true F is static, the smoothing and filtering RM-SEs should coincide. From Fig. 5, it is clear that the IEnKS significantly outperforms the EnKF and the EnKS.
The time-averaged analysis root mean square errors (RM-SEs) are computed over a much longer run of 10 5 cycles. The scores for the state variables are reported in Fig. 6. The filtering RMSEs (i.e., the RMSEs at present time) of the EnKF or of the EnKS for any L are, by construction, the same. The estimation of the forcing F is good enough that the performance is indistinguishable from the EnKF performance when F = 8 is known. Nevertheless, even in this weakly nonlinear regime, the IEnKS with L ≥ 1 outperforms the EnKF and EnKS. Confirming the results of Bocquet and Sakov (2013), the gap in the smoothing performance between the EnKS and the IEnKS significantly increases as L increases. In this weakly nonlinear regime, the number of iterations required by the IEnKS is close to one, and its performance equals that of the linearized IEnKS (Bocquet and Sakov, 2013).
The scores for the estimation of the forcing parameter are reported in Fig. 7. By construction, the filtering performance of the EnKF and the EnKS at any L is the same, approximately 0.018. The parameter smoothing RMSE for the EnKS is approximately 0.015 and is optimal for L ∼ 100. By construction, the analysis at present time and retrospective analysis of F by the IEnKS is the same. Even in the case L = 1, the so-called iterative ensemble Kalman filter (IEnKF) outperforms the EnKS with an RMSE of 0.013. With increasing L, this performance improves more and more, reaching the RMSE of 7.5 × 10 −4 for L = 50.
The estimation of 4D-Var only becomes better than that of the EnKF for DAWs of length L = 50. This counterperformance can only be explained by a poor specification of the error covariance matrix. Indeed, the scaling of the background error statistics for the state variables should be different from the scaling of the background error statistics for the parameter. However, the separate tuning of scalings requires additional work that the IEnKS does not require. This hypothesis will be checked in Sect. 5.

Numerical experiments with a coupled Lorenz-95tracer model
In this section we introduce a simple extension of the Lorenz-95 model with a tracer field advected by the Lorenz-95 field to represent an advective wind. This is meant to test the ability of the IEnKS to carry out joint state and parameter estimation in the dynamical context of an online atmospheric chemical model, with heterogeneous variables.

Extending the Lorenz-95 model
We shall think of the variables x m of the Lorenz-95 as wind speed and direction variables defined on the circle. A tracer field c m+ 1 2 , m = 1, . . . , M = 40 will be added to the model variables, for a total of 80 variables. These variables are defined on the circle using a C-grid. A schematic of the grid is shown below: The tracer is advected by the wind field of the Lorenz-95 model. We have chosen to use the simple Godunov upwind scheme, which is positive and conservative. It is quite diffusive but this diffusion could be seen as a feature of the modeled physics. The equations read: The tracer is emitted on the whole domain, and the emission fluxes are denoted E m+ 1 2 . It is deposited on the whole domain, using a simple scavenging scheme parameterized by a scavenging ratio λ. A stationary point of the dynamics is x m = F and c m+ 1 2 = E m+ 1 2 /λ. This provides orders of magnitude for the wind and concentration variables.
For simplicity, the emission flux will be made constant and uniform: E m+ 1 2 ≡ E. Obviously, however, a more complex setting with urban/rural/sea emission type and diurnal/nocturnal cycle could be chosen. The values of our reference simulation's parameters are λ = 0.1, and E = 1, so that the typical concentration value is 10.
The Courant-Friedrichs-Lewy (CFL) condition is almost always satisfied: the Lorenz-95 model variables |x m | very rarely exceed 15; by construction, one has t = 0.05 and x = 1, so that CFL ≤ 0.75 < 1. Free run simulations help one to understand some dynamical characteristics of the model. The model exhibits features of a realistic tracer model. For instance, consider two of the model's distinct trajectories in which the wind fields model trajectories are the same. It turns out that the concentrations c m+ 1 2 of the two trajectories converge with each other. The positive part of the Lyapunov spectrum of the model is consistently very close to that of the Lorenz-95 model, with a number of positive Lyapunov exponents equal to 13, and a much broader negative part of the Lyapunov spectrum. However, we observed that the relaxation time of such two trajectories is quite long (typically τ = 10), so that it seems difficult to break down the system into fast and slow dynamics.
A free run (after spin-up) is displayed in Fig. 8. The peaks of the tracer are correlated with the waves of the Lorenz-95, though not in an obvious way (see Sect. 5).
The causality and propagation of information in this model is special, and presumably similar to much more complex online atmospheric chemistry models. This impacts the effectiveness of data assimilation. For instance, measuring a tracer plume at t 0 (actually a peak in this one-dimensional context) does not enable one to detect a swift change in the local wind at t 0 . Only future observations of the tracer concentrations will enable a diagnosis of this change in the local wind. As a consequence, variational schemes such as 4D-Var and the IEnKS that work over larger DAWs appear to be ideal tools in this context.

Numerical tests
We have performed data assimilation tests of the IEnKS using this model in order to estimate winds and concentrations, and unknown parameters F and E. Initially, we had carried out the same test but estimating E and λ instead of F and E. Parameters E and λ are typical of the kind one would like to control in an atmospheric chemistry model to improve forecast and re-analysis, when they are not themselves the focus of interest (Bocquet, 2012). The results were quite similar to those presented here. Yet, because the deposition and emission are antagonistic processes, the inverse problem of estimating them is very ill posed, requiring a specific prior distribution for those two parameters. In the absence of such strongly constraining prior, 4D-Var's performance would be hampered. That is why we choose to estimate F and E instead.
One of the potential difficulties in data assimilation with this model is the positivity of the concentrations c m+ 1 2 ≥ 0 and of the parameters F ≥ 0 and E ≥ 0. This problem can be dealt with straightforwardly with 4D-Var, since the positivity of the variables can be enforced by the minimizer, or by a change of variables that is easy to implement in this context. The problem is more severe with the EnKF, since the Best Linear Unbiased Estimator (BLUE) analysis and the ensemble generation that are at the heart of the methods will generate negative concentrations or parameters. A simple but fairly effective trick is to perform clipping by setting all negative variables of the analysis to zero; this, however, is suboptimal and could also induce imbalances and harming positive biases. A more elegant solution is to perform an analytical anamorphosis (Cohn, 1997;Bocquet et al., 2010;Simon and Bertino, 2012), so that the BLUE analysis and the ensemble generation are carried out in a space where the variables are defined on R and their statistics are closer to a Gaussian. For instance, in our case one could perform the state augmentation using the extended state vector: Note that in practice this problem does not apply to F , which is well estimated and close to F = 8 because of a strong sensitivity of the model to F . The choice of ln(F ) as the parameter to be estimated is only justified by the need of an homogeneous error metric for the two parameters.
Our numerical tests (EnKF as well as IEnKS) showed that the anamorphosis on the concentration variables is useless for improving precision, and can even lead to instability. This is at variance with the findings of Simon and Bertino (2012) who applied anamorphosed analysis on a 1D ocean ecosystem model, and who also found benefit in using anamorphosis on the state variables. Choosing a more complex gamma or lognormal distribution for anamorphosis function would avoid favoring large concentration values, as does the instability-prone logarithm anamorphosis (L. Bertino, personal communication, 2013). Aside from the choice of the anamorphosis function, this difference can also be explained by the fact that static anamorphosis is more efficient on distributions that are not too dynamical.
In addition, we found that occurrences of negative concentrations in the analysis and the posterior ensemble are extremely rare in the present case. By contrast, we found that the anamorphosis on E is useful and avoids instabilities. Because parameters F and E are not observed, their anamorphosis is a mere change of variables (as in 4D-Var) that does not require much work. Therefore, in the following the extended state vector will be A twin experiment, similar to that described in Sect. 3, is performed with t = 0.05. The winds and the concentrations are fully observed, with R d = I d , d = M = 40, in the wind observation space as well as in the tracer concentration space. The observations are generated from the truth and perturbed according to these error statistics. All runs are performed over 10 5 cycles after a burn-in period of 5 × 10 3 cycles. The following methods are compared: -The SDA IEnKS S = 1.
-The MDA IEnKS S = 1. The {β k } 1≤k≤L are chosen to be uniform in the DAW and constant in time.
-4D-Var with S = 1, corresponding to overlapping windows, quasi-static conditions. The scaling of the background is tuned so as to minimize the global (on all 82 extended variables) RMSE.
To avoid tuning inflation, the finite-size variants are employed: SDA/MDA IEnKS-N and EnKS-N.  The time-averaged analysis RMSEs on the wind and concentration variables are plotted in Fig. 9 as a function of the DAW length. Both the mean filtering and smoothing RMSEs are reported. Again, the results are consistent with those of Kalnay et al. (2007). 4D-Var is not as precise as the EnKF/EnKS for short DAWs (L ≤ 20), but it outperforms the EnKF/IEnKF for large DAWs, in both filtering and smoothing. Moreover, the IEnKS significantly outperforms the EnKS/EnKF in all regimes and for both filtering and smoothing. In terms of performance, the difference between the SDA IEnKS and the MDA IEnKS is very similar to that reported in Sect. 3. However, the RMSE differences are much weaker, which may be explained by the doubled number of observations. The RMSEs of the logarithm of the two parameters, i.e., where F t = 8 and E t = 1 are plotted in Fig. 10 as a function of the DAW length. The filters and smoothers perform significantly better than 4D-Var. The EnKF/EnKS and 4D-Var remain quite far from the performance of the SDA and MDA IEnKS. This shows that smoothing over a large window and flow-dependent error statistics are both crucial for the parameter estimation.

Discussion
One (1 ≤ S ≤ 15), the IEnKS is performing well, and better that the EnKF/EnKS and 4D-Var. In the case of weak nonlinearity t = 0.05, only one iteration of the minimization is required on average for the computation of the sensitivities Y k,(j ) . Additionally, accounting for the propagation of the ensemble of the (ensemble) forecast step, an average of two propagations of the ensemble through the DAW is required. A further exploration of the computational performance of the IEnKS is out of scope of this article, but it seems quite promising for the success of the IEnKS with complex models.
In the numerical experiments, parameters F and E were chosen to be static. This type of parameters is frequently modelled in geophysical systems. Furthermore, they make 4D-Var and the IEnKS with large DAW ideal tools. When the parameters evolve in time, the variational methods may not perform as well as the EnKF, EnKS and IEnKS with smaller DAWs. In particular, the persistence model for the parameters becomes imperfect. We have repeated the same experiment as in Sect. 3.3, but with F varying in time according to a sinusoid and a step-wise function, within the interval [7.5; 8.5], with a period of one year (1456 time units of the . Not only is model error made intrinsic by incorporating parameter F in the control variables as has been done so far, but model error also becomes extrinsic because the assumed persistence model for F is wrong (permanently in the sinusoid case and intermittently in the step-wise case). Some results are displayed in Fig. 11. The evolution of the retrospective analysis of F is shown for the EnKF-N, the EnKS-N L = 50, the MDA IEnKS-N L = 50 S = 1, and 4D-Var L = 50 S = 1. The RMSEs are indicated in parenthesis in the legends. Although the IEnKS-N L = 50 remains the best performer in both cases, the gap in performance is narrower, because of the incorrect persistence assumption within the DAW. Let us remark that, in these cases, the RMSE of the retrospective analysis of the IEnKS is different from the RMSE of the filtering analysis because the truth that serves as a point of comparison changes within the DAW. Note also that because of the imperfection of the persistence model, a multiplicative inflation of 1.01 of the ensemble anomalies has been applied to the finite-size methods since they are not meant to intrinsically account for extrinsic model error (Bocquet et al., 2011), whereas the EnKF requires an inflation of 1.05 here to account for both model and sampling errors.
The last and main point of the discussion is dedicated to the improvement of the 4D-Var background and its comparison to the IEnKS. The background error statistics that determine the prior of variational methods, such as 4D-Var and the IEnKS, have less impact with longer data assimilation windows. These methods estimate the background state by a forecast of the analysis or of the posterior ensemble. Nevertheless, in the context of our low-order models, the IEnKS outperforms 4D-Var, especially in the joint state and parameter case. Therefore the difference should lie in the specification of the background error covariance matrix. It could be that the time-dependence of the background error statistics remains essential in the long DAWs length limit. Or it could be that the climatological statistics of the background in our implementation of 4D-Var is poorly specified.
To explore those hypotheses, we derived climatological statistics of the errors on the initial state of the DAW inferred from the SDA IEnKS-N. We first considered the Lorenz-95 model without parameter estimation. Since the system is statistically homogeneous, the error covariance matrix is circulant, so that is can be represented by a one-dimensional structure function that depends on the distance between sites on the circle. The correlation structure function is plotted in Fig. 12 instead of the full correlation matrix. When L is varied, the differences are small, except in the case L = 1 that shows slightly modified next-to-nearest correlations. The related covariance matrix, defined up to an optimally tuned scaling parameter, is used in 4D-Var as a new prior in place of the identity matrix. The new 4D-Var scores barely change from when using a covariance matrix proportional to the identity. This is consistent with the findings of Kalnay et al. (2007) who also tried, in a similar experimental context, to improve the performance of 4D-Var with a finer error covariance structure.
In a second experiment, we derived the climatological statistics of the errors on the initial extended state vector, in the Lorenz-95 case when F is unknown and estimated. The error covariance matrix turns out to be almost identical to the Lorenz-95 case with a fixed F = 8. The only difference is in the covariances that involve F . As expected, the correlation between the error on F and the error on any state variable is uniform. The covariance of the identity that was used in Sect. 3.3 is clearly not a good model for this case. Furthermore, the errors on F and the state variables are not homogeneous, so, again, choosing the error covariance matrix proportional to the identity matrix is not ideal. The climatological statistics of the errors have been inferred from the SDA IEnKS-N, when the state vector is augmented to incorporate F . It was done for each L because the ratio of the variances of the error on a typical state variable to the error on F is a non-uniform but increasing function of L. Using this procedure, we did not obtain real improvement in the state variable RMSE. However, the precision of the parameter estimation was remarkably improved to the level of the IEnKS-N. This shows that a fine specification of the background statistics is very helpful in the estimation of the static parameter F . Nevertheless, these statistics are static, and they do not significantly aid the estimation of the rapidly changing state variables.
In the last experiment, we applied the same procedure to the online tracer model based on the Lorenz-95 model. The error covariance matrix derived from the SDA IEnKS, for several L, entails significant covariances between the wind field and the concentration field. Again, because of the statistical homogeneity of the subsystems, one can represent the correlations of the winds and of the concentrations, and the cross-correlations between the winds and the concentrations, by using structure functions. These structure functions are displayed in Fig. 13, obtained from the IEnKS-N, L = 20.
We believe that the structure function of the crosscorrelations is non-symmetric because of the preferred orientation of the winds. The waves in the Lorenz-95 preferentially travel westward and create fronts of tracer on one preferred side of the wave, yielding a non-trivial crosscorrelation structure function. For 4D-Var, results similar to those from the previous experiment were obtained. Using the climatological priors, the errors of the state variables barely reduce. The fine correlations that build between the errors of the wind and concentration variables are dynamical and seem to be of little use when averaged in the climatological background error covariance matrix. However, the parameters are much better estimated. Nevertheless, unlike in the previous experiment, they do not quite match the precision of the IEnKS. For instance, in the L = 1 case, the 4D-Var RMSE of the logarithm of the parameters is reduced from 1.5 × 10 −1 to 1.3 × 10 −2 , but is still far from 1.0 × 10 −3 of IEnKS.

Conclusion
In this article, the iterative ensemble Kalman smoother (IEnKS) has been explored numerically. Using the Lorenz-95 low-order model, it has been compared to the ensemble Kalman filter (EnKF) and the standard ensemble Kalman smoother (EnKS). It has also been compared to a 4D-Var, for a wide range of data assimilation window (DAW) lengths. The IEnKS systematically outperformed the EnKF, EnKS and 4D-Var. This conclusion holds true even when the background error covariance matrix of 4D-Var is better specified and tuned. The IEnKS has been extended to joint state and parameter estimation, using the augmented state formalism. Endowed with assets of 4D-Var and ensemble Kalman methods, IEnKS appeared to us as an ideal candidate method to tackle such problems.
It was applied to the joint estimation of the Lorenz-95 state vector and its forcing parameterF . The IEnKS outperformed the EnKF, EnKS and 4D-Var for a wide range of lengths of the DAW. In addition, the estimation of F was shown to be even more precise compared to the standard methods.
Motivated by future applications of the IEnKS to atmospheric chemistry models where the estimation of the forcings is crucial, we introduced an extension of the Lorenz-95 model, adding a tracer field advected by the Lorenz-95 field. Key parameters of the tracer emission and deposition were also meant to be estimated. Again, the IEnKS managed to finely estimate the parameters without any tuning.
A better specification of the error covariance matrix of 4D-Var (obtained from the IEnKS) led to a spectacular improvement in the estimation of the static parameters. Yet, it did not help in the improvement of joint estimation of the state variables. This stresses the importance of the time-dependence of the error covariance matrix for rapidly varying variables. By contrast, the IEnKS completely avoids the need to build any background error covariance matrix.
One way to account for model error is to parameterize this error and estimate the related parameters, as was suggested in this study. Another way is to implement a weak constraint formulation of the underlying variational problem. However, this formulation remains to be defined for the IEnKS, whereas it is already implemented in the standard EnKS.
Following this study, we are planning to test the IEnKS on a more complex low-order model with several reactive species and test the estimation of the concentration variables as well as some parameters such as kinetic constants. If this is successful, our eventual plan is to implement the method on a high-dimensional air quality model. However, the definition of a satisfying implementation of localization in the IEnKS context will first be needed (work in progress).