Behavior of the iterative ensemble-based variational method in nonlinear problems

The behavior of the iterative ensemble-based data assimilation algorithm is discussed. The ensemble-based method for variational data assimilation problems, referred to as the 4D ensemble variational method (4DEnVar), is a useful tool for data assimilation problems. Although the 4DEnVar is derived based on a linear approximation, highly uncertain problems, in which system nonlinearity is significant, are solved by applying this method iteratively. However, the ensemble-based methods basically seek the solution within a lower-dimensional subspace spanned by the ensemble members. It is not necessarily trivial how high-dimensional problems can be solved with the ensemble-based algorithm which employs the lower-dimensional approximation based on the ensemble. In the present study, an ensemble-based iterative algorithm is reformulated to allow us to analyze its behavior in high-dimensional nonlinear problems. The conditions for monotonic convergence to a local maximum of the objective function are discussed in a high-dimensional context. It is shown that the ensemble-based algorithm can solve highdimensional problems by distributing the ensemble in different subspace at each iteration. The findings as the results of the present study were also experimentally supported.


Introduction
The 4D ensemble variational method (4DEnVar; Lorenc, 2003;Liu et al., 2008) is a useful tool for practical data assimilation. The 4DEnVar obtains the derivative of the objective function from the approximate Jacobian of a dynamical system model, which is estimated by using the ensemble of simulation results. In contrast with the adjoint method, the 4DEnVar does not require an adjoint code which is usually time-consuming to develop. This ensemble method thus allows us to treat the simulation code as a black box, and it can easily be implemented.
The 4DEnVar algorithm is derived based on a lowdimensional linear approximation of the high-dimensional nonlinear system model. If the uncertainties in state variables are small, then the solution could be found within the range where a linear approximation is valid. However, geophysical systems are often highly uncertain. If the scale of uncertainty is much larger than the range of linearity, a linear approximation would not be justified. In atmospheric applications, uncertainty can usually be reduced by taking sufficient spin-up time. On the other hand, in some geophysical applications, it is difficult to obtain a sufficiently long sequence of observations to allow spin-up. For example, in data assimilation for the interior of the Earth, such as lithospheric plates (e.g., Kano et al., 2015) and the outer core (e.g., Sanchez et al., 2019;Minami et al., 2020), the timescale of the system dynamics is so long that a sufficient length of an observation sequence is not feasible. It is also difficult to use a long sequence of observations in the Earth's magnetosphere where the amount of observations is limited (e.g., Nakano et al., 2008;Godinez et al., 2016). It is therefore an important issue to consider large uncertainties which could deteriorate the validity of the linear approximation.
Several studies have suggested that estimations in nonlinear problems can be improved by iterative algorithms in which the ensemble is repeatedly updated in each iteration (e.g., Gu and Oliver, 2007;Kalnay and Yang, 2010;Chen and Oliver, 2012;Sakov, 2013, 2014;Raanes et al., 2019). These iterative algorithms can be regarded as a S. Nakano: Ensemble variational method variant of the 4DEnVar method, based on an approximation of the Gauss-Newton method or the Levenberg-Marquardt method. The Gauss-Newton and the Levenberg-Marquardt methods are variants of the Newton-Raphson method for solving nonlinear least squares problems by using the Jacobian of a nonlinear function. Thus, when the Gauss-Newton or the Levenberg-Marquardt framework is strictly applied to data assimilation problems, the tangent linear of the system model is required. Indeed, if the tangent linear of the system model is obtained, 4D variational data assimilation problems can be solved with the incremental formulation (Courtier et al., 1994), which can be regarded as an instance of the Gauss-Newton framework (Lawless et al., 2005). The ensemble-based methods avoid computing the Jacobian of a nonlinear system model by a linear approximation using the ensemble. This ensemble-based approximation is justified if linearity can be assumed over the range where the ensemble members are distributed. However, the ensemblebased methods basically seek the solution within a lowerdimensional subspace spanned by the ensemble members. In many applications in atmospheric sciences, it has been demonstrated that the localization of the covariance matrix is useful for coping with high-dimensional problems (e.g., Buehner, 2005;Liu et al., 2009;Buehner et al., 2010;Yokota et al., 2016). However, it has not necessarily been clarified how general high-dimensional problems, in which the localization of the covariance matrix might not be appropriate, can be solved with the ensemble-based algorithm which employs the lower-dimensional approximation based on the ensemble.
The present study aims to reformulate an ensemble-based iterative algorithm in order to analyze its behavior in highdimensional nonlinear problems. We then explore the conditions for achieving monotonic convergence to a local maximum of the objective function in a high-dimensional nonlinear context. The monotonic convergence means that the discrepancies between estimates and observations are reduced in each iteration. It is ensured that the algorithm would attain a satisfactory result in high-dimensional problems if the ensemble is distributed in a different subspace at each iteration. This study is originally motivated by data assimilation into a geodynamo model to which the author contributed (Minami et al., 2020). However, the present paper focuses on the iterative variational data assimilation algorithm for general uncertain problems in order to avoid the discussion on specific physical processes of geodynamo. In Sect. 2, the formulation of the variational data assimilation problem is described. In Sect. 3, the basic idea of the ensemble variational method is explained. The iterative version is introduced as an algorithm for maximizing the log-likelihood function in Sect. 4, and the behavior of the iterative algorithm is evaluated in Sect. 5. In Sect. 6, a Bayesian extension is introduced. Section 7 experimentally verifies our findings. Finally, a discussion and conclusions are presented in Sect. 8.

The 4D variational data assimilation (4DEnVar)
In the following, the system state at time t k is denoted as x k , and the observation at t k is denoted as y k . We consider a strong-constraint data assimilation problem where the evolution of state x k is given by the following: and the relation between y k and x k is written in the following form: where w k indicates the observation noise. Assuming that w k obeys a Gaussian distribution with mean 0 and covariance matrix R k , then, in the following: The likelihood of x k given y k is as follows: Since we assume a deterministic system as stated in Eq. (1), h k (x k ) can be written as a function of an initial value x 0 as follows: where g k is the following composite function: The likelihood in Eq. (4) is then written as follows: When the prior distribution of x 0 is assumed to be Gaussian with a mean x 0,b and covariance matrix P 0,b defined by the following: the Bayesian posterior distribution of x 0 , given the whole sequence of observations from t 1 to t K , y 1:K , can be obtained as follows: The maximum of the posterior can be found by maximizing the following objective function: 3 Ensemble-based method The maximization of the objective function J is conventionally performed by the adjoint method, which differentiates J based on the adjoint matrix of the Jacobian of the function f k in Eq. (1). For a practical high-dimensional simulation model, however, it is an extremely laborious task to develop the adjoint code which represents the adjoint matrix of the Jacobian of the forward simulation model. The 4DEnVar is an alternative method for obtaining an approximate maximum of J without using the adjoint code. The 4DEnVar employs an ensemble of N simulation results x (1) 0:K , where x 0:K indicates the whole sequence of the states from t 0 to t K ; that is, The initial state of each ensemble member x (i) 0 is assumed to be sampled from the Gaussian distribution N (x 0 ; x 0,b , P 0,b ). The objective function in Eq. (10) is approximated by using this ensemble.
For convenience, we define the following matrix X 0,b from the initial states of ensemble members: Assuming that the optimal x 0 can be written as a linear combination of the ensemble members, we can write x 0 in the following form: This assumption means that x 0 is within the subspace spanned by the ensemble members. The quality of an estimate with the 4DEnVar can thus be poor if there are insufficient ensemble members. In practical applications of the 4DEnVar, a localization technique is usually used to avoid this problem (e.g., Buehner, 2005;Liu et al., 2009;Buehner et al., 2010;Yokota et al., 2016). However, the present paper does not consider localization because the focus here is on the basic behavior of the 4DEnVar. If we assume that the rank of X 0,b is N (< dimx 0 ) and approximate the inverse of P 0,b by the Moore-Penrose inverse matrix of X 0,b X T 0,b , the first term of the right-hand side of Eq. (10) can be approximated as follows: This corresponds to a low-rank approximation within the subspace spanned by the ensemble members. The prior mean x 0,b is usually given by the ensemble mean of x . In such a case, it is necessary to ignore the subspace along the vector 1 = (1· · ·1) T to reach the approximation of Eq. (13). The function g k (x 0 ) is approximated based on the first-order Taylor expansion as follows: where G k is the Jacobian of g k at x 0,b . The matrix G k X 0,b in Eq. (14) is approximated as follows: Defining the right-hand side of Eq. (15) as k , that is, in the following: we obtain a further approximation of the function g k (x 0 ) in Eq. (14) as follows: (e.g., Zupanski et al., 2008;Bannister, 2017). Using Eqs. (13) and (17), the objective function in Eq. (10) can be approximated as a function of w as follows: where we definedĴ w (w) = J (x 0,b + X 0,b w).
The approximate objective functionĴ w is a quadratic function of w, and it no longer contains the Jacobian of the function g k . The maximization ofĴ w is thus much easier than that of the original objective function in Eq. (10). The derivative ofĴ w with respect to w becomes the following: (Liu et al., 2008). The Hessian matrix ofĴ w is then obtained as follows: We can thus immediately find the value of w when maximiz-ingĴ w as follows: Insertingŵ into Eq. (12), we obtain an estimate of x 0 as follows:  Evensen, 1996;Evensen and van Leeuwen, 2000), although the whole sequence of observations is referred to in Eq. (21). Even if a large amount of data are used, it would not seriously affect the computational cost because the computation of the inverse matrix can be conducted in N-dimensional space. This is also an advantage of the ensemble-based method.

Iterative algorithm
Since Eqs. (21) and (22) do not require the Jacobian of the function g k , it can be applied as a post-process -provided that an ensemble of the simulation runs is prepared in advance. However, this solution, which maximizes the objective function in Eq. (18), relies on Eq. (15) which approximates the matrix G k X 0,b by using the ensemble. This approximation is based on the first-order approximation shown in Eq. (14). Where x 0 exhibits high uncertainty and x 0 − x 0,b can be large, this approximation appears to be invalid. Therefore, it is not guaranteed that the estimate with Eq. (22) provides the optimal x 0 which maximizes the original logposterior density function in Eq. (10), even if we accept that the solution is limited within the ensemble subspace.
Where the initially prepared ensemble is used, it is unlikely that a better solution than Eq. (22) could be achieved. We then consider an iterative algorithm which generates a new ensemble based on the previous estimate in each iteration. The algorithm introduced in the following is basically the same as the method referred to as the iterative ensemble Kalman filter Sakov, 2013, 2014), but we employ a formulation that allows evaluation of the behavior and a slight extension. To derive an algorithm analogous to that in the previous section, we at first consider the following log-likelihood function: instead of the log-posterior density function in Eq. (10). Maximization of the Bayesian-type objective function in Eq. (10) will be discussed in Sect. 6. In the following, we combine the vectors of the whole time sequence from t 1 to t K into one single vector; that is, y = y 1:K and g(x 0 ) = g 0:K (x 0 ). The covariance matrices R 1 , . . ., R K are also combined into one block diagonal matrix R, which satisfies the following: Accordingly, the log-likelihood function of Eq. (23) is rewritten as follows: In our iterative algorithm, the mth step starts with an ensemble of initial values x (1) 0,m−1 obtained in the neighbor of the (m − 1)th estimate x 0,m−1 . Typically, the ensemble is generated so that the ensemble mean is equal to x 0,m−1 ; that is, in the following: although it is not necessary to satisfy this equation. A simulation run initialized at x 0,m−1 ), and we obtain the ensemble of the simulation results . Defining the matrices as follows: we consider the following mth objective function: where σ m is an appropriately chosen parameter. This objective functionJ ,m is maximized when, in the following: and w m provides the mth estimate of x 0,m as follows: Unless converged, members of the next ensemble are generated in the neighbor of x 0,m so that x (i) 0,m − x 0,m 2 is small for each i, and we proceed to the next iteration. By iterating the above procedures until convergence, the optimalx 0 which maximizes J is attained.
The form ofJ ,m in Eq. (29) looks similar to that ofĴ w in Eq. (18). However, the meaning of the first term of Eq. (29) is different from that of the first term of Eq. (18). The first term of Eq. (18) corresponded to the Bayesian prior. On the other hand, the first term of Eq. (29) is a penalty term to ensure monotonic convergence, as explained later. After iterations until convergence, the contribution of this penalty term would decay, and the log-likelihood function in Eq. (23) is maximized in the end.
We can consider various ways to obtain an ensemble satisfying Eq. (26). Bocquet and Sakov (2013) proposed obtaining a matrix X m as a scalar multiple of X 0,b , as follows: where X 0,b is a matrix defined in Eq. (11). A new ensemble for next iteration is generated to satisfy the following: As discussed later, α m should be taken to be so small that a linear approximation is valid over the range of the ensemble dispersion. The value of α m can be fixed at a small value. Otherwise, α m may be reduced gradually in each iteration so that the spread of the ensemble eventually becomes small. We can also shrink the ensemble by using a similar scheme to the ensemble transform Kalman filter (Bishop et al., 2001;Livings et al., 2008), which obtains X m as outlined by Bocquet and Sakov (2012); that is, in the following: where T m is the ensemble transform matrix given as follows: In Eq. (35), I is the identity matrix and U m m U T m is the eigenvalue decomposition of the matrix σ −2 m T m−1 R −1 m−1 , where U m is an orthogonal matrix consisting of the eigenvectors, and the matrix m is a diagonal matrix of the eigenvalues.
If the ensemble is updated according to Eq. (32) or (34), the estimate of x 0 is constrained within the subspace spanned by the initial ensemble members x (1) 0,0 , . . ., x (N) 0,0 . We can avoid confining the ensemble within a subspace by randomly generating ensemble members from a Gaussian distribution with a mean x 0,m and variance Q m as follows: Although this method has a limitation when applying it to Bayesian estimation, as explained later, it would be effective if applicable. The iterative algorithm is summarized in Algorithm 1. The procedures in this iterative algorithm are similar to those in the ensemble-based multiple data assimilation method Reynolds, 2012, 2013), which aims to obtain the maximum of the Bayesian posterior function, especially if the ensemble is updated with Eq. (34). The multiple data assimilation method does not perform iterations until convergence, but it performs iterations only a few times to estimate the maximum of the posterior, although it can provide a biased solution in nonlinear problems (Evensen, 2018). In order to achieve the convergence to the maximum of the Bayesian posterior in our framework, the objective function in each iteration should be modified as discussed in Sect. 6.

Rationale of the algorithm
Equation (30) can be regarded as an approximation of the Levenberg-Marquardt method (e.g., Nocedal and Wright, 2006) for maximizing the log-likelihood function in Eq. (23) within the subspace spanned by x (30) can be regarded as an approximation of the Gauss-Newton method. Indeed, Sakov (2013, 2014) derived a similar algorithm as an approximation of the Levenberg-Marquardt method or the Gauss-Newton method. However, the Levenberg-Marquardt method basically requires the Jacobian of the function g k , G m−1 . Since the above iterative algorithm does not directly use G m−1 , it would not be trivial to determine how the convergence of this algorithm is achieved. This issue is explored in this section.
We hereinafter assume that g(x 0 ) is at least twice differentiable. The Taylor expansion up to the second-order term of J becomes the following: where G m−1 is the Jacobian at x 0,m−1 , and ∇ 2 g is a thirdorder tensor which consists of the Hessian matrix of each element of the vector-valued function g(x 0 ). As in Eq. (12), 98 S. Nakano: Ensemble variational method we assume the following: where X m−1 is obtained by Eq. (27) given the ensemble x 0,m−1 . We then have the following: In practical cases, the Jacobian matrix G m−1 is typically unavailable. Ensemble variational methods thus employ the first-order approximation in Eq. 15 for G m−1 X m−1 ; that is, in the following: where, in the following: To evaluate this approximation when x 0 has a large uncertainty, we consider the following expansion of g(x 0 ) for each ensemble member x If we consider a vector m−1 w m , it becomes the following: If G m−1 X m−1 w m , which is contained in the first-order term in Eq. (39), is approximated by m−1 w m , then this means that the second-and higher-order terms of the right-hand side of Eq. (43) are neglected. Indeed, this can be justified if the spread of the ensemble is taken to be small. In our iterative scheme, the ensemble spread can be tuned freely. Even if the scale of x (i) 0 − x 0,m−1 is very small, any x 0 , which may have a large uncertainty, can be represented by taking the scale of w m to be large according to Eq. (38). The nonlinear terms of the right-hand side of Eq. (43) are of the order of w m , while they are of the order of x (i) 0 − x 0,m−1 2 or higher order. Thus, if the spread of the ensemble is taken to be small, the nonlinear terms of Eq. (43) would be suppressed, and we obtain the following: Consequently, we can apply the approximation in Eq. (40) to Eq. (39). Defining a function J ,w m (w m ) as follows: J ,w m (w m ) gives an approximation of J (x 0,m−1 +X m−1 w m ) in Eq. (39) as follows: The fourth term on the right-hand side of Eq. (45) would not necessarily be suppressed, even if the ensemble variance were taken to be small, because it is of the order of w m 2 and of the order of x (i) 0 − x 0,m−1 2 . To control the effect of this term, we introduce the idea of the minorize-maximize algorithm (MM algorithm; Lange et al., 2000;Lange, 2016). The MM algorithm is a class of iterative algorithms which considers a surrogate function which minorizes the objective function φ(z) and maximizes the surrogate function. Although the Levenberg-Marquardt method can also be regarded as an instance of the MM algorithm, the generic idea of the MM algorithm gives striking insight into the behavior of the algorithm.
At the mth step of the MM algorithm, the surrogate function, given the (m − 1)th estimate z m−1 , ψ(z 0 |z m−1 ), is chosen to satisfy the following conditions: The mth estimate, z m , is obtained by maximizing the mth surrogate function, ψ(z|z m−1 ). Since z m obviously satisfies the following: it is guaranteed that the mth estimate is as good as or better than the (m − 1)th estimate. After iterations, z m converges to a stationary point z s of the objective function φ(z) (Lange, 2016). If the Hessian matrix of φ(z) is negative definite in a neighborhood of z s , the stationary point z s becomes a local maximum (e.g., Nocedal and Wright, 2006). Therefore, the estimate would monotonically converge to a local maximum of φ(z) by repeating iterations if the following conditions are met: the surrogate function ψ(z|z m ) is twice differentiable and satisfies Eqs. (47a) and (47b), and the Hessian of φ(z) is a negative definite in a neighborhood of the stationary point z s .
Here we consider the following surrogate function J † ,w m : which is a similar treatment to Böhning and Lindsay (1988). For a given , we can take σ 2 m so that the following inequality holds over w m < : where the equality holds if w m = 0. If σ 2 m is chosen so that the inequality (50) is satisfied, J † ,w satisfies the following: This means that J † ,w m can be used as a surrogate function for maximizing J ,w m (w m ) according to the MM algorithm. Since Eq. (49) is the same as Eq. (29), the maximum of J † ,w m is achieved when w m =ŵ m whereŵ m is given by Eq. (30). Obviously,ŵ m satisfies the following inequality: and therefore, if the approximation in Eq. (40) is valid, then we obtain the following result: where x 0,m is given by Eq. (31). The above discussion is valid regardless of the choice of the ensemble x  (34), then the ensemble is confined within a particular subspace spanned by the initial ensemble, and x 0,m−1 would converge to a stationary point in this subspace. According to Eq. (37), if the nonlinearity of g is not severe when ∇ 2 g is not dominant, the Hessian of J is negative definite in a region where y − g(x 0,m−1 ) is small enough. This suggests that the iterative algorithm in Sect. 4 would attain at least a local maximum of J in the subspace for weakly nonlinear problems if Eq. (36) is applicable. If the ensemble is updated according to Eq. (36), a stationary point is sought in a different subspace in each iteration. If Q m is full rank, J would increase until a point which can be regarded as a stationary point in any N-dimensional subspace, and x 0,m−1 would thus converge to a local maximum in the full vector space after infinite iterations.
Based on the foregoing, convergence to a local maximum of the objective function J can be achieved for weakly nonlinear systems if the ensemble variance is taken to be small enough. If the ensemble with large spread is used, the estimate can be biased due to the nonlinear terms in Eq. (43). Hence, an ensemble with small spread would provide a satisfactory result for weakly nonlinear systems where we can assume the Hessian of J is a negative definite over the region of interest. However, this iterative algorithm does not necessarily guarantee convergence to the global maximum. If there are multiple peaks in J , it might be effectual to start with an ensemble with a large spread to approach the global maximum. An ensemble with a large spread would grasp a large-scale structure of the objective function because the ensemble approximation of the Jacobian gives the gradient averaged over the region where the ensemble members are distributed under a certain assumption (Raanes et al., 2019). Even if the spread is taken to be large at first, convergence would eventually be achieved by reducing the ensemble spread in each iteration, as described in the previous section.

S. Nakano: Ensemble variational method
Our formulation refers to the result of a simulation run initialized at the (m−1)th estimate x 0,m−1 for obtaining the mth estimate in Eq. (31). On the other hand, in many studies, the ensemble mean of simulation runs g(x (i) 0,m−1 ) is used as a substitute for g(x 0,m−1 ). If the ensemble mean g(x (i) 0,m−1 ) is used, the ensemble in each iteration must be generated so as to satisfy Eq. (26), which is not required in our formulation; that is, the ensemble mean must be equal to the (m − 1)th estimate x 0,m−1 . It should also be kept in mind that some bias due to the nonlinear terms in Eq. (42) could be introduced when the ensemble mean is used instead of g(x 0,m−1 ). However, this bias could be suppressed by taking the ensemble spread to be small. Since the use of the ensemble mean would save the computational cost of one simulation run for each iteration, it might be a useful treatment for practical applications.
It is also important to appropriately choose the parameter σ 2 m . A sufficiently large σ 2 m guarantees that the mth estimate x 0,m is better than the previous estimate x 0,m−1 , and hence, convergence is stable. However, convergence speed will be degraded with large σ 2 m because x 0,m is strongly constrained by the penalty weighted with σ 2 m . Although there is no definitive way to determine this parameter, σ 2 m /2 should have a similar scale to the right-hand side of Eq. (50); that is, if the third-and higher-order terms are assumed to be negligible then, in the following: Since the right-hand side of Eq. (55) contains ∇ 2 g which comes from a nonlinear term of the function g, σ 2 m should be taken larger as system nonlinearity is more severe. This equation also suggests that σ 2 m should depend on the discrepancy between observation y and the (m − 1)th prediction g(x 0,m−1 ). Although ∇ 2 g is unknown in general, y − g(x 0,m−1 ) could be used as a guide for determining σ 2 m . The parameter σ 2 m should also be dependent on the variance of the ensemble. If an ensemble with a large spread is used, σ 2 m should be set as large accordingly.

Bayesian form
The algorithm in Sect. 4 maximizes the log-likelihood function in Eq. (23). However, sometimes it would be required that prior information is incorporated into the estimate in a Bayesian manner. We thus consider the following logposterior density function as the objective function: which is the same as Eq. (10), although the vectors of the whole time sequence are combined into a single vector for each y and g(x 0 ) as in Eq. (23). Equation (56) is proportional to the log-posterior distribution when p(x 0 ) and p(y|x 0 ) are assumed to be Gaussian. The Taylor expansion of Eq. (56) is as follows: Applying Eqs. (38) and (44), we obtain the following approximate objective function: As per Eq. (50), we can take σ 2 m so that the fifth and sixth terms on the right-hand side of Eq. (58) can be minorized by a quadratic function −(σ 2 m /2)w T m w m , and we obtain the following surrogate function which minorizes the function J w m as follows: which satisfies the following conditions: This function is maximized when, in the following: The mth estimate for x 0 is obtained as follows: Similar to Eq. (54), we obtain the following: Thus, x 0,m is a better estimate than x 0,m−1 if the approximation in Eq. (44) is valid. Generating the (m + 1)th ensemble around x 0,m , we can obtain the (m + 1)th surrogate function according to Eq. (59) and proceed to the next iteration. There are various methods for updating the ensemble including the methods mentioned in Sect. 4. Eq. (32) or (34) is convenient for practical problems because we can avoid computing the inverse of P 0,b in Eq. (62). When Eq. (32) is used for updating the ensemble, we can easily avoid computing the inverse of P 0,b by drawing initial ensemble members from the prior distribution N (x 0 ; x 0,b , P 0,b ). If initial ensemble members x (1) 0,0 , . . ., x (N) 0,0 obey the prior distribution, we can use the same approximation as Eq. (13); that is, in the following: Applying Eq. (63) recursively, x 0,m−1 can be reduced to the following: Inserting Eqs. (32) and (66) into Eq. (62) and applying Eq. (65), we obtain the following: Thus, we can avoid computing the inverse of P 0,b . Likewise, when Eq. (34) is used for updating the ensemble, we can apply Eq. (65) to avoid computing the inverse of P 0,b (See the Appendix).
As described in the previous section, the use of Eq. (32) of (34) confines the estimate x 0,m−1 within the subspace spanned by the initial ensemble. On the other hand, Eq. (36) enables us to seek the optimal value of x 0 in a different subspace in each iteration. We can then obtain the local maximum in the full vector space if Q m is taken to be full rank. It appears that a similar approximation to Eq. (67) is applicable if Q m is taken to be a scalar matrix of P 0,b . However, since this approximation considers a different approximate objective function in each iteration, monotonic convergence is not guaranteed. In order to ensure monotonic convergence, Eq. (36) requires the inverse of P 0,b in general. Nonetheless, if P −1 0,b can be obtained, the method with Eq. (36) would be helpful for improving the estimate.

Experiments
Preceding studies have already demonstrated the usefulness of the ensemble-based iterative algorithms for various data assimilation problems. Estimation with the ensemble update 102 S. Nakano: Ensemble variational method in Eq. (32) has been verified in detail (e.g., Bocquet and Sakov, 2014). The iterative algorithm ensemble update in Eq. (34) has also been demonstrated (e.g., Minami et al., 2020). Although it might not be necessary to show the ability of the ensemble-based iterative algorithm further, we here verify some properties suggested in the above discussion through twin experiments with a simple model rather than a practical model.
In this section, we employ the Lorenz 96 model (Lorenz and Emanuel, 1998), which is written by the following equations: for m = 1, . . ., M, where x −1 = x M−1 , x 0 = x M , and x M+1 = x 1 . The state dimension M was taken to be 40, and the forcing term f was taken to be 8. The true scenario was generated by running the model with a certain initial state. We here consider a weakly nonlinear problem. The assimilation window was accordingly taken as a short time interval 0 < t ≤ 8. It was assumed that all the state variables could be observed with a fixed time interval ( t = 0.1), and hence, 80 data were generated for each state variable. The observation noise for each variable was assumed to independently follow a Gaussian distribution with mean 0 and standard deviation 0.5. In each data assimilation experiment, the prior distribution was assumed to be a Gaussian distribution with mean 0 and variance ζ 2 I, N (0, ζ 2 I), where ζ = 5. We compare two ensemble updating methods of Eqs. (32) and (36). In applying Eq. (32), the initial ensemble x (1) 0,0 , . . ., x (N) 0,0 was drawn from a Gaussian distribution N (0, ε 2 I), where ε = 5 × 10 −6 and X 0 was obtained as follows: The matrix X m for each iteration was fixed at X m = X 0 , which corresponds to the setting in Eq. (32) with α m = 1. The discussion in Sect. 5 suggests that the penalty parameter σ 2 m should be determined according to Eq. (55). Although ∇ 2 g is unknown, we can say that σ 2 m should be related with the variance of the ensemble and the discrepancy between y and g(x 0,m−1 ). We thus gave σ 2 m as follows: where we tried two cases with δ = 1.5 × 10 −3 and δ = 1.5 × 10 −2 . Here the part of the square root of a quadratic form of y K − g K (x 0,m−1 ) was multiplied in order that σ 2 m was roughly proportional to y − g(x 0,m−1 ) , and tr T m−1 R −1 m−1 was for representing the variance of the ensemble.   Figures 1 and 2 show results with Eq. (32), where δ = 1.5 × 10 −3 and δ = 1.5 × 10 −2 , respectively. We took the ensemble size N to be 30, which is less than the state dimension, and performed the estimation 20 times with different seeds of a pseudo random number generator. The value of the objective function J in Eq. (56) for each iteration is plotted for each of 20 trials in these figures. When δ = 1.5 × 10 −3 , the value of J tended to increase more sharply than when δ = 1.5 × 10 −2 . However, J did not monotonically increase when δ = 1.5×10 −3 , while it monotonically increased when δ = 1.5×10 −2 . According to the discussion in Sect. 5, monotonic convergence is achieved when σ 2 m is taken to be large enough. However, convergence speed becomes slow when σ 2 m is large. The results in Figs. 1 and 2 thus confirmed our discussion on the convergence. However, the results shown in Figs. 1 and 2 did not converge to the same value, which means the results depended on the seeds of pseudo random numbers. This would indicate that a local maximum within a subspace spanned by the ensemble does not match the maximum in the full state vector space, and that the value of the local maximum depends on the subspace. Figures 3 and 4 show results with Eq. (36), where δ = 1.5 × 10 −3 and δ = 1.5 × 10 −2 , respectively. Again, the ensemble size N was taken to be 30, and the results of 20 trials with different seeds of pseudo random numbers are overplotted. Again, when δ = 1.5 × 10 −3 , the increase in J tended to be sharp, while it was not monotonic. On the other hand, when δ = 1.5×10 −2 , the increase in J was gradual but monotonic. In contrast with the results in Figs. 3 and 4, the values of J in different trials converged to the same value after about 15 iterations, as in the case with δ = 1.5 × 10 −3 shown in Fig. 3. In the case with δ = 1.5 × 10 −2 , the convergence was much slower, but the values of J converged to the same value as the case with δ = 1.5 × 10 −2 after about 80 iterations in all of the 20 trials (not shown). These results show that the maximum of the objective function in the full vector space can be reached by changing an ensemble in each iteration even if the ensemble does not span the full vector space. Figure 5 shows the convergence of the estimated time series to the true time series for one of the 40 state variables, x 1 , in one of the 20 trials in Fig. 4. The red line indicates the initial guess obtained by running the Lorenz 96 model started at x 0,0 . The yellow, green, and blue lines show the estimates after the second, 10th, and 30th iterations, respectively. The truth is indicated with the gray line, and the time series of the synthetic observations used in this experiment are overplotted with gray dots. Although the initial trajectory (red line) showed was obviously dissimilar to the true trajectory (gray line), the estimate was improved by repeating the iterations as also shown in Fig. 4. After 30 iterations, the es-  timate was very close to the truth, and the temporal evolution was well reproduced. Figure 6 shows the root mean square errors of the estimates, which means the root mean squares of the differences between the estimates and the true values over all the 40 variables at each time step in the same trial as in Fig. 5. Again, the red line indicates the initial guess, and the yellow, green, and blue lines show the estimates after the second, 10th, and 30th iterations, respectively. The errors were certainly reduced over the period of the assimilation by the iterations.
In order to closely investigate the effect of σ 2 m , we conducted additional experiments for a case in which nonlinearity is a little stronger. While Figs. 3 and 4 show the results when the assimilation window was taken as 0 < t ≤ 8, Figs. 7 and 8 show the results with a little longer assimilation window, 0 < t ≤ 10. Although the other settings were the same  as Figs. 3 and 4, the effect of the nonlinearity on the objective function J was a little more severe due to the longer assimilation window. When δ = 1.5×10 −3 , the J value converged to about −2000 in many of the 20 trials. In some trials, however, J did not converge but oscillated below −6000. In contrast, when δ was as large as 1.5 × 10 −2 , the J value converged to the same value after about 50 iterations in all of the 20 trials. As discussed in Sect. 5, a sufficiently large σ 2 m guarantees that the estimate is improved in each iteration. Although the convergence speed becomes worse, a stable estimation can be attained.
We also conducted experiments with a higher-dimensional system. The method with a randomly generated ensemble was applied to the Lorenz 96 model with 400 variables (M = 400), of which the dimension is 10 times higher. Figure 9 shows the result with 400 variables. The assimilation Figure 8. The value of the objective function J for each iteration for 20 trials of the estimation. The ensemble was updated using Eq. (36) with δ = 1.5 × 10 −2 , and the ensemble window was taken as 0 < t ≤ 10. Figure 9. The value of the objective function J for each iteration for 20 trials of the estimation for the 400 dimensional system. The ensemble was updated using Eq. (36) with δ = 1.5 × 10 −3 . was taken as 0 < t ≤ 8, and δ was set at 1.5 × 10 −3 , which is the same as the result in Fig. 3. The ensemble size N was taken to be 200. For the assimilation into the Lorenz 96 model with 400 variables, the convergence was attained in about 20 iterations with 200 ensemble members. In this highdimensional case, monotonic convergence was achieved even if δ was taken to be as small as in Fig. 3. As far as we conducted experiments with the Lorenz 96 model with various dimensions, the convergence becomes stabler as the state dimension M becomes higher. This might imply that the nonlinear term (the fifth term of the right-hand side of Eq. 58) is depressed in the high-dimensional Lorenz 96 models. However, we have not resolved the reason for the stable convergence for the high-dimensional Lorenz 96 systems at present.  In Fig. 10, we confirmed the convergence of the estimated time series to the true trajectory for one of the 400 state variables, x 1 , in one of the 20 trials in Fig. 9. The red line indicates the initial guess started at x 0,0 , and the yellow, green, and blue lines shows the estimates after the second, 10th, and 30th iterations, respectively. The gray line shows the truth, and the gray dots show the synthetic observations used in this experiment. Similar to Fig. 5, the estimate approached to the truth by repeating the iterations even in this high-dimensional case, and the true evolution was well reproduced after 30 iterations. In Fig. 11, the root mean square errors over all the 400 variables at each time step are in the same trial as Fig. 10. Each color corresponds to the respective iteration shown in Fig. 10. It is confirmed that the errors decreased over the period of the assimilation after repeating the iterations.

Discussion and conclusions
The ensemble variational method is derived under the assumption that a linear approximation of a dynamical system model is valid over a range of uncertainty. This linear approximation is not valid in such problems where the scale of uncertainty is much larger than the range of linearity. However, a local maximum of the log-likelihood or log-posterior function can be attained by updating the ensemble iteratively -even in cases with a large uncertainty. The present paper assessed the influence of system nonlinearity on this iterative algorithm after considering the nonlinear terms of the system function g. The discussion suggests two points to guarantee the monotonic convergence to a local maximum in the subspace spanned by the ensemble. One is that the ensemble spread must be set to be small, and the other is that the penalty parameter σ 2 m must be set to be large enough. A sufficiently large σ 2 m would ensure monotonic convergence, although convergence speed would become poorer with too large a σ 2 m . The effect of this penalty term has also been experimentally confirmed in Sect. 7. These properties would be reasonable if this iterative algorithm is regarded as an approximation of the Levenberg-Marquardt method.
In applying the iterative algorithm discussed in this paper, the choice of the parameter σ 2 m would be an important issue. Although it was determined according to Eq. (70) in Sect. 7, Eq. (70) still requires tuning of the parameter δ. However, it is not necessary to finely tune δ because δ would not have a crucial effect on the performance of the algorithm. It would thus be enough to roughly determine δ. In addition, one could check whether the objective function J increases or not at each iteration just by running one forward simulation initialized at x 0,m . In the iterative algorithm, the most computational cost is spent on running the ensemble simulation with multiple initial states. The pilot run, which is computationally much cheaper than the ensemble run, would be a feasible way of tuning δ in practical cases.
One issue peculiar to the ensemble-based method is the rank deficiency, which occurs when the ensemble size is smaller than the dimension of the initial state x 0 . If the ensemble is confined within a particular subspace, the iterative algorithm can only attain the optimal value within the subspace spanned by the ensemble. However, our discussion indicates that, if σ 2 m is sufficiently large, then it is ensured that the discrepancies between estimates and observations are reduced in each iteration -even if the ensemble is confined within a subspace. If the ensemble is updated so as to span a different subspace in each iteration, as indicated in Eq. (36), the optimal solution would be sought in a different subspace in each iteration, and the estimate would converge to a local maximum in the full vector space after infinite iterations. It should be noted that this paper has not assessed how well the method with a random ensemble generation with Eq. (36) works in practical high-dimensional problems, although it is, in theory, applicable as far as the inverse of P 0,b is available.

S. Nakano: Ensemble variational method
Further research would be required to clarify its performance in high-dimensional geophysical models in order to reinforce this study.
Compared with the adjoint method, which is a conventional variational method for 4D variational problems, the convergence rate of this iterative method would be poorer because it employs an ensemble approximation within a lowerdimensional subspace at each iteration. Nonetheless, we can say that the iterative ensemble-based method is potentially useful because it is much easier to implement. While the adjoint method requires an adjoint code which is usually timeconsuming to develop, the ensemble-based method can solve the same problem without requiring an adjoint code. This paper mainly considers data assimilation problems. However, the framework of the iterative ensemble variational method is also applicable to general nonlinear inverse problems as far as the Gaussian assumption in Eq. (23) or Eq. (56) is upheld. If an ensemble of the results of forward runs is available, many practical problems can readily be addressed. This method could therefore be a promising tool for data assimilation and various inverse problems.

Appendix A: Algorithm for Bayesian estimation with ensemble transform
In the following, it is described how the iteration can be performed without computing the inverse of P 0,b when the ensemble is updated with the ensemble transform scheme in Eq. (34). When the ensemble is updated by the ensemble transform in the manner of Eq. (34) as follows: the transform matrix T m should be given as follows: where U m and m are obtained by the following eigenvalue decomposition: If X m−1 is obtained according to Eq. (A1), as follows: Defining the matrix C m−1 as follows: X m−1 can be written as follows: If the initial ensemble is sampled from the prior distribution N (x 0 ; x 0,b , P 0,b ), then we can apply Eq. (65) again. Using Eqs. (65) and (A6), the term X T m−1 P −1 0,b X m−1 in Eq. (62) can be reduced to the following: The mth estimate is broken down as follows: where C 0 = I. Defining a vector ξ m−1 is as follows: then Eq. (A8) becomes the following: x 0,m−1 = x 0,b + X 0 ξ m−1 , and we obtain the following: Using Eqs. (65) and (A11), we can rewrite Eq. (62) into a form without the inverse of P 0,b as follows: The algorithm with the ensemble transform is summarized in Algorithm 2.
Competing interests. The author declares that there is no conflict of interest.
Financial support. This research has been supported by the Japan Society for the Promotion of Science (grant no. 17H01704) and by PRC JSPS CNRS, the Bilateral Joint Research Project "Forecasting the geomagnetic secular variation based on data assimilation".
Review statement. This paper was edited by Amit Apte and reviewed by three anonymous referees.