Ensemble Variational Assimilation as a Probabilistic Estimator . Part I : The linear and weak non-linear case

Data assimilation is considered as a problem in Bayesian estimation, viz. determine the probability distribution for the state of the observed system, conditioned by the available data. In the linear and additive Gaussian case, a Monte-Carlo sample of the Bayesian probability distribution (which is Gaussian and known explicitly) can be obtained by a simple procedure : perturb the data according to the probability 5 distribution of their own errors, and perform an assimilation on the perturbed data. The performance of that approach, called here Ensemble Variational Assimilation (EnsVAR) also known as Ensemble of Data Assimilations or (EDA) , is studied in this two-part paper on the non-linear low-dimensional Lorenz-96 chaotic system, the assimilation being performed by the standard variational procedure. In this first part, EnsVAR is implemented first, for reference, in a linear and Gaussian case, and then in a weakly non-linear 10 case (assimilation over 5 days of the system). The performances of the algorithm, considered either as a probabilistic or a deterministic estimator, are very similar in the two cases. Additional comparison shows that the performance of EnsVAR is better, both in the assimilation and forecast phases, than that of standard algorithms for the ensemble Kalman Filter and Particle Filter (although at a higher cost). Globally similar results are obtained with the Kuramoto-Sivashinsky equation. 15


Introduction
The purpose of assimilation of observations is to reconstruct as accurately as possible the state of the system under observation, using all the relevant available information.In geo-physical fluid applications, such as meteorology or oceanography, that relevant information essentially consists of the physical observations and of the physical laws which govern the evolution of the atmosphere or the ocean.Those physical laws are in practice available in the form of a discretized numerical model.Assimilation is therefore the process by which the observations are combined together with a numerical model of the dynamics of the observed system in order to obtain an accurate description of the state of that system.
All the available information, the observations as well as the numerical model, is affected (and, as far as we can tell, will always be affected) with some uncertainty, and one may wish to quantify the resulting uncertainty in the output of the assimilation process.If one chooses to quantify uncertainty in the form of probability distributions (see e.g.Jaynes, 2004, or Tarantola, 2005, for a discussion of the problems which underlie that choice), assimilation can be stated as a problem in Bayesian estimation.Namely, determine the probability distribution for the state of the observed system, conditioned by the available information.That statement makes sense only under the condition that the available information is described from the start in the form of probability distributions.We will not discuss here the difficult problems associated with that condition (see Tarantola, 2005, for such a discussion) and will assume below that it is verified.
There is one situation in which the Bayesian probability distribution is readily obtained in analytical form.That is when the link between the available information on the one hand, and the unknown system state on the other, is linear, and affected by additive Gaussian error.The Bayesian probability distribution is then Gaussian, with explicitly known expectation and covariance matrix (see Sect. 2 below).
Published by Copernicus Publications on behalf of the European Geosciences Union & the American Geophysical Union.Now, the very large dimension of the numerical models used in meteorology and oceanography (that dimension can lie in the range 10 6 to 10 9 ) forbids explicit description of probability distributions in the corresponding state spaces.A widely used practical solution is to describe the uncertainty in the form of an ensemble of points in state space, with the dispersion of the ensemble being meant to span the uncertainty.Two main classes of algorithms for ensemble assimilation exist at present.The ensemble Kalman filter (EnKF), originally introduced by Evensen (1994) and further studied by many authors (Evensen, 2003and Houtekamer and Mitchell, 1998, 2001), is a heuristic extension to large dimensions of the standard Kalman filter (KF) Kalman (1960).The latter exactly achieves Bayesian estimation in the linear and Gaussian case that has just been described.It explicitly determines the expectation and covariance matrix of the (Gaussian) conditional probability distribution and evolves those quantities in time, updating these with new observations as they become available.
The EnKF, contrary to the standard KF, evolves an ensemble of points in state space.One advantage is that it can be readily, if empirically, implemented on non-linear dynamics.On the other hand, it keeps the same linear Gaussian procedure as KF for updating the current uncertainty with new observations.EnKF exists in many variants and, even with ensemble sizes of relatively small size (O(10-100)), produces results of high quality.It has now become, together with variational assimilation, one of the two most powerful algorithms used for assimilation in large-dimension geophysical fluid applications.
Concerning the Bayesian properties of EnKF, Le Gland et al. (2011) have proven that, in the case of linear dynamics and in the limit of infinite ensemble size, EnKF achieves Bayesian estimation, in that it determines the exact (Gaussian) conditional probability distribution.In the case of nonlinear dynamics, EnKF has a limiting probability distribution, which is not in general the Bayesian conditional distribution.
Contrary to EnKF, which was from the start developed for geophysical applications (but has since extended to other fields), particle filters (PFs) have been developed totally independently of such applications.They are based on general Bayesian principles and are thus independent of any hypothesis of linearity or Gaussianity (see Doucet et al., 2000, 2001, and van Leeuwen, 2017, for more details).Like the EnKF, they evolve an ensemble of (usually weighted) points in state space and update them with new observations as these become available.They exist in numerous variants, many of which have been mathematically proven to achieve Bayesianity in the limit of infinite ensemble size (Crisan and Doucet, 2002).On the other hand, no results exist to the authors' knowledge in the case of finite ensemble size.They are actively studied in the context of geophysical applications as presented in van Leeuwen (2009van Leeuwen ( , 2017)), but have not at this stage been operationally implemented on largedimension meteorological or oceanographical models.
There exist at least two other algorithms that can be utilized to build a sample of a given probability distribution.The first one is the acceptance-rejection algorithm described in Miller et al. (1999).The other one is the Metropolis-Hastings algorithm (Metropolis et al., 1953), which itself possesses a number of variants (Robert, 2015).These algorithms can be very efficient in some circumstances, but it is not clear at this stage whether they could be successfully implemented in large-dimension geophysical applications.
Coming back to the linear and Gaussian case, not only, as said above, is the (Gaussian) conditional probability distribution explicitly known, but a simple algorithm exists for determination of independent realizations of that distribution.In succinct terms, perturb additively the data according to their own error probability distribution, and perform the assimilation for the perturbed data.Repetition of this procedure on successive sets of independently perturbed data produces a Monte Carlo sample of the Bayesian posterior distribution.
The present work is devoted to the study of that algorithm, and of its properties as a Bayesian estimator, in non-linear and/or non-Gaussian cases.Systematic experiments are performed on two low-dimensional chaotic toy models, namely the model defined by Lorenz (1996) and the Kuramoto-Sivashinsky (K-S) equation (Kuramoto andTsuzuki, 1975, 1976).Variational assimilation, which produces the Bayesian expectation in the linear and Gaussian case, and is routinely, and empirically, implemented in non-linear situations in operational meteorology, is used for estimating the state vector for given (perturbed) data.The algorithm is therefore called ensemble variational assimilation, abbreviated to EnsVAR.This algorithm is not new.There exist actually a rather large number of algorithms for assimilation that are variational (at least partially) and build (at least at some stage) an ensemble of estimates of the state of the observed system.A review of those algorithms has been recently given by Bannister (2017).Most of these algorithms are actually different from the one that is considered here.They have not been defined with the explicit purpose of achieving Bayesian estimation and are not usually evaluated in that perspective.
EnsVAR, as defined here, has been specifically studied under various names and in various contexts by several authors (Oliver et al., 1996;Bardsley, 2012;Bardsley et al., 2014;Liu et al., 2017).Bardsley et al. (2014) have extended it in to what they call the randomize-then-optimize (RTO) algorithm.These works have shown that EnsVAR is not in general Bayesian in the non-linear case, but can nevertheless lead to a useful estimate.
EnsVAR is also used operationally at the European Centre for Medium-Range Weather Forecasts (ECMWF) (Isaksen et al., 2010) in the definition of the initial conditions of ensemble forecasts.It is also used, both at ECMWF and at Météo-France (see respectively Bonavita et al., 2016, andBerre et al., 2015), under the name ensemble of data assimilations (EDA) for defining the background error covariance matrix of the variational assimilation system.And ECMWF, in its latest reanalysis project ERA5 (Hersbach and Dee, 2016) uses a low-resolution ensemble of data assimilations system in order to estimate the uncertainty in the analysis.
None of the above ensemble methods seems however to have been systematically and objectively evaluated as a probabilistic estimator.That is precisely the object of the present two papers.
The first of these is devoted to the exactly linear and weakly non-linear cases, and the second to the fully nonlinear case.In this first one, Sect. 2 describes in detail the EnsVAR algorithm, as well as the experimental set-up that is to be used in both parts of the work.Section 3 describes the statistical tests to be used for objectively assessing EnsVAR as a probabilistic estimator.EnsVAR is implemented in Sect.4, for reference, in an exactly linear and Gaussian case in which theory says it achieves exact Bayesian estimation.It is implemented in Sect. 5 on the non-linear Lorenz system, over a relatively short assimilation window (5 days), over which the tangent linear approximation remains basically valid and the performance of the algorithm is shown not to be significantly altered.Comparison is made in Sect.6 with two standard algorithms for EnKF and PF.Experiments performed on the Kuramoto-Sivashinsky equation are summarized in Sect.7. Partial conclusions, valid for the weakly non-linear case, are drawn in Sect.8.
The second part is devoted to the fully non-linear situation, in which EnsVAR is implemented over assimilation windows for which the tangent linear approximation is no longer valid.Good performance is nevertheless achieved through the technique of quasi-static variational assimilation (QSVA), defined by Pires et al. (1996) and Järvinen et al. (1996).Comparison is made again with EnKF and PF.
The general conclusion of both parts is that EnsVAR can produce good results which, in terms of performance as a probabilistic estimator and of numerical accuracy, are at least as good as the results of EnKF and PF.
In the sequel of the paper we denote by N (m, P) the multivariate Gaussian probability distribution with expectation m and covariance matrix P (for a univariate Gaussian probability distribution, we will use the similar notation N (m, r)).E will denote statistical expectation, and Var will denote variance.

The method of ensemble variational assimilation
We assume the available data make up a vector z, belonging to data space D with dimension N z , of the form In this expression, x is the unknown vector to be determined, belonging to state space S with dimension N x , while is a known linear operator from S into D, called the data operator and represented by an N z × N x matrix.As for the N z vector ζ , we will call it an "error", even though it may not represent an error in the usual sense, but any form of uncertainty.It is assumed to be a realization of the Gaussian probability distribution N (0, ) (in case the expectation E(ζ ) were nonzero, but known, it would be necessary to first unbias the data vector z by subtracting that expectation).It should be stressed that all available information about x is assumed to be included in the data vector z.For instance, if one, or even several, Gaussian prior estimates N (x b , P b ) are available for x, they must be introduced as subsets of z, each with N x components, in the form In those conditions the Bayesian probability distribution P (x|z) for x conditioned by z is the Gaussian distribution N (x a , P a ) with At first glance, the above equations seem to require the invertibility of the N z × N z matrix and then of the N x × N x matrix T −1 .Without going into full details, the need for invertibility of is only apparent, and invertibility of T −1 is equivalent to the condition that the data operator is of rank N x .This in turn means that the data vector z contains information on every component of x.This condition is known as the determinacy condition.It implies that N z ≥ N x .We will call p = N z − N x the degree of over-determinacy of the system.
The conditional expectation x a can be determined by minimizing the following scalar objective function defined on state space S In addition, the covariance matrix P a is equal to the inverse of the Hessian of J In the case where the error ζ , while still being random with expectation 0 and covariance matrix , is not Gaussian, the vector x a defined in Eq. ( 2) is not the conditional expectation of x for a given z, but only the least-variance linear estimate, or best linear unbiased estimate (BLUE), of x from z.Similarly, the matrix P a is no longer the conditional covariance matrix of x for a given z, but the covariance matrix of the estimation error associated with the BLUE, averaged over all realizations of the error ζ .
Minimization of Eq. ( 3) can also been performed, at least in favourable circumstances, with a non-linear data operator .This is what is done, heuristically but with undisputable usefulness, in meteorological and oceanographical variational assimilation.The latter is routinely implemented in a number of major meteorological centres on non-linear dynamical models with non-linear observation operators.For more on minimization of objective functions of Eq. ( 3) with non-linear , see e.g.Chavent (2010).
Coming back to the linear and Gaussian case, consider the perturbed data vector z = z + ζ , where the perturbation ζ has the same probability distribution N (0, ) as the error ζ .It is easily seen that the corresponding estimate is distributed according to the Gaussian posterior distribution N (x a , P a ) (Eq. 2).This defines a simple algorithm for obtaining a Monte Carlo sample of that posterior distribution.Namely, perturb the data vector z according to its own error probability distribution, compute the corresponding estimate (Eq.5), and repeat the same process with independent perturbations on z.
That is the ensemble variational assimilation, or EnsVAR, algorithm that is implemented below in non-linear and non-Gaussian situations, with the analogue of the estimate x a being computed by minimization of Eq. (3).In general, this procedure, as already mentioned in the introduction, does not achieve Bayesian estimation, but it is interesting to study the properties of the ensembles thus obtained.
Remark.In the case when, the data operator being linear, the error ζ in Eq. ( 1) is not Gaussian, the quantity x a defined by Eq. ( 5) has expectation x a (BLUE) and covariance matrix P a (see Isaksen et al., 2010).The probability distribution of the x a is in general not Bayesian, but it has the same expectation and covariance matrix as the Bayesian distribution corresponding to a Gaussian ζ .
All the experiments presented in this work are of the standard identical twin type, in which the observations to be assimilated are extracted from a prior reference integration of the assimilating model.And all experiments presented in this first part are of the strong-constraint variational assimilation type, in which the temporal sequence of states produced by the assimilation are constrained to satisfy exactly the equations of the assimilating model.
That model, which will emanate from either the Lorenz or the Kuramoto-Sivashinsky equation, will be written as where x t is the model state at time t, belonging to model space M, with dimension N (in the strong-constraint case considered in this first part, the model space M will be identical with the state space S).For each model, a "truth", or reference, run x r t has first been produced.A typical (strongconstraint) experiment is as follows.
Choosing an assimilation window [t 0 , t T ] with length T (it is mainly the parameter T that will be varied in the experiments), synthetic observations are produced at successive times (t 0 < t 1 < . . .< t k < . . .< t K = t T ), of the form where H k is a linear observation operator, and k ∼ N (0, R k ) is an observation error.The k 's are taken to be mutually independent.
The following process is then implemented N ens times i. Perturb the observations y k , k = 0, . .., K according to where δ k ∼ N (0, R k ) is an independent realization of the same probability distribution that has produced k .The notation stresses, as in Eq. ( 5), the perturbed character of (y iens k ) .
ii. Assimilate the perturbed observations y iens k by minimization of the following objective function: where ξ k is the value at time t k of the solution of Eq. ( 6) emanating from ξ 0 .
The objective function (Eq.9) is of type (Eq.3), with the state space S being the model space M (N = N x ) and the data vector z consisting of the concatenation of the K + 1 perturbed data vectors y iens k .The process (i)-(ii), repeated N ens times, produces an ensemble of N ens model solutions over the assimilation window [t 0 , t T ].
In the perspective taken here, it is not the properties of those individual solutions that matter the most, but the properties of the ensemble considered as a sample of a probability distribution.
The ensemble assimilation process, starting from Eq. ( 7), is then repeated over N win assimilation windows of length T (taken sequentially along the true solution x r t ).In variational assimilation as it is usually implemented, the objective function to be minimized contains a so-called background term at the initial time t 0 of the assimilation window.That term consists, together with an associated error covariance matrix, of a climatological estimate of the model state vector, or of a prior estimate of that vector at time t 0 coming from assimilation of previous observations.An estimate of the state vector at t 0 is explicitly present in Eq. ( 9), in the form of the perturbed observation y iens 0 .But that is not a background term in the usual sense of the expression.In particular, no cycling of any type is performed from one assimilation window to the next.The question of a possible cycling of ensemble variational assimilation will be discussed in Part 2 (Jardak and Talagrand, 2018).
The covariance matrix R k in Eq. ( 9) is the same as the covariance matrix of the perturbations δ k in Eq. ( 8).The situation in which one used in the assimilation assumed statistics for the observation errors that were different from the real statistics has not been considered.
We sum up the description of the experimental procedure and define precisely the vocabulary to be used in the sequel.The output of one experiment consists of N win ensemble variational assimilations.Each ensemble variational assimilation produces, through N ens minimizations of form (Eq. 9), or individual variational assimilations, an ensemble of N ens model solutions corresponding to one set of observations y k (k = 0, • • •, K) over one assimilation window.These model solutions will be simply called the elements of the ensemble.The various experiments will differ through various parameters and primarily the length T of the assimilation windows.
The minimizations (Eq.9) are performed through an iterative limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm (Nocedal and Wright, 2006), started from the observation y 0 at time t 0 (which, as said below, is taken here as bearing on the entire state vector x r 0 ).Each step of the minimization algorithm requires the explicit knowledge of the local gradient of the objective function J iens with respect to ξ 0 .That gradient is computed, as usual in variational assimilation, through the adjoint of Eq. ( 6).Unless specified otherwise, the size of the assimilation ensembles will be N ens = 30, and the number N win of ensemble variational assimilations for one experiment will be equal to 9000.

The validation procedure
We recall the general result that, among all deterministic functions from data space into state space, the conditional expectation z → E(x|z) minimizes the variance of the estimation error on x.
What should ideally be done here for the validation of results is objectively assess (if not on a case-by-case basis, at least in a statistical sense) whether the ensembles produced by EnsVAR are samples of the corresponding Bayesian probability distributions.In the present setting, where the probability distribution of the errors k in Eq. ( 7) is known, and where a prior probability distribution is also known, through the observation y 0 , for the state vector x 0 , one could in principle obtain a sample of the exact Bayesian probability distribution by proceeding as follows.
Through repeated independent realizations of the process defined by Eqs. ( 6) and ( 7), build a sample of the joint probability distribution for the couple (x, z).That sample can then be read backwards for a given z and, if large enough, will produce a useful sample estimate of the corresponding Bayesian probability distribution for x.That would actually solve numerically the problem of Bayesian estimation.But it is clear that the sheer numerical cost of the whole process, which requires explicit exploration of the joint space (x, z), makes this approach totally impossible in any realistic situation.
We have evaluated instead the weaker property of reliability (also called calibration).Reliability of a probabilistic estimation system (i.e. a system that produces probabilities for the quantities to be estimated) is the statistical consistency between the predicted probabilities and the observed frequencies of occurrence.
Consider a probability distribution π (the words probability distribution must be taken here in the broadest possible sense, meaning as well discrete probabilities for the occurrence of a binary or multi-outcome event, as continuous distributions for a one-or multi-dimensional random variable), and denote π (π ) the distribution of the reality in the circumstances when π has been predicted.Reliability is the property that, for any π, the distribution π (π ) is equal to π.
Reliability can be objectively evaluated, provided a large enough verification sample is available.Bayesianity clearly implies reliability.For any data vector z, the true state vector x is distributed according to the conditional probability distribution P (x|z), so that a probabilistic estimation system which always produces P (x|z) is reliable.The converse is clearly not true.A system which, ignoring the observations, always produces the climatological probability distribution for x will be reliable.It will however not be Bayesian (at least if, as one can reasonably hope, the available data bring more than climatological information on the state of the system).
Another desirable property of a probabilistic estimation system, although not directly related to Bayesianity, is resolution (also called sharpness).It is the capacity of the system for a priori distinguishing between different outcomes.For instance, a system which always predicts the climatological probability distribution is perfectly reliable, but has no resolution.Resolution, like reliability, can be objectively evaluated if a large enough verification sample is available.
We will use several standard diagnostic tools for validation of our results.We first note that the error in the mean of the predicted ensembles is itself a measure of resolution.The smaller that error, the higher the capacity of the system to a priori distinguish between different outcomes.Concerning reliability, the classical rank histogram and the reduced centred random variable (RCRV) (the latter is described in Appendix A) are (non-equivalent) measures of the reliability of probabilistic prediction of a scalar variable.The reliability diagram and the associated Brier score are relative to probabilistic prediction of a binary event.The Brier score decomposes into two parts, which measure respectively the reliability and the resolution of the prediction.The definition used here for those components is given in Appendix A (Eqs.A4 and A5 respectively).Both scores are positive, and are negatively oriented, so that perfect reliability and resolution are achieved when the corresponding scores take the value 0. For more on these diagnostics and, more generally, on objective validation of probabilistic estimation systems, see e.g.chap.8 of the book by Wilks (2011), as well as the papers by Talagrand et al. (1997) and Candille and Talagrand (2005).

Numerical results: the linear case
We present in this section results obtained in an exactly linear and Gaussian case, in which theory says that EnsVAR must produce an exact Monte Carlo Bayesian sample.These results are to be used as a benchmark for the evaluation of later results.The numerical model (Eq.6) is obtained by linearizing the non-linear Lorenz model, which describes the space-time evolution of a scalar variable denoted x, about one particular solution (the Lorenz model will be described and discussed in more detail in Sect.5; see Eq. 12 below).The model space dimension N is equal to 40.The length T of the assimilation windows is 5 days, which covers N t = 20 timesteps (the "day" will be defined in the next section).The complete state vector (H k = I in Eq. 7) is observed every 0.5 days (K = 10).The data vector z has therefore dimension (K + 1)N = 440.The observation errors are Gaussian, spatially uncorrelated, with constant standard deviation σ = 0.1 (R k = σ 2 I, ∀k).However, because of the linearity, the absolute amplitude of those errors must have no impact.
Since conditions for exact Bayesianity are verified, any deviation in the results from exact reliability can be due to only the finiteness N ens of the ensembles (except for the rank histogram, which takes that finiteness into account), the finiteness N win of the validation sample, or numerical effects (such as resulting from incomplete minimization or round-off errors).
Figure 1 shows the root-mean-square errors from the truth along the assimilation window, averaged at each time over all grid points and all realizations.The upper (blue) curve shows the average error in the individual minimizing solutions of J iens (Eq.9).The lower (red) curve shows the error in the mean of the individual ensembles, while the green curve shows the error in the fields obtained in minimizations performed with the raw unperturbed observations y k (Eq.7).
All errors are smaller than the observation error (horizontal dashed-dotted line).The estimation errors are largest at both ends of the assimilation window and smallest at some intermediate time.As known, and already discussed by various authors (Pires et al., 1996;Trevisan et al., 2010), this is due to the fact that the error along the stable components of the flow decreases over the assimilation window, while the error along the unstable components increases.The ratio between the values on the blue and green curves, averaged over the whole assimilation window, is equal to 1.414.This is close to √ 2 as can be expected from the linearity of the process and the perturbation procedure defined by Eqs. ( 7)-( 8) (actually, it can be noted that the value √ 2 is itself, independently of any linearity, a test for reliability, since the standard deviation of the difference between two independent realizations of a random variable must be equal to √ 2 times the standard deviation of the variable itself).The green curve corresponds to the expectation of (what must be) the Bayesian probability distribution, while the red curve corresponds to a sample expectation, computed over N ens ele-ments.The latter expectation is therefore not, as can be seen on the figure, as accurate an estimate of the truth.The relative difference must be about 1 2N ens ≈ 0.017.This is the value obtained here.
For a reliable system, the reduced centred random variable, which we denote s, has expectation 0 and variance 1 (see Appendix A).The sample values, computed over all grid points, times, and assimilation windows (which amounts to a set of size N x •(N t +1)•N win = 7.56×10 6 ), are E(s) = 0.0035 and Var(s) = 1.00.
Figure 2 shows other diagnostics of the statistical performance of the system, performed again over all 7.56 × 10 6 individual ensembles in the experiment.The top-left panel is the rank histogram.The top-right panel is the reliability diagram relative to the event {x > 1.14}, which occurs with frequency 0.32 (black horizontal dashed-dotted line in the diagram).Both panels visually show high reliability (flatness for the histogram, closeness to the diagonal for the reliability diagram), although that reliability is obviously not perfect.More accurate quantitative diagnostics are given by the lower panel, which shows, as functions of the threshold τ , the two components (reliability and resolution; see Eqs.A4 and A5 respectively) of the Brier score for the events {x > τ }.The reliability component is about 10 −3 ; the resolution component is about 5 × 10 −2 .A further diagnostic has been made by comparison with an experiment in which the validating truth has been obtained, for each of the N win windows, from an additional independent (N ens + 1)st variational assimilation.That procedure is by construction perfectly reliable, and any difference with Fig. 2 could result only from the fact that the validating truth is not defined by the same process.The reliability (not shown) is very slightly improved in comparison with Fig. 2 (this could be possibly due to a lack of full convergence of the minimizations).The resolution is not modified.
It is known that the minimum J min = J (x a ) of the objective function (Eq. 3) takes on average the value where p = N z − N x has been defined as the degree of overdeterminacy of the minimization.This result is true provided the following two conditions are verified: (i) the operator is linear and (ii) the error ζ in Eq. ( 1) has expectation 0 and the covariance matrix used in the objective function (Eq.3).It is independent of whether ζ is Gaussian or not.But when ζ is Gaussian, the quantity 2J min follows a χ 2 probability distribution of order p (for that reason, Eq. 10 is often called the χ 2 condition, although it is verified in circumstances where 2J min does not follow a χ 2 distribution).As a consequence, the minimum J min has standard deviation In the present case, N x = 40 and N z = (K + 1)N x = 440, so that p/2 = 200 and √ p/2 ≈ 14.14.The histogram of the minima J min (corrected for a multiplicative factor 1/2 resulting from the additional perturbations, Eq. 8) is shown in Fig. 3.The corresponding empirical expectation and standard deviation are 199.39 and 14.27 respectively, in agreement with Eqs. ( 10)-( 11).It can be noted that, as a consequence of the central limit theorem, the histogram in Fig. 3 is in effect Gaussian.Indeed the value of negentropy, a measure of Gaussianity that will be defined in the next section, is 0.0012.
For the theoretical conditions of exact Bayesianity considered here, reliability should be perfect and should not be degraded when the information content of the observations de- creases (through increased observation error and/or degraded spatial and/or temporal resolution of the observations).Statistical resolution should, on the other hand, be degraded.Experiments have been performed to check this aspect (the exact experimental procedure is described in Sect.5).The numerical results (not shown) are that both components of the Brier score are actually degraded and can increase by 1 order of magnitude.The reliability component always remains much smaller than the resolution component, and the degradation of the latter is much more systematic.This is in good agreement with the fact that the degradation of reliability can be due to only numerical effects, such as less efficient minimizations.
The above results, obtained in the case of exact theoretical Bayesianity, are going to serve as reference for the evaluation of EnsVAR in non-linear and non-Gaussian situations where Bayesianity does not necessarily hold.

Numerical results: the non-linear case
The non-linear Lorenz-96 model (Lorenz, 1996;Lorenz and Emanuel, 1998) reads where j = 1, . .., N represent the spatial coordinate (longitude), with cyclic boundary conditions.As in Lorenz (1996), we choose N = 40 and F = 8.For these values, the model is chaotic with 13 positive Lyapunov exponents, the largest of which has a value of (2.5 day) −1 , where 1 day is equal to 0.24 time unit in Eq. ( 12).This is the definition of "day" we will use hereafter.It is slightly different from the choice made in Lorenz (1996), where the day is equal to 0.2 time unit in Eq. ( 12).The difference is not critical for the sequel, nor for possible comparison with other works.
Except for the dynamical model, the experimental setup is fundamentally the same as in the linear case.In particular, the model time step 0.25 days (our definition), the observation frequency 0.5 days, and the values N ens = 30 and N win = 9000 are the same.The observation error is uncorrelated in space and time, with constant variance σ 2 = 0.4 (R k = σ 2 I, ∀k).The associated standard deviation σ = 0.63 is equal to 2 % of the variability of the reference solution (it is because of the different range of variability that the value of σ has been chosen different from the value in the linear case).We mention again that no cycling is present between successive assimilation windows.
The results are shown on Fig. 4. The top panels are relative to one particular assimilation window.In the left panel, where the horizontal coordinate is the spatial position j , the black dashed curve is the reference truth at the initial time of the assimilation window, the blue circles are the corresponding observations, and the full red curves (N ens = 30 of them) are the minimizing solutions at the same time.The right panel, where the horizontal coordinate is time along the assimilation window, shows the truth (dashed curve) and the N ens minimizing solutions (full red curves) at three differ- ent points in space.Both panels show that the minimizations reconstruct the truth with a high degree of accuracy.
The bottom panel, which shows error statistics accumulated over all assimilation windows, is in the same format as Fig. 1 (note that, because of the different dynamics and observational error, the amplitude on the vertical axis is different from Fig. 1).The conclusions are qualitatively the same.The estimation error, which is smaller than the observational error, is maximum at both ends of the assimilation window and minimum at some intermediate time.The ratio between the blue and red curves, equal on average to 1.41, is close to the value √ 2, which, as already said, is in itself an indication of reliability.But a significant difference is that the green curve lies now above the red curve.One obtains a better approximation of the truth by taking the average of the N ens minimizing solutions than by performing an assimila- tion on the raw observations (Eq.7).This is an obvious nonlinear effect.One can note it is fully consistent with the fact that the expectation of the a posteriori Bayesian probability distribution is the variance-minimizing estimate of the truth.The expectation and variance of the RCRV are respectively E(s) = 0.012 and Var(s) = 1.047.
Figure 5, which is in the same format as Fig. 2, shows similar diagnostics: rank histogram; reliability diagram for the event {x < 1.0}, which occurs with frequency 0.33; and the two components of the Brier score for events of the form {x > τ }.The general conclusion is the same as in the linear case.High level of reliability is achieved.Actually, the reliability component of the Brier score (bottom panel) is now decreased below 10 −3 .That improvement, in the present situation where exact Bayesianity cannot be expected, can only be due to better numerical conditioning than in the linear case.The resolution component of the Brier score, on the other hand, is increased.
Figure 6 is relative to experiments in which the informative content of the observations, i.e. their temporal density, spatial density, and accuracy (top, middle, and bottom panels respectively), has been varied.Each panel shows the two components of the Brier score, in the same format as in the bottom panels of Figs. 2 and 5 (but with more curves corresponding to different informative contents).The reliability component (red curves) always remains significantly smaller than the resolution component (blue curves).With the exception of the reliability component in the top panel, both components are systematically degraded when the information content of the observations decreases.This is certainly to be expected for the resolution component, but not necessarily for the reliability component.The degradation of the latter is significantly larger than in the linear case (not shown), where we concluded that it could be due only to degradation of numerical conditioning.The degradation of reliability in the lower two panels may therefore be due here to nonlinearity.One noteworthy feature is that the degradation of the resolution scores, for the same total decrease in the number of observations, is much larger for the decrease in spatial density than for the decrease in temporal density (middle and top panels respectively).Less information is therefore lost in degrading the temporal than the spatial density of observations.Figure 7 shows the distribution of (half) the minima of the objective function (it contains the same information as Fig. 3, in a different format).Most values are concentrated around the linear value 200, but a small number of values are present in the range 600-1000.Excluding these outliers, the expectation and standard deviation of the minima are 199.62 and 14.13 respectively.These values are actually in better agreement with the theoretical χ 2 values (200 and 14.14) than the ones obtained above in the theoretically exact Bayesian case (199.39 and 14.27).This again suggests better numerical conditioning for the non-linear situation.
In view of previous results, in particular results obtained by Pires et al. (1996), a likely explanation for the presence of the larger minima in Fig. 7 is the following.Owing to the non-linearity of Eq. ( 12), and more precisely to the folding which occurs in state space as a consequence of the chaotic character of the motion, the uncertainty in the initial state is distributed along a folded subset in state space.It occasionally happens that the minimum of the objective function falls in a secondary fold, which corresponds to a larger value of the objective function.This aspect will be further discussed in the second part of the paper.In any case, the presence of larger minima of the objective function is an obvious sign of non-linearity.
Non-linearity is also obvious in Fig. 8, which shows, for one particular minimization, a cross section of the objective function between the starting point of the minimization and the minimum of the objective function (black curve), as well as a parabola going through the starting point and having the same minimum (red curve).The two curves are distinctly different, while they would be identical in a linear case.
We have evaluated the Gaussian character of univariate marginals of the ensembles produced by the assimilation by computing their negentropy.The negentropy of a probability distribution is the Kullback-Leibler divergence of that distribution with respect to the Gaussian distribution with the same expectation and variance (see Appendix B).The negentropy is positive and is equal to 0 for exact Gaussianity.The mean negentropy of the ensembles is here ≈ 10 −3 , indicating closeness to Gaussianity (for a reference, the negentropy of the Laplace distribution is 0.072).Although non-linearity is present in the whole process, EnsVAR produces ensembles that are close to Gaussianity.
Experiments have been performed in which the observational error, instead of being Gaussian, has been taken to follow a Laplace distribution (with still the same variance σ 2 = 0.4).No significant difference has been observed in the results in comparison with the Gaussian case.This suggests that the Gaussian character of the observational error is not critical for the conclusions obtained above.6 Comparison with the ensemble Kalman filter and the particle filter We present in this section a comparison with results obtained with the ensemble Kalman filter (EnKF) and the particle filter (PF).As used here, those filters are sequential in time.Fair comparison is therefore possible only at the end of the assimilation window.Figure 9 shows the diagnostics obtained from EnsVAR at the end of the window (the top-left panel, identical with the top-right panel of Fig. 4, is included for easy comparison with the figures that will follow).Compar-ison with Fig. 5 shows that the reliability (as measured by the rank histogram, the reliability diagram, and the reliability component of the Brier score) is significantly degraded.It has been verified (not shown) that this degradation is mostly due not to a really degraded performance at the end of the window, but to the use of a smaller validation sample (by a factor of N t+1 = 21, which leads to a sample with size 3.6 × 10 5 ).
Figure 10, which is in the same format as Fig. 9, shows the same diagnostics for the EnKF.The algorithm used is the one described by Evensen (2003).It is stochastic in the sense that observations have been perturbed randomly, for updat- ing the background ensembles, according to the probability distribution of the observation errors.Spatial localization of the background error covariance matrix has been implemented by Schur-multiplying the sample covariance matrix by a squared exponential kernel with length scale 12.0 (the positive definiteness of the periodic kernel has been ensured by removing its negative Fourier components).And multiplicative inflation with factor r = 1.001 has been applied, as in Anderson and Anderson (1999), on the ensemble after each analysis.
Comparison with Fig. 9 shows that the individual ensembles, after a warm-up period, tend to remain more dispersed than in EnsVAR (top-left panel).Reliability, as measured by the reliability diagram and the Brier score, is similar to what it is in Fig. 9.But it is significantly degraded as evaluated by the rank histogram.The ensembles, although they have larger absolute dispersion than in EnsVAR, tend to miss reality more often.
Following comments from referees, we have made a few experiments not using localization in the EnKF.The RMSE and the RCRV are significantly degraded, while the rank histogram and the resolution component of the Brier score are improved.The reliability component of the Brier score remained the same.All this is true for both assimilation and forecast.These results, not included in the paper, would deserve further studies which are postponed for a future work.
Figure 11 (again in the same format as Fig. 9) shows the same diagnostics for a particle filter.The algorithm used here is the "Sampling Importance Particle Filter" presented in Arulampalam et al. (2002).Comparison with Fig. 10 shows first that the individual ensembles are still more dispersed than in EnKF (top-left panel).It also shows a slight degradation of the reliability component of the Brier score (and, incidentally, a significant degradation of the resolution component), but no visible difference on the reliability diagram.Concerning the rank histogram, PF produces unequally weighted particles, and the standard histogram could not be used.A histogram has been built instead on the quantiles defined by the weights of the particles.This shows, as for EnKF, a significant tendency to miss the truth.
The left column of Table 1 shows the mean root-meansquare error in the means of the ensembles as obtained from  Figures 12-14 are relative to ensemble forecasts performed, for each of the three assimilation algorithms, from the ensembles obtained at the end of the 5-day assimilations.They are in the same format as Fig. 9 and show diagnostics at the end of 5-day forecasts.One can first observe that the dispersion of individual forecasts (top-left panels) increases, as can be expected, with the forecast range, but much less with the EnsVAR than with EnKF and PF.Reliability, as measured by the Brier score, is slightly degraded in all three algorithms with respect to the case of the assimilations.It is slightly worse for EnKF than for EnsVAR and significantly worse for PF.Resolution is, on the other hand, significantly degraded in all three algorithms.This is associated with the dispersion of ensembles and corresponds to what could be expected.Concerning the rank histograms, the histogram of EnsVAR, although still noisy, shows no systematic sign of over-or underdispersion of the ensembles.The EnKF and PF histograms both present, as before, what appears to be a significant underdispersion.
Finally, the right column of Table 1 shows that RMS errors, which are of course now larger, still rank comparatively in the same order as before, i.e.EnsVAR < EnKF < PF.
www.nonlin-processes-geophys.net/25/565/2018/ Nonlin.Processes Geophys., 25, 565-587, 2018 where the spatial period L is a bifurcation parameter for the system.The K-S equation models pattern formations in different physical contexts and is a paradigm of lowdimensional behaviour in solutions to partial differential equations.It arises as a model amplitude equation for interfacial instabilities in many physical contexts.It was originally derived by Kuramoto andTsuzuki (1975, 1976) to model small thermal diffusive instabilities in laminar flame fronts in two space dimensions.Equation (13) has been used here with the value L = 32π and has been discretized to 64 Fourier modes.In accordance with the calculations of Manneville (1985), we observe chaotic motion with 27 positive Lyapunov exponents, with the largest one being λ max ≈ 0.13.
Equation ( 13) is known to be stiff.The stiffness is due to rapid exponential decay of some modes (the dissipative part) and to rapid oscillations of other modes (the dispersive part).Figure 15, where the two panels are in the same format as Fig. 1, shows the errors in the EnsVAR assimilations, in both a linearized (top panel) and a fully non-linear (bottom panel) cases.The length of the assimilation window, marked as 1 on the figure, is equal to 1 λ max ≈ 7.7 in units of Eq. ( 13), i.e. a typical predictability time of the system.The shapes of the curves show that the K-S equation has globally more stability and less instability than the Lorenz equation.The figure shows similar performance for the linear and non-linear situation.Other results (not shown) are also qualitatively very similar to those obtained with the Lorenz equation: high re-liability of the ensembles produced by EnsVAR and slightly superior performance over EnKF and PF.

Summary and conclusions
Ensemble variational assimilation (EnsVAR) has been implemented on two small-dimension non-linear chaotic toy models, as well as on linearized versions of those models.
One specific goal of the paper was to stress what is in the authors' mind a critical aspect, namely to systematically evaluate ensembles produced by ensemble assimilation as probabilistic estimators.This requires us to consider these ensembles as defining probability distributions (instead of evaluating them principally, for instance, by the error in their mean).In view of the impossibility of objectively validating the Bayesianity of ensembles, the weaker property of reliability has been evaluated instead.In the linear and Gaussian case, where theory says that EnsVAR is exactly Bayesian, the reliability of the ensembles produced by EnsVAR is high, but not numerically perfect, showing the effect of sampling errors and, probably, of numerical conditioning.
In the non-linear case, EnsVAR, implemented on temporal windows on the order of magnitude of the predictability time of the systems, shows as good (and in some cases slightly better) performance as in the exactly linear case.Comparison with the ensemble Kalman filter (EnKF) and the particle filter (PF) shows EnsVAR is globally as good a statistical estimator as those two other algorithms.
On the other hand, EnsVAR, at it has been implemented here, is numerically more costly than either EnKF or PF.And the specific algorithms used for the latter two methods may not be the most efficient.But it is worthwhile to evaluate EnsVAR in the more demanding conditions of stronger nonlinearity.That is the object of the second part of this work.

Figure 1 .
Figure1.Root-mean-square errors from the truth as functions of time along the assimilation window (linear and Gaussian case).Blue curve: error in individual minimizations.Red curve: error in the means of the ensembles.Green curve: error in the assimilations performed with the unperturbed observations y k (Eq.7).Dashed-dotted horizontal curve: standard deviation of the observation error.Each point on the blue curve corresponds to an average over a sample of N x • N win • N ens = 1.08 × 10 7 elements and each point on the red and green curves to an average over a sample of N x • N win = 3.6 × 10 5 elements.

Figure 2 .
Figure 2. Diagnostics of statistical performance (linear and Gaussian case).(a) Rank histogram for the model variable x.(b) Reliability diagram for the event E = {x > 1.14} (black horizontal dashed-dotted line: frequency of occurrence of the event).(c)Variation with threshold τ of the reliability and resolution components of the Brier score for the events E = {x > τ } (red and blue curves respectively; note the logarithmic scale on the vertical).The diagnostics have been computed over all grid points, timesteps, and realizations, making up a sample of size 7.56 × 10 6 .

Figure 3 .
Figure3.Histogram of (half) the minima of the objective function (Eq.9), along with the corresponding mean (vertical black line) and standard deviation (horizontal blue line) (linear and Gaussian case).

Figure 4 .
Figure 4. Diagnostics relative to the non-linear and Gaussian case, with assimilation over 5 days.(a) and (b) are relative to one particular assimilation window.(a) (horizontal coordinate: spatial position j ) Reference truth at the initial time of the assimilation window (black dashed curve), observations (blue circles), and minimizing solutions (full red curves).(b) (horizontal coordinate: time along the assimilation window) Truth (dashed curve) and minimizing solutions (full red curves) at three points in space.(c) Overall diagnostics of estimation errors (same format as in Fig. 1).

Figure 5 .
Figure5.Same as Fig.2, for the non-linear case (for the event E = {x < 1.}, which occurs with frequency 0.33, as concerns the reliability diagram on the top-right panel).

Figure 6 .
Figure 6.Impact of the informative content of observations on the two components of the Brier score (non-linear case).The format of each panel is the same as the format of the bottom panels of Figs. 2 and 5 (red and blue curves: reliability and resolution components respectively).(a) Impact of the temporal density of the observations.Observations are performed every grid point, with error variance σ 2 = 0.4; every time step (full curves); and every second and fourth timestep (dashed and dashed-dotted curves respectively).(b) Impact of the spatial density of the observations.Observations are performed every timestep, with error σ 2 = 0.4; at every grid point (full curves); and every second and fourth grid point (dashed and dashed-dotted curves respectively).(c) Impact of the variance σ 2 of the observation error.Observations are performed every second timestep and at every grid point with observation error SD σ = √ 0.4, 2 √ 0.4, and 4 √ 0.4 (full, dashed, and dashed-dotted curves respectively.

Figure 7 .
Figure 7. Values of (half) the minima of the objective function for all realizations (non-linear case) (horizontal coordinate: realization number; vertical coordinate: value of the minima).

Figure 8 .
Figure8.Cross section of the objective function J iens , for one particular minimization, between the starting point of the minimization and the minimum of J iens (black curve).Parabola going through the starting point and having the same minimum (red curve).

Figure 9 .
Figure 9. (a) Identical with the top-right panel of Fig. 4, repeated for comparison with figures that follow.The other panels show the same diagnostics as in Fig. 5 but performed at the final time of the assimilation windows.(b) Rank histogram.(c) Reliability diagram for the event E = {x > 1.33}, which occurs with frequency 0.42.(d) Components of the Brier score for the events E = {x > τ } (same format as in the bottom panels of Figs. 2 and 5).

Figure 11 .
Figure 11.Same as Fig. 9, for the particle filter.

Figure 12 .
Figure12.Same as Fig.9, but at the end of 5-day forecasts.On the top-left panel the horizontal axis spans both the assimilation and the forecast intervals.

Figure 15 .
Figure15.Same as Fig.1, for variational ensemble assimilations performed on the Kuramoto-Sivashinsky equation, i.e. root-mean-square error from the truth along the assimilation window, averaged at each time over all grid points and all realizations, for both the linear and non-linear cases (a and b respectively).

Table 1 .
RMS errors at the end of 5 days of assimilation (left column) and of 5 days of forecast (right column) for the three algorithms.