This paper presents results of ensemble Riemannian data assimilation (EnRDA) for relatively high-dimensional nonlinear dynamical systems, focusing on the chaotic Lorenz-96 model and a two-layer quasi-geostrophic (QG) model of atmospheric circulation. The analysis state in this approach is inferred from a joint distribution that optimally couples the background probability distribution and the likelihood function, enabling formal treatment of systematic biases without any Gaussian assumptions. Despite the risk of the curse of dimensionality in the computation of the coupling distribution, comparisons with the classic implementation of the particle filter and the stochastic ensemble Kalman filter demonstrate that, with the same ensemble size, the presented methodology could improve the predictability of dynamical systems. In particular, under systematic errors, the root mean squared error of the analysis state can be reduced by 20 % (30 %) in the Lorenz-96 (QG) model.

The science of data assimilation (DA) aims to optimally estimate the probability distribution of a state variable of interest in an Earth system model (ESM), given the information content of observations and previous-time forecasts, to improve the model's predictive ability

Apart from the Euclidean distance, other measures and distance metrics, including the quadratic mutual information

In the filtering class of DA methodologies, coupling techniques have been proposed as an alternative to the classic Bayesian inference

However, DA frameworks utilizing the Wasserstein distance are computationally expensive as they require a joint distribution to be obtained that couples two marginal distributions. Finding this joint distribution often relies on interior-point optimization methods

Unlike Euclidean DA, which has a known connection to the family of Gaussian distributions through Bayes' theorem, EnRDA does not rely on any parametric assumptions about the input probability distributions. Therefore, it does not guarantee an analysis state with a minimum mean squared error. However, it enables us to optimally (i) interpolate between the forecast distribution and the normalized likelihood function without any parametric assumptions about their shapes and (ii) formally penalize systematic translations between them that arise from potential geophysical biases.
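
The two properties above can be illustrated with a minimal one-dimensional sketch. It is not the EnRDA implementation: it assumes two equally weighted ensembles of equal size, for which the optimal coupling under a squared Euclidean cost simply pairs sorted members, so the barycenter is a weighted average of sorted samples. The function name `barycenter_1d` and the convention that `eta = 0` returns the background are assumptions made for this illustration and may differ from the convention used for the displacement parameter in the paper.

```python
import numpy as np

def barycenter_1d(background, likelihood_samples, eta):
    """Displacement interpolation between two equal-size, equally weighted
    1-D ensembles. eta = 0 returns the sorted background and eta = 1 the
    sorted likelihood samples (hypothetical convention for this sketch)."""
    xb = np.sort(np.asarray(background, dtype=float))
    xo = np.sort(np.asarray(likelihood_samples, dtype=float))
    if xb.shape != xo.shape:
        raise ValueError("this sketch assumes ensembles of equal size")
    # In 1-D, the optimal plan for a convex cost matches sorted members.
    return (1.0 - eta) * xb + eta * xo

rng = np.random.default_rng(0)
# Skewed, non-Gaussian background ensemble.
background = rng.gamma(shape=2.0, scale=1.0, size=500)
# Samples standing in for a normalized likelihood, shifted by a bias.
observed = 3.0 + rng.normal(0.0, 0.5, size=500)

analysis = barycenter_1d(background, observed, eta=0.5)
```

In this sketch the analysis mean lies between the two input means, so a systematic translation between the inputs is split according to `eta`, while the shape of the analysis ensemble is a blend of the two input shapes rather than a fitted Gaussian.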

However, the computational complexity of finding an optimal joint coupling between two

The outline of the paper is as follows. Section

We provide a brief background on the theory of optimal mass transport (OMT) and Wasserstein barycenters. The OMT theory, first put forward by

Let us consider a discrete source probability distribution

The problem formulation by Monge as expressed in Eq. (

Recall that, over the Euclidean space, the barycenter of a group of points is equivalent to their (weighted) mean value. The Wasserstein metric offers a Riemannian generalization of this problem and allows us to define the barycenter of a family of probability distributions

To keep the paper self-contained, this section provides a brief summary of the EnRDA methodology; more details can be found in

Hereafter, we drop the time superscript for brevity and represent the model (or background) probability distribution as

To solve the above DA problem, we need to characterize the background distribution and the normalized likelihood function. Similar to the approach used in the particle filter

To obtain the Wasserstein barycenter

Computation of the joint distribution in Eq. (

As an example, we examine here the solution of Eq. (

It is important to note that in the original OMT formulation, the number of support points required for the optimal joint coupling

The analysis distribution obtained as a Wasserstein barycenter for different values of the displacement parameter

The Lorenz model (Lorenz-96,

We focus on the 40-dimensional Lorenz-96 system (i.e.,

Similar to the suggested experimental setting in

To characterize the distribution of the background state for each DA methodology, 50 (5000) ensemble members (particles) for the SEnKF and EnRDA (PF) are generated using model errors

The results of EnRDA are shown in Fig.

Temporal evolution of the root mean squared error (RMSE) for the

As previously noted, the displacement parameter

The bias and RMSE, together with their respective 5th–95th percentile bounds, as functions of the displacement parameter

One may argue that such a tuning favors EnRDA since it explicitly accounts for the effects of bias, either in background or observations, while there is no bias-correction mechanism in the implementation of the SEnKF and the PF. To make a fairer comparison, we investigate an alternative approach to approximate the displacement parameter solely based on the known error covariance matrices at each assimilation cycle. Recall that in classic DA, the analysis state is essentially the Euclidean barycenter, where the relative weights of the background state and observations are optimally characterized based on the error covariances under zero bias assumptions. However, over the Wasserstein space, the displacement parameter determines the weight between the entire distribution of the background and the normalized likelihood function. Theoretically, knowing the Wasserstein distances from ground truth to both likelihood function and forecast probability distribution enables us to obtain an optimal value for

It is known that the square of the Wasserstein distance between two equal-mean Gaussian distributions
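
As a hedged illustration of the Gaussian case, the sketch below evaluates the standard closed form for the squared 2-Wasserstein distance between two equal-mean Gaussians with covariances A and B, namely tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}). The function names are hypothetical, and how this quantity enters the approximation of the displacement parameter is not reproduced here.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def w2_squared_equal_mean(cov_a, cov_b):
    """Squared 2-Wasserstein distance between N(m, cov_a) and N(m, cov_b)."""
    root_a = sqrtm_psd(cov_a)
    cross = sqrtm_psd(root_a @ cov_b @ root_a)
    return float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross))

# Isotropic example: standard deviations 2 and 1 in three dimensions,
# for which the closed form reduces to 3 * (2 - 1)**2.
cov_background = 4.0 * np.eye(3)
cov_observation = 1.0 * np.eye(3)
d2 = w2_squared_equal_mean(cov_background, cov_observation)
```

For isotropic covariances the expression reduces to the dimension times the squared difference of the standard deviations, which gives a quick sanity check on the implementation.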

Comparisons of the RMSE values for the studied DA methodologies as a function of ensemble size are shown in Fig.

The RMSE for different numbers of ensemble members/particles in the PF, SEnKF, and EnRDA when the displacement parameter is obtained from bias-aware cross-validation (EnRDA-I) and a dynamic approach without a priori knowledge of bias (EnRDA-II) for the Lorenz-96 system. The dashed line is the standard deviation of the observation error.

It is also important to note that the higher RMSE of the PF compared to the SEnKF and EnRDA is due to the problem of filter degeneracy, which is further exacerbated by the presence of systematic errors in model forecasts

To further test the efficiency of EnRDA, another configuration of the Lorenz-96 is implemented using a Laplace-distributed observation error at each assimilation interval of 10

The multilayered QG

For a two-layer QG model (

Due to the high dimensionality of the QG model and the well-known problem of filter degeneracy in the PF, we chose to omit its application to the QG model. Similarly to the study conducted in

From the initial value of the streamfunction field in each layer, potential vorticity is obtained using a nine-point second-order finite difference scheme to compute the Laplacian in Eq. (

The ground truth of the streamfunction is obtained by integrating the QG model with a time step of

To characterize the distribution of the background state, 50 ensemble members for both SEnKF and EnRDA are generated using model errors

In the SEnKF, to alleviate the well-known problem of undersampling

The streamfunction analysis state

The true state, background state, and observations of the bottom layer streamfunction at the first assimilation cycle

The results of the DA experiments using the SEnKF and EnRDA at the first assimilation cycle for the bottom layer are also shown in Fig.

Average root mean squared error (RMSE) values as a function of the displacement parameter

The average RMSE values as a function of assimilation intervals 6, 12 and 18 h in the SEnKF and EnRDA for the two-layer quasi-geostrophic model.

We further examined the performance of the EnRDA and the SEnKF on the QG model with a

In this study, we demonstrated that data assimilation (DA) over the Wasserstein space through the EnRDA

One of the major weaknesses of the presented methodology in its current form is that all dimensions of the problem are assumed to be observable. This is an important issue when it comes to the assimilation of sparse data. Future research is needed to address partial observability in DA over the Wasserstein space. A possible direction is through multi-marginal optimal mass transport

It should be noted that the experimental settings presented here deal only with a univariate state variable. The use of a scalar regularization parameter in the EnRDA penalizes the transportation cost matrix elements uniformly, even when the physical variables of interest differ by orders of magnitude. A possible future solution to this problem is to instead use a Mahalanobis or a weighted Euclidean distance

Although EnRDA demonstrated a reasonable performance on the presented dynamical systems without significant computational burden, the computational complexity might be a limiting factor for its large-scale implementation. In Earth system models, where the dimension easily exceeds hundreds of millions, dimensionality reduction might be necessary. One might hypothesize that the optimal transportation plan remains unaltered under a change of basis. Thus, future research can be devoted to examining the optimal transportation plan for the principal components
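
A small numerical check can make the change-of-basis hypothesis concrete. The sketch below assumes a squared Euclidean transportation cost and uses synthetic points; all names are hypothetical. It shows that the pairwise cost matrix is unchanged under an orthogonal change of basis, so any plan that is optimal in one basis is optimal in the other, whereas keeping only a subset of directions changes the costs.

```python
import numpy as np

def sq_cost(x, y):
    """Pairwise squared Euclidean distances between rows of x and rows of y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff**2, axis=-1)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))          # six synthetic background members in 4-D
y = rng.normal(size=(5, 4)) + 1.0    # five synthetic likelihood samples in 4-D

# Random orthogonal basis from a QR factorization.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

cost_original = sq_cost(x, y)
cost_rotated = sq_cost(x @ q, y @ q)                    # full change of basis
cost_truncated = sq_cost(x @ q[:, :2], y @ q[:, :2])    # keep two directions only
```

Here `cost_rotated` matches `cost_original` to round-off, while `cost_truncated` is never larger and generally smaller. Whether the optimal plan survives such a truncation to leading principal components is precisely the open question raised above, and this sketch does not answer it.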

To solve the regularized optimal mass transport problem in Eq. (

Now, we set the first-order derivative of the Lagrangian form in Eq. (

By setting the derivatives of the Lagrangian with respect to the Lagrange multipliers to zero, we recover the two conditions, which we can write as
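
As a hedged sketch of where such a derivation typically leads, the code below assumes that the regularization term is entropic. In that case the stationarity condition gives a coupling of the form diag(u) K diag(v) with K = exp(-C / gamma), and the two marginal conditions are enforced by alternately rescaling u and v, an iteration commonly known as Sinkhorn's algorithm. The names `sinkhorn` and `gamma`, and the toy problem, are assumptions for illustration and are not taken from the authors' code.

```python
import numpy as np

def sinkhorn(cost, a, b, gamma, n_iter=1000, tol=1e-10):
    """Entropy-regularized coupling with row marginal a and column marginal b."""
    kernel = np.exp(-cost / gamma)      # Gibbs kernel K = exp(-C / gamma)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)          # enforce the column marginal
        u_new = a / (kernel @ v)        # enforce the row marginal
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    return u[:, None] * kernel * v[None, :]

# Toy problem: five source points and six target points on a line.
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.2, 1.2, 6)
cost = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 1.0 / 5.0)
b = np.full(6, 1.0 / 6.0)

plan = sinkhorn(cost, a, b, gamma=0.5)
```

Very small values of `gamma` relative to the entries of the cost matrix can make the kernel underflow in this plain form; a log-domain variant is the usual remedy and is omitted here for brevity.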

When the ensemble size in the SEnKF is much smaller than the state dimension, as in the presented case of the quasi-geostrophic model, the forecast error covariance matrix is underestimated, which can subsequently lead to filter divergence. To alleviate this problem, a covariance inflation procedure can be implemented by multiplying the forecast error covariance matrix by an inflation factor

The covariance localization procedure in the SEnKF further attempts to improve its performance by ignoring the spurious long-range dependence in the ensemble background covariance by applying a prespecified cutoff threshold to the correlation structure of the field. An SEnKF equipped with a tuned localization procedure can be efficiently used in high-dimensional atmospheric and ocean models even with fewer than 100 ensemble members
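
The inflation and localization steps described above can be sketched as follows. This is a minimal illustration and not the configuration used in the experiments: it assumes a plain (non-periodic) one-dimensional grid, multiplicative inflation of the sample covariance, and a Gaussian taper applied through an element-wise (Schur) product in place of a hard cutoff. The function name and the parameter values are hypothetical.

```python
import numpy as np

def inflate_and_localize(ensemble, inflation, length_scale, coords):
    """Return an inflated, localized sample covariance.

    ensemble: array of shape (n_state, n_members)
    coords:   array of shape (n_state,), positions on a 1-D grid
    """
    n_members = ensemble.shape[1]
    anomalies = ensemble - ensemble.mean(axis=1, keepdims=True)
    cov = anomalies @ anomalies.T / (n_members - 1)    # sample covariance
    dist = np.abs(coords[:, None] - coords[None, :])   # pairwise grid distance
    # Gaussian taper (an assumption of this sketch): damps long-range entries.
    taper = np.exp(-0.5 * (dist / length_scale) ** 2)
    return inflation * cov * taper                     # element-wise product

rng = np.random.default_rng(2)
coords = np.arange(40.0)
ensemble = rng.normal(size=(40, 10))   # 40 state variables, 10 members
cov_loc = inflate_and_localize(ensemble, inflation=1.05,
                               length_scale=4.0, coords=coords)
```

The taper leaves the diagonal untouched apart from the inflation factor and suppresses covariances between distant grid points, which are the entries most affected by sampling noise when the ensemble is small.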

Following the work of

In our implementation of the SEnKF in the QG model, the inflation factor and length scale were chosen between

A demo code for EnRDA in the MATLAB programming language can be downloaded at

No data sets were used in this article.

SKT and AE designed the study. SKT implemented the formulation and analyzed the results. PJvL, GL and EFG provided conceptual advice, and all the authors contributed to the writing.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors thank the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper.

This research has been supported by the National Aeronautics and Space Administration Remote Sensing Theory program (RST, grant no. 80NSSC20K1717), the Interdisciplinary Research in Earth Science program (IDS, grant no. 80NSSC20K1294), the New (Early Career) Investigator Program (NIP, grant no. 80NSSC18K0742), the H2020 European Research Council (CUNDA, grant no. 694509), and the National Science Foundation (grant nos. DMS1830418 and ECCS-1839441).

This paper was edited by Alberto Carrassi and reviewed by two anonymous referees.