Ensemble Riemannian Data Assimilation over the Wasserstein Space

In this paper, we present an ensemble data assimilation paradigm over a Riemannian manifold equipped with the Wasserstein metric. Unlike the Eulerian penalization of error in the Euclidean space, the Wasserstein metric can capture the translation and difference between the shapes of square-integrable probability distributions of the background state and observations -- enabling formal penalization of geophysical biases in state-spaces with non-Gaussian distributions. The new approach is applied to dissipative and chaotic evolutionary dynamics, and its potential advantages and limitations are highlighted compared to the classic variational and filtering data assimilation approaches under systematic and random errors.


Introduction
Extending the forecast skill of Earth System Models (ESM) relies on advancing the science of Data Assimilation (DA) [15,79]. A large body of current DA methodologies, either filtering [27,28,39,71] or variational approaches [14,25,46,49,61,74,78], are derived from basic principles of Bayesian inference under the assumption that the state-space is unbiased and can be represented well with Gaussian distributions, which is often not consistent with reality [11,66]. It is well documented that this drawback often limits the forecast skill of DA systems [16,22,26,83], especially in the presence of systematic errors [21].
Apart from particle filters [73,80], which are intrinsically designed for state-spaces with non-Gaussian distributions, numerous modifications to the variational DA (VDA) and ensemble-based filtering methods have been made to tackle the non-Gaussianity of geophysical processes [7,35,51,65]. As a few examples, in four-dimensional VDA, a quasi-static VDA is proposed to ensure convergence by gradually increasing the assimilation intervals [65]. Kim et al. [41] proposed modifications to the ensemble Kalman filter [EnKF; 27,47] using an approximate implementation of Bayes' theorem in lieu of linear interpolation via the Kalman gain to deal with multimodal systems. For ensemble-based filters, Anderson [7] proposed a new approach to account for non-Gaussian priors and posteriors by utilizing rank histograms [6,34]. A hybrid ensemble approach was also suggested to combine advantages of both the EnKF and the particle filter [51].
Even though particle filters can handle non-Gaussian likelihood functions, when observations lie away from the support set of the particles, the ensemble variance tends to zero over time, which can render the filter degenerate [67]. In recent years, significant progress has been made to treat systematic errors through numerous ad hoc methods such as the field alignment technique [68] and the morphing EnKF [8], which can tackle position errors between observations and forecasts. The dual state-parameter EnKF [55] was also developed to resolve systematic errors originating from parameter uncertainties. Additionally, bias-aware variants of the Kalman filter were designed [19,20,24,42] to simultaneously update the state-space and an a priori estimate of the additive biases. In parallel, cumulative distribution function matching [70] has garnered widespread attention in land DA.
From a geometrical perspective, Gaussian statistical inference methods exhibit a flat geometry [3]. In particular, it is proved that linear auto-regressive and moving average Markov stochastic models, which are driven by Gaussian noise, form dually flat manifolds [4]. The notion of distance over such a geometrically flat space is defined over a straight line, which can be quantified by the Euclidean distance. Consequently, the Euclidean space has served as a major tool in explaining statistical inference techniques using linear Gaussian models and has been used as a cornerstone of DA techniques. It is important to note that the Euclidean distance is "Eulerian" [58] and thus remains insensitive to the magnitude of translation between probability distributions with disjoint support sets when used to interpolate between them.
Non-Gaussian statistical models often form geometrical manifolds. In the case of nonlinear regression, it is demonstrated that the statistical manifold exhibits a Riemannian geometry [45] over which the notion of distance between probability distributions is geodesic. Such a distance metric shall be Lagrangian to not only capture translation but also the difference between the entire shape of probability distributions [62]. How can we equip DA with a Riemannian geometry? To answer this question, inspired by the theories of optimal mass transport [82], this paper presents the Ensemble Riemannian Data Assimilation (EnRDA) framework using the Wasserstein distance metric.
In recent years, a few attempts have been made to utilize the Wasserstein metric in geophysical DA. Reich [69] introduced an ensemble transform particle filter, where the optimal transport framework was utilized to guide the resampling phase of the filter. Ning et al. [58] used the Wasserstein distance to reduce forecast uncertainty due to parameter estimation errors in dissipative evolutionary equations. Feyeux et al. [29] suggested a novel approach employing the Wasserstein distance in lieu of the Euclidean distance to penalize the position error between state and observations. More recently, Tamang et al. [75] introduced a Wasserstein regularization in a variational setting to correct for geophysical biases under chaotic dynamics.
The EnRDA extends the previous work through the following main contributions: (a) EnRDA defines DA as a discrete barycenter problem over the Wasserstein space for assimilation in the probability domain without any parametric or Gaussian assumption. The framework provides a continuum of non-parametric analysis probability histograms that naturally span between the distributions of the background state and observations through optimal transport of probability masses. (b) EnRDA operates in an ensemble setting using the entropic regularization by utilizing the Sinkhorn algorithm [18] for improved computational efficiency. (c) The paper studies advantages and limitations of DA over the Wasserstein space for dissipative advection-diffusion dynamics and the nonlinear chaotic Lorenz-63 model in comparison with 3D Variational (3D-Var) DA as well as filtering techniques.
The organization of the paper is as follows: Section 2 provides a brief background on Bayesian DA formulations and optimal mass transport. The mathematical formalism of the EnRDA is described in Section 3. Section 4 presents the results and compares them with their Euclidean counterparts. Section 5 discusses the findings and ideas for future research.

Notations
Throughout, small bold letters represent m-element column vectors x = (x_1, ..., x_m)^T ∈ R^m, where (·)^T is the transposition operator. The m-by-n matrices X ∈ R^{m×n} are denoted by capital bold letters, whereas R^m_+ (R^{m×n}_+) denotes those vectors (matrices) containing only non-negative real numbers. The notation 1_m refers to an m-element vector of ones and I_m is an m × m identity matrix. A diagonal matrix with entries given by x ∈ R^m is represented by diag(x) ∈ R^{m×m}. The notation x ∼ N(µ, Σ) denotes that the random vector x is drawn from a Gaussian distribution with mean µ and covariance Σ, and E_X(x) is the expectation of x. The ℓ_q-norm of x is defined as ‖x‖_q = (Σ_{i=1}^m |x_i|^q)^{1/q} with q > 0, and the square of the weighted ℓ_2-norm of x is represented as ‖x‖²_B = x^T B x, where B is a positive definite matrix. The notations x ⊙ y and x ⊘ y represent the element-wise Hadamard product and division between equal-length vectors x and y, respectively. The notation ⟨A, B⟩ = tr(A^T B) denotes the Frobenius inner product between matrices A and B, and tr(·) and det[·] represent the trace and determinant of a square matrix, respectively. Here, p(x) = Σ_{i=1}^M p_{x_i} δ_{x_i} represents a discrete probability distribution with histogram {p_x ∈ R^M_+ : Σ_i p_{x_i} = 1} supported on the points x_i, where δ_{x_i} represents a Kronecker delta function at x_i. Throughout, the dimension of the state or observations is denoted by small letters, as in x ∈ R^m, while the number of ensemble members or support points of their respective probability distribution is denoted by capital letters, as in p_x ∈ R^M_+.

Data Assimilation on Euclidean Space
In this section, we provide a brief review of the derivation of classic variational DA and particle filters based on Bayes' theorem to set the stage for the presented Ensemble Riemannian DA formalism.

Variational Formulation
Let us consider a discrete-time Markovian dynamics and its observations as follows:

x^t = M(x^{t−1}) + ω^t,
y^t = H(x^t) + v^t,    (1)

where x^t ∈ R^m and y^t ∈ R^n represent the state variables and the observations at time t, M : R^m → R^m and H : R^m → R^n are the deterministic forward model and observation operator, and ω^t ∈ R^m and v^t ∈ R^n are the independent and identically distributed model and observation errors, respectively.
Recalling Bayes' theorem and dropping the time superscript without loss of generality, the posterior probability density function (pdf) of the state given the observation can be obtained as p(x|y) = p(y|x) p(x)/p(y), where p(y|x) is proportional to the likelihood function, p(x) is the prior density, and p(y) denotes the distribution of observations. Letting x_b = E_X(x) ∈ R^m represent the background state, ignoring the constant term log p(y), and assuming Gaussian distributions for the observation error and the prior, the logarithm of the posterior density leads to the well-known three-dimensional variational (3D-Var) cost function [49]:

J(x) = (1/2) ‖x − x_b‖²_{B^{-1}} + (1/2) ‖y − H(x)‖²_{R^{-1}},    (2)

where B ∈ R^{m×m} and R ∈ R^{n×n} are the background and observation error covariance matrices. As a result, the analysis state obtained by minimization of the 3D-Var cost function in Eq. (2) is the mode of the posterior distribution, which coincides with the posterior mean when errors are drawn from unbiased Gaussian densities and H is a linear operator. Using the Woodbury matrix inversion lemma [84], it can easily be demonstrated that for a linear observation operator, the analysis states of the 3D-Var and the Kalman filter are equivalent [77]. As is evident, the zero-mean Gaussian assumptions lead to penalization of the error through the weighted Euclidean norm.
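This equivalence can be checked numerically. The following minimal sketch (with arbitrary illustrative covariances, background state, and observation) minimizes the quadratic 3D-Var cost for a linear observation operator and compares the minimizer with the Kalman-filter update:

```python
import numpy as np

m = n = 3
B = np.diag([1.0, 2.0, 0.5])      # background error covariance (illustrative)
R = 0.5 * np.eye(n)               # observation error covariance (illustrative)
H = np.eye(n, m)                  # linear observation operator
x_b = np.array([1.0, -2.0, 0.5])  # background state
y = np.array([1.5, -1.0, 0.0])    # observation

# Minimizer of J(x) = 1/2 (x-x_b)^T B^-1 (x-x_b) + 1/2 (y-Hx)^T R^-1 (y-Hx)
B_inv, R_inv = np.linalg.inv(B), np.linalg.inv(R)
x_var = np.linalg.solve(B_inv + H.T @ R_inv @ H, B_inv @ x_b + H.T @ R_inv @ y)

# Kalman-filter analysis with gain K = B H^T (H B H^T + R)^-1
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_kf = x_b + K @ (y - H @ x_b)

assert np.allclose(x_var, x_kf)   # identical analysis states
```

For nonlinear H or non-Gaussian errors this identity no longer holds, which motivates the alternatives discussed next.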

Particle Filters
Particle filters [23,33,81] were introduced in DA to address non-Gaussian distributions of the state by representing the prior and posterior distributions through a weighted ensemble of model outputs referred to as "particles". In its standard discrete setting, using Monte Carlo simulations, the prior distribution p(x) is represented by a sum of equal-weight Kronecker delta functions as p(x) = Σ_{i=1}^M (1/M) δ_{x_i}, where x_i is the state variable represented by the i-th particle. Each of these M particles is then evolved through the nonlinear model in Eq. (1). Assuming that the conditional distribution p(y|x_i) = (2π)^{−n/2} |R|^{−1/2} exp(−(1/2) ‖y − H(x_i)‖²_{R^{-1}}) is Gaussian, using Bayes' theorem, it can be shown that the posterior distribution p(x|y) can be approximated by a set of weighted particles as p(x|y) ≈ Σ_{i=1}^M w_i δ_{x_i}, where the weights w_i ∝ p(y|x_i) are normalized to sum to one. The particles are then resampled from the posterior distribution p(x|y) based on their relative weights and propagated forward in time according to the model dynamics.
As is evident, in particle filters, the weight of each particle is updated using the Gaussian likelihood function under a zero-mean error assumption. However, in the presence of systematic biases, when the support sets of the particles and the observations are disjoint, only the weights of a few particles become significantly large while the weights of the other particles tend to zero. As the underlying dynamical system progresses in time, only those few particles with relatively larger weights are resampled, and the filter can gradually become degenerate [67].
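The weight update and multinomial resampling described above can be sketched as follows; the ensemble size, observation, and error covariance are illustrative, and the effective sample size 1/Σ_i w_i² diagnoses the degeneracy discussed in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 500, 2
particles = rng.normal(0.0, 1.0, size=(M, n))  # prior ensemble (illustrative)
R = 0.25 * np.eye(n)                           # observation error covariance
y = np.array([0.5, -0.3])                      # observation, H = identity

# Gaussian likelihood weights w_i ∝ exp(-1/2 (y - x_i)^T R^-1 (y - x_i))
innov = y - particles
logw = -0.5 * np.einsum('ij,jk,ik->i', innov, np.linalg.inv(R), innov)
w = np.exp(logw - logw.max())                  # subtract max for stability
w /= w.sum()

ess = 1.0 / np.sum(w ** 2)                     # effective sample size

# Multinomial resampling from the weighted posterior
posterior = particles[rng.choice(M, size=M, p=w)]
```

When the observation lies far outside the particle cloud, `ess` collapses toward one, which is precisely the degeneracy mechanism described above.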

Optimal Mass Transport
The theory of optimal mass transport (OMT), initiated by Gaspard Monge [54] and later extended by Kantorovich [40], was developed to minimize transportation costs in resource allocation problems with purely practical motivations. Recent developments in mathematics have shown that OMT provides a rich ground for comparing and morphing probability distributions, and have uncovered new connections to partial differential equations [38,60] and functional analysis [9,12,82].
In a discrete setting, let us define two discrete probability distributions p(x) = Σ_{i=1}^M p_{x_i} δ_{x_i} and p(y) = Σ_{j=1}^N p_{y_j} δ_{y_j} with their respective histograms {p_x ∈ R^M_+ : Σ_i p_{x_i} = 1} and {p_y ∈ R^N_+ : Σ_j p_{y_j} = 1} supported on the points x_i and y_j. A "ground" transportation cost matrix C ∈ R^{M×N}_+ is defined such that its elements c_ij = ‖x_i − y_j‖_q^q represent the cost of transporting unit probability masses from location x_i to y_j. The Kantorovich OMT problem determines an optimal "transportation plan" U_a ∈ R^{M×N}_+ that linearly maps the two probability measures onto each other with the minimum total transportation cost:

U_a = argmin_U ⟨C, U⟩  subject to  U 1_N = p_x,  U^T 1_M = p_y,  U ≥ 0.    (3)

The transportation plan can be interpreted as a "joint distribution" that couples the marginal histograms p_x and p_y. For the transportation cost with q = 2, the OMT problem in Eq. (3) is convex and defines the square of the 2-Wasserstein distance between the distributions as d²_W(p_x, p_y) = ⟨C, U_a⟩. What is the advantage of the Wasserstein distance for interpolating between probability distributions compared to other measures of proximity, such as the Hellinger distance [36] or the Kullback-Leibler (KL) divergence [43]? To elaborate on this question, we confine our consideration to Gaussian densities, over which the Wasserstein distance can be represented in closed form. Fig. 1 shows the spectrum of interpolated distributions between two Gaussian pdfs for a range of the interpolation parameter η ∈ [0, 1]. As shown, the interpolated densities using the Hellinger distance, which is Euclidean in the space of probability measures, are bimodal. Although the Gaussian shape of the interpolated densities using the KL divergence is preserved, the variance of the interpolants is not necessarily bounded by the variances of the input Gaussian densities.
Unlike these metrics, as shown, the Wasserstein distance moves the mean and preserves the shape of the interpolants through a natural morphing process.
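For small discrete problems, the Kantorovich problem in Eq. (3) is a linear program and can be solved directly; the sketch below (with hypothetical support points and histograms, and q = 2) recovers the optimal plan and the squared 2-Wasserstein distance:

```python
import numpy as np
from scipy.optimize import linprog

# Two hypothetical discrete distributions on the real line
x = np.array([0.0, 1.0, 2.0]); px = np.array([0.5, 0.3, 0.2])
y = np.array([3.0, 4.0]);      py = np.array([0.6, 0.4])
M, N = len(x), len(y)

C = (x[:, None] - y[None, :]) ** 2       # ground cost c_ij = |x_i - y_j|^2

# Equality constraints: U 1_N = p_x (row sums) and U^T 1_M = p_y (column sums)
A_eq = np.zeros((M + N, M * N))
for i in range(M):
    A_eq[i, i * N:(i + 1) * N] = 1.0
for j in range(N):
    A_eq[M + j, j::N] = 1.0
b_eq = np.concatenate([px, py])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
U = res.x.reshape(M, N)                  # optimal transportation plan
w2_sq = float((C * U).sum())             # squared 2-Wasserstein distance
```

In one dimension with a convex cost, the optimal plan is the monotone (quantile) coupling, which this linear program recovers exactly.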
As previously noted, this metric is not limited to any Gaussian assumption. Fig. 2 shows the 2-Wasserstein interpolation between a gamma and a Gaussian distribution. The results illustrate the Lagrangian nature of the Wasserstein metric, which penalizes both the translation and the mismatch between the shapes of the pdfs. It can be shown that d²_W(p_x, p_y) = ‖µ_x − µ_y‖²_2 + d²_W(p̄_x, p̄_y), where p̄_x and p̄_y are the centered zero-mean probability masses and µ_x and µ_y are the respective mean values [63].
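The mean-shape decomposition of the squared Wasserstein distance can be checked numerically in one dimension, where the squared 2-Wasserstein distance between two equal-size empirical samples reduces to the mean squared difference of their sorted values; the gamma and Gaussian samples below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.5, size=2000)   # skewed sample
y = rng.normal(loc=8.0, scale=1.0, size=2000)    # translated Gaussian sample

def w2_sq(a, b):
    # 1-D squared 2-Wasserstein distance between equal-size empirical samples:
    # squared L2 distance between their sorted values (quantile functions)
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

total = w2_sq(x, y)
mean_part = (x.mean() - y.mean()) ** 2           # translation term
shape_part = w2_sq(x - x.mean(), y - y.mean())   # shape-mismatch term

assert np.isclose(total, mean_part + shape_part)
```

For equal-size empirical samples the decomposition holds exactly, since centering both samples subtracts the mean difference from every pair of sorted values.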

Problem Formulation
First, let us recall that the weighted mean of a cloud of points x_i ∈ R^m, for a given family of non-negative weights with Σ_i η_i = 1, is x̄ = Σ_i η_i x_i. This expected value is equivalent to solving the following variational problem:

x̄ = argmin_x Σ_i η_i ‖x − x_i‖²_2.    (4)

Thus, the 3D-Var problem in Eq. (2), after Cholesky decomposition [57] of the error covariance matrices and rearrangement of the terms, can be interpreted as a "barycenter problem" in the Euclidean space, where the analysis state is a weighted mean of the background state and observation.
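The equivalence between the weighted mean and the Euclidean barycenter problem can be verified directly; the points and weights below are arbitrary illustrations:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
pts = rng.normal(size=(5, 2))                # cloud of points x_i
eta = np.array([0.1, 0.3, 0.2, 0.25, 0.15])  # non-negative weights, sum to 1

# Numerically solve argmin_x sum_i eta_i ||x - x_i||^2
obj = lambda x: np.sum(eta * np.sum((x - pts) ** 2, axis=1))
x_opt = minimize(obj, x0=np.zeros(2)).x

# ...which coincides with the weighted mean of the cloud
assert np.allclose(x_opt, eta @ pts, atol=1e-4)
```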
By changing the distance metric from the Euclidean to the Wasserstein [1], a Riemannian barycenter can be defined as the Fréchet mean [30] of N_p probability histograms p_i with finite second-order moments as follows:

p̄ = argmin_p Σ_{i=1}^{N_p} η_i d²_W(p, p_i),  with Σ_i η_i = 1.    (5)

Inspired by [29], the EnRDA defines the probability distribution of the analysis state p(x_a) ∈ R^M as the Fréchet barycenter over the Wasserstein space as follows:

p(x_a) = argmin_p { η d²_W(p_{x_b}, p) + (1 − η) d²_W(p_y, p) },    (6)

where the observation histogram p_y is expressed in the state-space through the observation operator, and the displacement parameter 0 < η < 1 assigns the relative weights to the background and observation terms to capture their respective geodesic distances from the true state. Here H′(·) is the Jacobian of the observation operator, assuming that H : x → y is a smooth and square (i.e., m = n) bijective map. The η is a hyperparameter and its optimal value should be determined empirically using some reference data through cross-validation studies. It is also important to note that, due to the bijective assumption on the observation operator, the above formalism currently lacks the ability to propagate the information content of observed dimensions to unobserved ones. This limitation is further discussed in Section 5.
The solution of the above DA formalism involves finding the optimal analysis transportation plan or joint distribution U_a ∈ R^{M×N}, using Eq. (3), which couples the background and observation marginal histograms. From the joint histogram U_a, we use McCann's method [52,64] to obtain the analysis probability distribution:

p(x_a) = Σ_{i=1}^M Σ_{j=1}^N U_a(i, j) δ_{z_ij},    (8)

where the analysis support points are z_ij = η x_i + (1 − η) y_j. The widely used interior-point methods [2] and Orlin's algorithm [59], which are used to solve Eq. (3), have super-cubic run time with a computational complexity of O(M³ log M), where M = N. This is a limitation in high-dimensional geophysical DA problems that will be addressed in the next subsection.
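Given a coupling U_a, McCann's interpolation places the mass U_a(i, j) at z_ij = η x_i + (1 − η) y_j. A minimal sketch with a hypothetical 2 × 2 coupling:

```python
import numpy as np

# Hypothetical supports and coupling (rows sum to p_x, columns to p_y)
x = np.array([0.0, 1.0])
y = np.array([2.0, 3.0])
U = np.array([[0.4, 0.1],
              [0.2, 0.3]])
eta = 0.3                                  # displacement parameter

# Mass U[i, j] is placed at z_ij = eta * x_i + (1 - eta) * y_j
Z = eta * x[:, None] + (1 - eta) * y[None, :]
support, weights = Z.ravel(), U.ravel()    # analysis histogram p(x_a)

# eta -> 1 recovers the background marginal; eta -> 0 the observations
assert np.isclose(weights.sum(), 1.0)
```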
To solve Eq. (6) in an ensemble setting, let us assume that, in the absence of any a priori information, the background probability distribution is initially represented by i = 1, ..., M ensemble members of the state variable x_i ∈ R^m as p(x_b) = Σ_{i=1}^M (1/M) δ_{x_i}. An a priori assumption is needed to reconstruct the observation distribution p(y) = Σ_{j=1}^N p_{y_j} δ_{y_j} at j = 1, ..., N support points. To that end, one may choose a parametric or a non-parametric model based on past climatological information. Here, for simplicity, we assume a zero-mean Gaussian representation with covariance R ∈ R^{n×n}, similar to the approach suggested in [13], that can be used to perturb the given observation at each assimilation cycle. After each assimilation cycle, the probability histogram of the analysis state p(x_a) is recovered from Eq. (8) over the M × N support points z_ij. Then p(x_a) is resampled at M points using the multinomial sampling scheme [48] to initialize the next time step forecasts.

Entropic Regularization of EnRDA
In order to speed up the computation of the coupling between p_{x_b} and p_y, the problem in Eq. (3) can be regularized [18] as follows:

U_a = argmin_U ⟨C, U⟩ − γ H(U)  subject to  U 1_N = p_{x_b},  U^T 1_M = p_y,    (9)

where γ > 0 is the regularization parameter and H(U) := −⟨U, log U − 1_M 1_N^T⟩ represents the Gibbs-Boltzmann relative entropy function. Note that the relative entropy is a concave function and thus its negative value is convex.
The Lagrangian function L of Eq. (9) can be obtained by adding two dual variables or Lagrangian multipliers q ∈ R^M and r ∈ R^N as follows:

L(U, q, r) = ⟨C, U⟩ − γ H(U) − q^T (U 1_N − p_{x_b}) − r^T (U^T 1_M − p_y).    (10)

Setting the derivative of the Lagrangian function to zero, we have

∂L/∂u_ij = c_ij + γ log u_ij − q_i − r_j = 0.    (11)

The entropic regularization keeps the problem in Eq. (9) strongly convex, and it can be shown [64] that Eq. (11) leads to a unique optimal joint density of the following form:

U_a = diag(v) K diag(w),    (12)

where v = exp(q ⊘ (γ 1_M)) ∈ R^M and w = exp(r ⊘ (γ 1_N)) ∈ R^N are the unknown scaling variables and K ∈ R^{M×N}_+ is the Gibbs kernel associated with the cost matrix C, with elements k_ij = exp(−c_ij/γ). From the mass conservation constraints in Eq. (9) and the scaling form of the optimal joint density in Eq. (12), we can derive diag(v) K w = p_{x_b} and diag(w) K^T v = p_y. The two unknown scaling variables v and w can be iteratively solved for using Sinkhorn's algorithm [18] as follows:

v ← p_{x_b} ⊘ (K w),  w ← p_y ⊘ (K^T v).    (13)

A summary of the EnRDA implementation is demonstrated in Algorithm 1.
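The Sinkhorn scaling iterations can be sketched as follows; the marginal histograms, grid, and regularization value are illustrative, and after the final w-update the column marginal of the coupling is matched by construction:

```python
import numpy as np

def sinkhorn(px, py, C, gamma, n_iter=2000):
    """Entropy-regularized OT coupling via Sinkhorn scaling iterations."""
    K = np.exp(-C / gamma)                # Gibbs kernel k_ij = exp(-c_ij/gamma)
    v = np.ones(len(px))
    w = np.ones(len(py))
    for _ in range(n_iter):
        v = px / (K @ w)                  # enforce row marginal    U 1_N = p_x
        w = py / (K.T @ v)                # enforce column marginal U^T 1_M = p_y
    return np.diag(v) @ K @ np.diag(w)    # U = diag(v) K diag(w)

# Illustrative 1-D Gaussian histograms on a common grid
x = np.linspace(0.0, 1.0, 50)
px = np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2); px /= px.sum()
py = np.exp(-0.5 * ((x - 0.7) / 0.1) ** 2); py /= py.sum()
C = (x[:, None] - x[None, :]) ** 2        # quadratic ground cost

U = sinkhorn(px, py, C, gamma=0.05)       # assumed regularization value
assert np.allclose(U.sum(axis=0), py)     # column marginal matched exactly
```

In practice the iterations are run until the row marginal also matches to a prescribed tolerance; very small γ can underflow the Gibbs kernel, which is why log-domain implementations are often preferred.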
The entropic regularization parameter γ plays an important role in the characterization of the joint density; however, there exists no closed-form solution for its optimal selection. Generally speaking, increasing the value of γ increases the convexity of the cost function and thus the computational efficiency, but at the expense of a weaker coupling between the marginal histograms, consistent with the second law of thermodynamics.
As an example, the effects of γ on the coupling between two Gaussian mixture models p_{x_b} and p_y are demonstrated in Fig. 3. It can be seen that at the smaller value of γ = 0.001, the probability masses of the joint distribution are sparse and lie compactly along the main diagonal, capturing a strong coupling between the background state and observations. However, as the value of γ increases, the probability masses of the joint distribution spread out, reflecting a weaker dependence between the marginals. It is important to note that, in the limiting cases, as γ → 0 the solution of Eq. (9) converges to the true optimal joint histogram, while as γ → ∞ the entropy of the coupling increases and it tends to p_{x_b} p_y^T. Throughout, we empirically choose a minimum value of γ that leads to a stable solution of the Sinkhorn algorithm, assuring sufficient fidelity to the optimal transportation of probability masses according to Eq. (9).

Numerical Experiments and Results
In order to demonstrate the performance of the EnRDA and quantify its effectiveness, we focus on the linear advection-diffusion equation and the chaotic Lorenz-63 model [50]. The advection-diffusion model explains a wide range of heat, mass, and momentum transport across the land, vegetation, and atmospheric continuum, and has been utilized to evaluate the performance of DA methodologies [10,26,37,58,85]. Similarly, the Lorenz-63 model, as a chaotic model of atmospheric convection, has been widely used in testing the performance of DA methodologies [32,53,56,75,76,80].

Algorithm 1 (EnRDA, summary): at each assimilation cycle, (i) obtain the probability histograms of the background state and observations; (ii) solve the entropy-regularized coupling via the Sinkhorn iterations; (iii) recover the analysis distribution p(x_a) by McCann's interpolation; (iv) obtain M analysis ensemble members x_ai ∈ R^m by multinomial sampling from p(x_a); and (v) set x_i^t := x_ai, i = 1, ..., M, to initialize the next forecast cycle.

Throughout, under controlled experimental settings with foreknown model and observation errors, we run the forward models under systematic errors and compare the results of the EnRDA with 3D-Var for advection-diffusion dynamics and with the particle filter and EnKF for the Lorenz-63 system.

State-space Characterization
The advection-diffusion equation is a special case of the Navier-Stokes partial differential equations. In its linear form, with constant diffusivity in an incompressible fluid flow, it is expressed for a mass-conserved physical quantity x(s, t) as follows:

∂x(s, t)/∂t + a^T ∇x(s, t) = ∇ · (D ∇x(s, t)),

where s ∈ R^n represents an n-dimensional spatial domain at time t. In the above expression, a = (a_1, ..., a_n)^T ∈ R^n is the advection velocity vector and D = diag(D_1, ..., D_n) ∈ R^{n×n} represents the diffusivity matrix. Given the initial condition x(s, t = 0), owing to its linearity, the solution at time t can be obtained by convolving the initial condition with a Kronecker delta function δ(s − a t), followed by a convolution with the fundamental Gaussian kernel g(s, t) = (4πt)^{−n/2} det[D]^{−1/2} exp(−s^T D^{−1} s / (4t)).
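The convolution solution can be sketched in one dimension by translating the initial condition by a·t and blurring it with a Gaussian of standard deviation √(2Dt); the grid spacing, velocity, diffusivity, and initial mass below are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, shift

# 1-D grid and a point-mass initial condition (illustrative values)
ds = 0.1                      # grid spacing [L]
s = np.arange(0.0, 50.0, ds)
x0 = np.zeros_like(s)
x0[100] = 800.0

a, D, t = 0.08, 0.02, 25.0    # advection velocity, diffusivity, time

# Translate by a*t (delta-function convolution), then convolve with the
# fundamental Gaussian kernel of standard deviation sqrt(2*D*t)
x_t = shift(x0, a * t / ds, order=1, mode='constant')
x_t = gaussian_filter1d(x_t, sigma=np.sqrt(2.0 * D * t) / ds, mode='constant')

# Total mass is conserved away from the domain boundaries
assert np.isclose(x_t.sum(), x0.sum(), rtol=1e-6)
```

The peak simply moves downstream by a·t while the profile widens diffusively, which is the translation-plus-shape behavior that the Wasserstein metric is designed to penalize.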

Experimental Setup and Results
In terms of both the bias and the unbiased root mean squared error (ubrmse), the EnRDA analysis outperforms its 3D-Var counterpart. The shape of the entire state-space is properly preserved and remains closer to the ground truth. As shown, in the 3D-Var, although the analysis state follows the true state reasonably well for the initial time steps, as the system propagates over time under systematic errors, the analysis state deviates further away from the ground truth. It is important to note that the displacement parameter in EnRDA is largely determined by the bias, while the relative weights in 3D-Var are based solely on the background and observation errors. In particular, under the 3D-Var experiment, the average value of the relative weight assigned to the background state, α = tr(R)/tr(R + B), is around 0.4, while in EnRDA this weight is η = 0.2. Thus, in EnRDA, we favored the unbiased observations more, and the observed improvements in comparison with the 3D-Var might not be a completely fair comparison, because the displacement parameter η can be tuned, for example, based on the mean squared error, which encompasses the effect of bias, whereas no such mechanism is available in 3D-Var. Can EnRDA improve the analysis uncertainty even when α and η are comparable? To resemble a model with systematic errors, the background state is obtained by increasing the advective velocity to 0.12 [L/T] while the diffusivity is reduced to 0.01 [L²/T] (Fig. 5b). Observations are not considered to have position biases; however, a systematic representativeness error is imposed by assuming that the sensing system has a lower resolution than the model. To that end, we evolve two Kronecker delta functions, x(s, t) = 800 δ(s_1, s_2) and x(s, t) = 2400 δ(s_1, s_2), with less mass than the true state, for the same time periods of 25 and 35 [t], and then up-scale the field by a factor of two through box averaging.
As shown in Fig. 5, the EnRDA preserves the shape of the state variable well and gradually moves the mass towards the background state as the value of η increases, while the bias remains almost constant and the ubrmse increases from 0.12 to 0.95. The error quality metrics remain consistently below their 3D-Var counterparts. The shape of the analysis state for small values of α is not well recovered in 3D-Var due to the position bias. As α increases from 0.25 to 0.75, 3D-Var nudges the analysis state towards the background state and begins to recover the shape. The bias is reduced by more than 30%, from 0.15 to 0.05; however, this occurs at the expense of an almost threefold increase in ubrmse, from 0.3 to 1.1. The reason for the reduction of the bias is that the positive differences between the analysis state and the true state are compensated by their negative differences, whereas the ubrmse is quadratic and thus measures the average magnitude of the error irrespective of its sign. We should emphasize that the presented results do not imply that EnRDA is always superior to 3D-Var. Indeed, 3D-Var is a minimum mean squared error estimator and cannot be outperformed by EnRDA in the absence of bias in a state-space with a Gaussian distribution.

Lorenz-63

State-space Characterization
The Lorenz system [50] is derived through truncation of a Fourier series expansion of the Rayleigh-Bénard convection model. This model can be interpreted as a simplistic local weather system involving only the effects of local shear stress and buoyancy forces. The system is expressed using coupled ordinary differential equations that describe the temporal evolution of three coordinates x, y, and z, representing the rate of convective overturn and the horizontal and vertical temperature variations:

dx/dt = σ(y − x),
dy/dt = x(ρ − z) − y,
dz/dt = xy − βz,

where σ represents the Prandtl number, ρ is a normalized Rayleigh number proportional to the difference in the temperature gradient through the depth of the fluid, and β denotes a horizontal wave number of the convective motion. It is well established that for the parameter values σ = 10, ρ = 28, and β = 8/3, the system exhibits chaotic behavior, with the phase space revolving around two unstable stationary points located at (√(β(ρ − 1)), √(β(ρ − 1)), ρ − 1) and (−√(β(ρ − 1)), −√(β(ρ − 1)), ρ − 1).
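The Lorenz-63 equations and a fourth-order Runge-Kutta integration can be sketched as follows; the initial condition and time step match those reported in the experiments below, while the trajectory length is illustrative:

```python
import numpy as np

def lorenz63(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Right-hand side of the coupled Lorenz-63 ODEs
    x, y, z = u
    return np.array([sigma * (y - x),
                     x * (rho - z) - y,
                     x * y - beta * z])

def rk4_step(f, u, dt):
    # One classical fourth-order Runge-Kutta step
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.01
u = np.array([1.508870, -1.531271, 25.46091])   # initial condition
traj = [u]
for _ in range(2000):                            # integrate to t = 20
    u = rk4_step(lorenz63, u, dt)
    traj.append(u)
traj = np.array(traj)
```

Despite its chaotic sensitivity to initial conditions, the trajectory remains bounded on the attractor around the two unstable stationary points.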

Experimental Setup and Results
In this subsection, we demonstrate the results of DA in the Lorenz system under systematic error using EnRDA, the particle filter, and EnKF. Throughout, we use the classic multinomial resampling for the implementation of EnRDA and the particle filter. Apart from the systematic error component, we utilize the standard experimental setting used in numerous DA studies [5,31,53,80]. In order to obtain the ground truth of the model trajectory, the system is initialized at x_0 = (1.508870, −1.531271, 25.46091) and integrated with a time step of ∆t = 0.01 over a time period of T = 0-20 [t] using the fourth-order Runge-Kutta approximation [44,72]. The observations are obtained at every assimilation interval 40∆t by assuming an identity observation operator and perturbing the ground truth with Gaussian noise v^t ∼ N(0, σ²_obs Σ_ρ), where σ²_obs = 2 and the correlation matrix Σ_ρ ∈ R^{3×3} is populated with 1 on the diagonal, 0.5 on the first sub- and super-diagonals, and 0.25 on the second sub- and super-diagonals.
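The correlated observation noise described above can be generated with a Cholesky factor of the scaled correlation matrix; the true-state vector below is an arbitrary placeholder:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma_obs_sq = 2.0

# Correlation matrix from the text: 1 on the diagonal, 0.5 on the first
# and 0.25 on the second sub- and super-diagonals
Sigma_rho = np.array([[1.00, 0.50, 0.25],
                      [0.50, 1.00, 0.50],
                      [0.25, 0.50, 1.00]])

# Sample v ~ N(0, sigma_obs^2 * Sigma_rho) via the Cholesky factor
L = np.linalg.cholesky(sigma_obs_sq * Sigma_rho)
x_true = np.array([1.5, -1.5, 25.5])        # hypothetical true state
y = x_true + L @ rng.standard_normal(3)     # perturbed observation (H = I)
```

Since the correlation matrix decays geometrically away from the diagonal, it is positive definite and the Cholesky factorization always succeeds.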
In order to characterize the distribution of the background state, 100 particles (ensemble members) of the particle filter, EnKF, and EnRDA are generated by corrupting the ground truth at the initial time with a zero-mean Gaussian noise ω^0 ∼ N(0, σ²_0 I_3), where σ²_0 = 2. For introducing systematic errors, the model parameters are set to σ = 10.5, ρ = 27, and β = 10/3. The random errors are also introduced as the system evolves in time by adding a Gaussian noise ω^t ∼ N(0, σ²_b I_3) at every ∆t, with σ²_b = 0.02. Throughout, to draw a robust statistical conclusion about the error statistics, the DA experiments are repeated for 50 independent simulations. As described previously, to properly account for the effects of both bias and ubrmse, the optimal value of the displacement parameter η in EnRDA can be selected based on an offline analysis of the minimum mean squared analysis or forecast error. However, to provide a fair comparison between the EnRDA and the other filtering methods, at each assimilation cycle we set η = tr(R)/tr(R + B), assuming that the observation operator is an identity matrix. Note that while the observation error covariance remains constant in time, the background error covariance is obtained from the ensembles simulated by EnRDA and changes dynamically in time.

Figure 6: Temporal evolution of the true state x_tr of the Lorenz-63, observations y, as well as the analysis state x_a for the particle filter (first column), EnKF (second column), and EnRDA (third column) with 100 particles (ensemble members), respectively. The temporal evolution of the particles and ensemble members is shown with solid gray lines. Also shown, within dashed rectangles, are the windows of time over which the support sets of observations and the particle spread are disjoint in the particle filter, and over which EnKF and EnRDA deviate from the ground truth.
This selection assures that the relative weights assigned to the background state and observations remain of the same order of magnitude among the different methods. Fig. 6 shows the temporal evolution of the ground truth and the analysis state by the particle filter (first column), EnKF (second column), and EnRDA (third column) over a time period of 0 to 15 [t] for one simulation. As is evident, the particle filter is well capable of capturing the ground truth when the observations lie within the particle spread. However, when the observations lie far from the support set of the particles (Fig. 6, dashed box) and the pdfs of the background state and observations become disjoint, the filter becomes degenerate and the analysis state (particle mean) deviates away from the ground truth. It is worth noting that, due to the systematic error, the particles in the z-coordinate lie away from the observations and the trajectory fluctuates around the mean of the ground truth (Fig. 6g, dashed box). As a result, the bias of the particle filter along the z-dimension is markedly lower than that of the EnKF and the EnRDA, while its ubrmse is significantly higher. In contrast, both EnKF and EnRDA are capable of capturing the true state well even when the ensemble spread and observations are far apart from each other. Although EnKF does not suffer from the same filter degeneracy as the particle filter, in earlier time steps from 2.5 to 7.5 [t] it struggles to adequately nudge the analysis state towards the ground truth when the ensemble members are far from the observations due to the imposed systematic bias. EnRDA appears to be robust to the propagation of systematic biases in this region and follows the true trajectory well.
The time evolution of the mean values of the bias and ubrmse over 50 independent simulations, with the same error structure, is color-coded over the phase space in Fig. 7. As is evident, these forecast quality metrics are relatively lower for the EnRDA than for the EnKF and the particle filter throughout the simulation period. Nevertheless, the improvement compared to the EnKF is modest. In particular, across all dimensions of the problem, the mean bias and ubrmse are decreased in EnRDA by 68 (13)% and 53 (27)% relative to the particle filter (EnKF), respectively.

Discussion and Concluding Remarks
In this study, we introduced an ensemble data assimilation (DA) methodology over a Riemannian manifold, namely the Ensemble Riemannian DA (EnRDA), and illustrated its performance in comparison with a few Euclidean DA techniques for dissipative and chaotic dynamics. We demonstrated that the presented methodology is capable of assimilating information in the probability domain, characterized by families of distributions with finite second-order moments. The key message is that when the probability distributions of the forecast and observations exhibit non-Gaussian structure and their support sets are disjoint due to the presence of systematic errors, the Wasserstein metric can be leveraged to potentially extend geophysical forecast skills. Nevertheless, future research on a comprehensive comparison with existing filtering and bias-correction methodologies is needed to completely characterize the relative pros and cons of the proposed approach, especially when it comes to the ensemble size and the optimal selection of the displacement parameter η.
We explained the roles of the regularization and displacement parameters in EnRDA and empirically examined their effects on the optimal joint histogram coupling the background state and observations, and consequently on the analysis state. Nevertheless, future studies are required to derive closed-form or heuristic expressions to expand our understanding of their impacts on the forecast uncertainty. As explained earlier, unlike the Euclidean DA methodologies, which assimilate the available information using different relative weights across multiple dimensions through the error covariance matrices, a scalar displacement parameter is utilized in the EnRDA that interpolates uniformly across all dimensions of the problem. Future research can be devoted to developing a framework that utilizes a vector representation of the displacement parameters to effectively tackle possible heterogeneity of uncertainty across multiple dimensions.
In its current form, the EnRDA requires the observation operator to be smooth and bijective. This is a limitation when observations of all problem dimensions are not available and propagation of observational information to unobserved dimensions is desired. Extending the EnRDA methodology to partially observed systems is therefore an important future research area. This could include performing a rough offline inversion for the unobserved components of the system or extending the methodology in the direction of particle flows [81].
Lastly, we should mention that the EnRDA is computationally expensive, as it involves estimating the coupling through the Wasserstein distance. On a desktop machine with a 3.4 GHz CPU clock rate, it took around 1600 s to complete 50 independent simulations of Lorenz-63 for the EnRDA, compared to 651 (590) s for the particle filter (EnKF) with 100 particles (ensemble members). Since the computational cost is nonlinearly related to the problem dimension, it is expected to grow significantly for large-scale geophysical DA and become a limiting factor. Although the entropic regularization works well for the presented low-dimensional problems, future research is needed to test its efficiency in high-dimensional problems. Constraining the solution of the coupling to a submanifold of probability distributions with a Gaussian mixture structure [17] can be a future research direction for lowering the computational cost.