Ensemble Riemannian data assimilation over the Wasserstein space

In this paper, we present an ensemble data assimilation paradigm over a Riemannian manifold equipped with the Wasserstein metric. Unlike the Euclidean distance used in classic data assimilation methodologies, the Wasserstein metric can capture the translation and the difference between the shapes of square-integrable probability distributions of the background state and observations. This enables us to formally penalize geophysical biases in state space with non-Gaussian distributions. The new approach is applied to dissipative and chaotic evolutionary dynamics, and its potential advantages and limitations are highlighted compared to the classic ensemble data assimilation approaches under systematic errors.

which the notion of distance between probability distributions is geodesic. Such a distance metric shall be Lagrangian to not only capture translation but also the difference between the entire shape of probability distributions (Pennec, 2006). How can we equip DA with a Riemannian geometry? To answer this question, inspired by the theories of optimal mass transport (Villani, 2003), this paper presents the Ensemble Riemannian Data Assimilation (EnRDA) framework using the Wasserstein distance metric.

In recent years, a few attempts have been made to utilize the Wasserstein metric in geophysical DA. Reich (2013) introduced an ensemble transform particle filter, where the optimal transport framework was utilized to guide the resampling phase of the filter. Ning et al. (2014) used the Wasserstein distance to reduce forecast uncertainty due to parameter estimation errors in dissipative evolutionary equations. Feyeux et al. (2018) suggested a novel approach employing the Wasserstein distance in lieu of the Euclidean distance to penalize the position error between state and observations. More recently, Tamang et al. (2020) introduced a Wasserstein regularization in a variational setting to correct for geophysical biases under chaotic dynamics.
The EnRDA extends the previous work through the following main contributions: (a) EnRDA defines DA as a discrete barycenter problem over the Wasserstein space for assimilation in the probability domain without any parametric or Gaussian assumption. The framework provides a continuum of non-parametric analysis probability histograms that naturally span between the distributions of the background state and observations through optimal transport of probability masses. (b) EnRDA operates in an ensemble setting using the entropic regularization by utilizing the Sinkhorn algorithm (Cuturi, 2013).

The organization of the paper is as follows: Section 2 provides a brief background on Bayesian DA formulations and optimal mass transport. The mathematical formalism of the EnRDA is described in Section 3. Section 4 presents the results and compares them with their Euclidean counterparts. Section 5 discusses the findings and ideas for future research.

Throughout, 1_m denotes an m-element vector of ones and I_m is an m × m identity matrix. A diagonal matrix with entries given by x ∈ R^m is represented by diag(x) ∈ R^{m×m}. Notation x ∼ N(μ, Σ) denotes that the random vector x is drawn from a Gaussian distribution with mean μ and covariance Σ, and E_X(x) is the expectation of x. The ℓ_q-norm of x is defined as ‖x‖_q = (Σ_{i=1}^m |x_i|^q)^{1/q} with q > 0, and the square of the weighted ℓ_2-norm is ‖x‖_A^2 = x^T A x for a positive definite matrix A. Notations x ⊙ y and x ⊘ y represent the element-wise Hadamard product and division between equal-length vectors x and y, respectively. Notation ⟨A, B⟩ = tr(A^T B) denotes the Frobenius inner product between matrices A and B, and tr(·) and det[·] represent the trace and determinant of a square matrix, respectively. A discrete probability distribution is represented by a histogram {p_x ∈ R^M_+ : Σ_i p_{x_i} = 1} supported on points x_i, where δ_{x_i} represents a Kronecker delta function at x_i. Throughout, the dimension of the state or observations is denoted by small letters such as x ∈ R^m, while the number of ensemble members or support points of their respective probability distribution is shown by capital letters such as p_x ∈ R^M_+.

Data Assimilation on Euclidean Space
In this section, we provide a brief review of the derivation of classic variational DA and particle filters based on Bayes' theorem to set the stage for the presented Ensemble Riemannian DA formalism.

Variational Formulation
Let us consider a discrete-time Markovian dynamics and its observations as follows:

x_t = M(x_{t-1}) + ω_t,
y_t = H(x_t) + v_t,    (1)

where x_t ∈ R^m and y_t ∈ R^n represent the state variables and the observations at time t, M : R^m → R^m and H : R^m → R^n are the deterministic forward model and observation operator, and ω_t ∈ R^m and v_t ∈ R^n are the independent and identically distributed model and observation errors, respectively.
Recalling Bayes' theorem and dropping the time superscript without loss of generality, the posterior probability density function (pdf) of the state given the observation can be obtained as p(x|y) ∝ p(y|x) p(x)/p(y), where p(y|x) is proportional to the likelihood function, p(x) is the prior density, and p(y) denotes the distribution of observations. Letting x_b = E_X(x) ∈ R^m represent the background state, ignoring the constant term log p(y), and assuming Gaussian distributions for the observation error and the prior, the logarithm of the posterior density leads to the known three-dimensional variational (3D-Var) cost function (Lorenc, 1986):

J(x) = 1/2 (x - x_b)^T B^{-1} (x - x_b) + 1/2 (y - H(x))^T R^{-1} (y - H(x)),    (2)

where B ∈ R^{m×m} and R ∈ R^{n×n} are the background and observation error covariance matrices, respectively.

As a result, the analysis state obtained by minimization of the 3D-Var cost function in Eq. (2) is the mode of the posterior distribution, which coincides with the posterior mean when errors are drawn from unbiased Gaussian densities and H is a linear operator. Using the Woodbury matrix inversion lemma (Woodbury, 1950), it can be easily demonstrated that for a linear observation operator, the analysis states in the 3D-Var and the Kalman filter are equivalent (Tarantola, 1987). As is evident, zero-mean Gaussian assumptions lead to penalization of the error through the weighted Euclidean norm.
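This equivalence can be verified numerically. The following Python sketch (toy dimensions and covariances, all hypothetical; the paper's demo code is in MATLAB) minimizes the 3D-Var cost in Eq. (2) for a linear observation operator and checks that the result matches the Kalman analysis:

```python
import numpy as np

# Toy 3D-Var vs. Kalman analysis for a linear observation operator H.
m, n = 3, 2
B = np.diag([1.0, 2.0, 0.5])          # background error covariance (hypothetical)
R = 0.25 * np.eye(n)                  # observation error covariance (hypothetical)
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])       # linear observation operator
xb = np.array([1.0, -1.0, 0.5])       # background state
y = np.array([1.4, -0.2])             # observation

# Direct minimization of Eq. (2): grad J = B^{-1}(x - xb) - H^T R^{-1}(y - Hx) = 0
A = np.linalg.inv(B) + H.T @ np.linalg.inv(R) @ H
b = np.linalg.inv(B) @ xb + H.T @ np.linalg.inv(R) @ y
xa_var = np.linalg.solve(A, b)

# Kalman gain form (equivalent via the Woodbury identity)
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa_kf = xb + K @ (y - H @ xb)

assert np.allclose(xa_var, xa_kf)
```

Both routes yield the same analysis state, illustrating the weighted-Euclidean nature of the update.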

Particle Filters
Particle filters (Gordon et al., 1993; Doucet and Johansen, 2009; Van Leeuwen et al., 2019) in DA were introduced to address the issue of non-Gaussian distribution of the state by representing the prior and posterior distributions through a weighted ensemble of model outputs referred to as "particles". In its standard discrete setting, using Monte Carlo simulations, the prior distribution p(x) is represented by a sum of equal-weight Kronecker delta functions as p(x) ≈ (1/M) Σ_{i=1}^M δ_{x_i}(x), where x_i is the state represented by the i-th particle. Each of these M particles is then evolved through the nonlinear model in Eq. (1). Assuming a Gaussian likelihood and using Bayes' theorem, it can be shown that the posterior distribution p(x|y) can be approximated using a set of weighted particles as p(x|y) ≈ Σ_{i=1}^M w_i δ_{x_i}(x), where the weights w_i ∝ p(y|x_i) are normalized to sum to one. The particles are then resampled from the posterior distribution p(x|y) based on their relative weights and propagated forward in time according to the model dynamics.
As is evident, in particle filters, the weight of each particle is updated using the Gaussian likelihood function under a zero-mean error assumption. However, in the presence of systematic biases, when the support sets of the particles and the observations are disjoint, only the weights of a few particles become significantly large and the weights of the other particles tend to zero. As the underlying dynamical system progresses in time, only those few particles with relatively larger weights are resampled, and the filter can gradually become degenerate (Poterjoy and Anderson, 2016).
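The weight-update and multinomial-resampling cycle described above can be sketched in a few lines of Python (a scalar toy example with hypothetical numbers, not the experimental configuration used later in the paper):

```python
import numpy as np

# One bootstrap particle-filter analysis step with Gaussian likelihood
# and multinomial resampling (toy scalar state, hypothetical numbers).
rng = np.random.default_rng(1)
M = 500
particles = rng.normal(0.0, 2.0, size=M)   # prior ensemble, equal weights
y, r2 = 1.0, 0.5                            # observation and its error variance

# Weight update: w_i proportional to p(y | x_i) under zero-mean Gaussian error
w = np.exp(-0.5 * (y - particles) ** 2 / r2)
w /= w.sum()

# Multinomial resampling: draw M indices with probabilities w
idx = rng.choice(M, size=M, p=w)
posterior = particles[idx]

# The resampled ensemble mean moves from the prior mean toward the observation
assert abs(posterior.mean() - y) < abs(particles.mean() - y)
```

Degeneracy occurs precisely when the exponent in the weight computation is large and negative for nearly all particles, so that a handful of weights dominate the resampling step.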

Optimal Mass Transport
The theory of optimal mass transport (OMT), pioneered by Gaspard Monge (Monge, 1781) and later extended by Kantorovich (Kantorovich, 1942), was developed to minimize transportation cost in resource allocation problems with purely practical motivations. Recent developments in mathematics discovered that OMT provides a rich ground to compare and morph probability distributions, and uncovered new connections to partial differential equations (Jordan et al., 1998; Otto, 2001) and functional analysis (Brenier, 1987; Benamou and Brenier, 2000; Villani, 2003).
In a discrete setting, let us define two discrete probability distributions p(x) = Σ_{i=1}^M p_{x_i} δ_{x_i} and p(y) = Σ_{j=1}^N p_{y_j} δ_{y_j} with their respective histograms {p_x ∈ R^M_+ : Σ_i p_{x_i} = 1} and {p_y ∈ R^N_+ : Σ_j p_{y_j} = 1} supported on x_i and y_j. A "ground" transportation cost matrix C ∈ R^{M×N}_+ is defined such that its elements c_ij = ‖x_i - y_j‖_q^q represent the cost of transporting unit probability masses from location x_i to y_j. The Kantorovich OMT problem determines an optimal "transportation plan" U_a ∈ R^{M×N}_+ that can linearly map two probability measures onto each other with the minimum amount of total transportation cost as follows:

U_a = argmin_{U ∈ R^{M×N}_+} ⟨C, U⟩  subject to  U 1_N = p_x,  U^T 1_M = p_y.    (3)

[Figure 1. Interpolation between two Gaussian distributions N(μ1, σ1²) and N(μ2, σ2²), where μ1 = -1.1, μ2 = 1.4, σ1² = 0.4, and σ2² = 0.01, as a function of the displacement parameter η ∈ [0, 1] for the (a) Hellinger distance, (b) Kullback-Leibler divergence, and (c) 2-Wasserstein distance.]
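For one-dimensional supports and a convex ground cost such as q = 2, the Kantorovich problem in Eq. (3) admits a simple solution: the monotone (north-west corner) coupling of the sorted supports. The following Python sketch (toy supports and histograms, all hypothetical) builds that plan and evaluates the transport cost:

```python
import numpy as np

# Monotone coupling solves the 1-D Kantorovich problem for convex costs.
x = np.array([0.0, 1.0, 2.0])        # support of p_x, sorted (hypothetical)
y = np.array([0.5, 2.5])             # support of p_y, sorted (hypothetical)
p_x = np.array([0.5, 0.3, 0.2])
p_y = np.array([0.6, 0.4])

M, N = len(x), len(y)
U = np.zeros((M, N))                 # optimal transportation plan
a, b = p_x.copy(), p_y.copy()
i = j = 0
while i < M and j < N:
    m_ij = min(a[i], b[j])           # move as much mass as both marginals allow
    U[i, j] = m_ij
    a[i] -= m_ij
    b[j] -= m_ij
    if a[i] <= 1e-12:
        i += 1
    if b[j] <= 1e-12:
        j += 1

C = (x[:, None] - y[None, :]) ** 2   # ground cost with q = 2
w2_squared = (C * U).sum()           # squared 2-Wasserstein distance <C, U>

assert np.allclose(U.sum(axis=1), p_x) and np.allclose(U.sum(axis=0), p_y)
```

The marginal checks at the end confirm that U is a feasible plan; its inner product with C gives the squared 2-Wasserstein distance discussed next.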
The transportation plan can be interpreted as a "joint distribution" that couples the marginal histograms p_x and p_y. For the transportation cost with q = 2, the OMT problem in Eq. (3) is convex and defines the square of the 2-Wasserstein distance between the distributions as d_W^2(p_x, p_y) = ⟨C, U_a⟩. What is the advantage of the Wasserstein distance for interpolating between probability distributions compared to other measures of proximity, such as the Hellinger distance (Hellinger, 1909) or the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951)? To elaborate on this question, we confine our consideration to Gaussian densities, over which the Wasserstein distance can be represented in closed form. In particular, interpolating over the 2-Wasserstein space with parameter η ∈ [0, 1] between two Gaussian densities N(μ1, σ1²) and N(μ2, σ2²) yields a Gaussian interpolant with mean η μ1 + (1 - η) μ2 and standard deviation η σ1 + (1 - η) σ2 (Chen et al., 2019b).
Fig. 1 shows the spectrum of interpolated distributions between two Gaussian pdfs for a range of the interpolation parameter η ∈ [0, 1]. As shown, the interpolated densities using the Hellinger distance, which is Euclidean in the space of probability measures, are bimodal. Although the Gaussian shape of the interpolated densities using the KL divergence is preserved, the variance of the interpolants is not necessarily bounded by the variances of the input Gaussian densities. Unlike these metrics, as shown, the Wasserstein distance moves the mean and preserves the shape of the interpolants through a natural morphing process. As previously noted, this metric is not limited to any Gaussian assumption. Fig. 2 shows the 2-Wasserstein interpolation between a gamma and a Gaussian distribution. The results show the Lagrangian nature of the Wasserstein metric, which penalizes the translation and mismatch between the shapes of the pdfs. It can be shown that d_W^2(p_x, p_y) = d_W^2(p̄_x, p̄_y) + ‖μ_x - μ_y‖_2^2, where p̄_x and p̄_y are the centered zero-mean probability masses and μ_x and μ_y are the respective mean values.

Problem Formulation
First, let us recall that the weighted mean of a cloud of points x_i ∈ R^m, for a given family of non-negative weights {η_i : Σ_i η_i = 1}, is x̄ = Σ_i η_i x_i. This expected value is equivalent to solving the following variational problem:

x̄ = argmin_x Σ_i η_i ‖x - x_i‖_2^2.    (4)

Thus, the 3D-Var problem in Eq. (2), after Cholesky decomposition (Nash, 1990) of the error covariance matrices and rearrangement of the terms, can be interpreted as a "barycenter problem" in the Euclidean space, where the analysis state is the weighted mean of the background state and observation.

By changing the distance metric from the Euclidean to the Wasserstein (Agueh and Carlier, 2011), a Riemannian barycenter can be defined as the Fréchet mean (Fréchet, 1948) of N_p probability histograms p_i with finite second-order moments as follows:

p̄ = argmin_p Σ_{i=1}^{N_p} η_i d_W^2(p, p_i),  η_i ≥ 0,  Σ_i η_i = 1.    (5)

Inspired by Feyeux et al. (2018), the EnRDA defines the probability distribution of the analysis state p(x_a) ∈ R^M_+ as the Fréchet barycenter over the Wasserstein space as follows:

p(x_a) = argmin_p { η d_W^2(p_{x_b}, p) + (1 - η) d_W^2(p_y, H'(p)) },    (6)

where the displacement parameter η > 0 assigns the relative weights to the observation and background terms to capture their respective geodesic distances from the true state. Here, H'(·) is the Jacobian of the observation operator, assuming that H : x → y is a smooth and square (i.e., m = n) bijective map. The parameter η is a hyperparameter, and its optimal value should be determined empirically using some reference data through cross-validation studies. It is also important to note that, due to the bijective assumption for the observation operator, the above formalism currently lacks the ability to propagate the information content of observed dimensions to unobserved ones. This limitation is further discussed in Section 5.
The solution of the above DA formalism involves finding the optimal analysis transportation plan, or joint distribution, U_a ∈ R^{M×N}, using Eq. (3), which couples the background and observation marginal histograms. From the joint histogram U_a, we use McCann's method (McCann, 1997) to obtain the analysis probability distribution:

p(x_a) = Σ_{i=1}^M Σ_{j=1}^N u_{a,ij} δ_{z_ij},    (7)

where the analysis support points are z_ij = η x_i + (1 - η) y_j and u_{a,ij} denotes the ij-th element of U_a.

To solve Eq. (6) in an ensemble setting, let us assume that, in the absence of any a priori information, the background probability distribution is initially represented by i = 1, ..., M equal-weight ensemble members. An a priori assumption is needed to reconstruct the observation distribution p(y) = Σ_{j=1}^N p_{y_j} δ_{y_j} at j = 1, ..., N support points. To that end, one may choose a parametric or a non-parametric model based on past climatological information. Here, for simplicity, we assume a zero-mean Gaussian representation with covariance R ∈ R^{n×n}, similar to the approach suggested in (Burgers et al., 1998), that can be used to perturb the given observation at each assimilation cycle. After each assimilation cycle, the probability histogram of the analysis state p(x_a) is recovered from Eq. (8) over z_ij at M × N support points. Then p(x_a) is resampled at M points using the multinomial sampling scheme (Li et al., 2015) to initialize the next time step forecasts.

Entropic Regularization of EnRDA
In order to speed up the computation of the coupling between p_{x_b} and p_y, the problem in Eq. (3) can be regularized (Cuturi, 2013) as follows:

U_a = argmin_{U ∈ R^{M×N}_+} ⟨C, U⟩ - γ H(U)  subject to  U 1_N = p_{x_b},  U^T 1_M = p_y,    (8)

where γ > 0 is the regularization parameter and H(U) := -⟨U, log U - 1_M 1_N^T⟩ represents the Gibbs-Boltzmann relative entropy function. Note that the relative entropy is a concave function, and thus its negative value is convex.

The Lagrangian function L of Eq. (8) can be obtained by adding two dual variables, or Lagrange multipliers, q ∈ R^M and r ∈ R^N, as follows:

L(U, q, r) = ⟨C, U⟩ - γ H(U) + q^T (p_{x_b} - U 1_N) + r^T (p_y - U^T 1_M).    (9)

Setting the derivative of the Lagrangian function to zero, we have

∂L/∂u_ij = c_ij + γ log u_ij - q_i - r_j = 0.    (10)

The convexity of the entropic regularization keeps the problem in Eq. (8) strongly convex, and it can be shown that Eq. (10) leads to a unique optimal joint density with the following form:

U_a = diag(v) K diag(w),    (11)

where v = exp(q ⊘ (γ 1_M)) ∈ R^M and w = exp(r ⊘ (γ 1_N)) ∈ R^N are the unknown scaling variables, and K ∈ R^{M×N} is the Gibbs kernel associated with the cost matrix C, with elements k_ij = exp(-c_ij/γ).

From the mass conservation constraints in Eq. (8) and the scaling form of the optimal joint density in Eq. (11), we can derive

v ⊙ (K w) = p_{x_b},  w ⊙ (K^T v) = p_y.

The two unknown scaling variables v and w in Eq. (11) can then be iteratively solved using Sinkhorn's algorithm (Cuturi, 2013) as follows:

v^{(l+1)} = p_{x_b} ⊘ (K w^{(l)}),  w^{(l+1)} = p_y ⊘ (K^T v^{(l+1)}),    (12)

where l denotes the iteration index. A summary of the EnRDA implementation is demonstrated in Algorithm 1.
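Sinkhorn's algorithm amounts to alternately rescaling v and w so that diag(v) K diag(w) matches the prescribed marginals. A minimal Python sketch on a toy pair of 1-D histograms (supports, weights, and γ all hypothetical):

```python
import numpy as np

# Sinkhorn iterations: v = p_xb / (K w), w = p_y / (K^T v), repeated
# until the plan diag(v) K diag(w) reproduces both marginals.
x = np.linspace(-2.0, 2.0, 40)                      # common support grid
p_xb = np.exp(-0.5 * (x + 1.0) ** 2); p_xb /= p_xb.sum()
p_y = np.exp(-0.5 * (x - 1.0) ** 2 / 0.25); p_y /= p_y.sum()

gamma = 0.1                                         # regularization parameter
C = (x[:, None] - x[None, :]) ** 2                  # ground cost, q = 2
K = np.exp(-C / gamma)                              # Gibbs kernel exp(-c_ij / gamma)

v = np.ones_like(p_xb)
w = np.ones_like(p_y)
for _ in range(20000):
    v = p_xb / (K @ w)                              # enforce row marginal
    w = p_y / (K.T @ v)                             # enforce column marginal

U = np.diag(v) @ K @ np.diag(w)                     # regularized optimal plan

assert np.allclose(U.sum(axis=0), p_y)              # exact after the last w update
assert np.allclose(U.sum(axis=1), p_xb, atol=1e-4)  # converged row marginal
```

Each iteration costs only two matrix-vector products, which is the source of the speed-up over generic linear-programming solvers for Eq. (3).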
The entropic regularization parameter γ plays an important role in the characterization of the joint density; however, there exists no closed-form solution for its optimal selection. Generally speaking, increasing the value of γ will increase the convexity of the cost function and thus the computational efficiency; however, this comes at the expense of reduced coupling between the marginal histograms, consistent with the second law of thermodynamics.
As an example, the effects of γ on the coupling between two Gaussian mixture models p_{x_b} and p_y are demonstrated in Fig. 3. It can be seen that at a small value of γ = 0.001, the probability masses of the joint distribution are sparse and lie compactly along the main diagonal, capturing a strong coupling between the background state and observations. However, as the value of γ increases, the probability masses of the joint distribution spread out, reflecting a lower degree of dependence between the marginals.
It is important to note that, in limiting cases, as γ → 0 the solution of Eq. (8) converges to the true optimal joint histogram, while as γ → ∞ the entropy of the analysis state increases and U_a tends to p_{x_b} p_y^T. Throughout, we empirically choose a minimum value for γ that leads to a stable solution by the Sinkhorn algorithm, assuring sufficient fidelity to the optimal transportation of probability masses according to Eq. (8).

[Algorithm 1. Summary of the EnRDA implementation: at the initial time, obtain the probability histograms of the background state p(x_b) = Σ_i p_{x_i} δ_{x_i^t} and observations p(y) = Σ_j p_{y_j} δ_{y_j^t}; compute the joint histogram U_a via the Sinkhorn iterations; form the analysis histogram p(x_a) over the support points z_ij; and obtain M analysis ensemble members x_{a_i} ∈ R^m by multinomial sampling from p(x_a).]
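The steps of Algorithm 1 can be sketched end to end for a scalar toy problem in Python (ensembles, displacement parameter, and γ are all hypothetical illustration choices, with an identity observation operator; this is our sketch, not the authors' MATLAB implementation):

```python
import numpy as np

# One EnRDA analysis cycle: Sinkhorn coupling of background and observation
# histograms, McCann support points, then multinomial resampling.
rng = np.random.default_rng(2)

def sinkhorn_plan(p, q, C, gamma=0.2, iters=3000):
    """Entropically regularized transportation plan diag(v) K diag(w)."""
    K = np.exp(-C / gamma)
    v, w = np.ones_like(p), np.ones_like(q)
    for _ in range(iters):
        v = p / (K @ w)
        w = q / (K.T @ v)
    return v[:, None] * K * w[None, :]

M = N = 200
xb = rng.normal(-1.0, 1.0, size=M)            # background ensemble (biased low)
yo = 1.0 + rng.normal(0.0, 0.5, size=N)       # perturbed observations
p_x = np.full(M, 1.0 / M)                     # equal-weight histograms
p_y = np.full(N, 1.0 / N)

C = (xb[:, None] - yo[None, :]) ** 2
U = sinkhorn_plan(p_x, p_y, C)

eta = 0.3                                     # displacement parameter (toy value)
Z = eta * xb[:, None] + (1 - eta) * yo[None, :]   # support points z_ij

# Multinomial resampling of M analysis members from the M*N-point histogram
idx = rng.choice(M * N, size=M, p=U.ravel() / U.sum())
xa = Z.ravel()[idx]

# The analysis mean interpolates between the background and observation means
assert xb.mean() < xa.mean() < yo.mean()
```

The resampled members xa would then initialize the next forecast step, mirroring the cycle described in the text.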
In order to demonstrate the performance of the EnRDA and quantify its effectiveness, we focus on the linear advection-diffusion equation and the chaotic Lorenz-63 model (Lorenz, 1963). The advection-diffusion model explains a wide range of heat, mass, and momentum transport across the land, vegetation, and atmospheric continuum, and has been utilized to evaluate the performance of DA methodologies (Zhang et al., 1997; Hurkmans et al., 2006; Ning et al., 2014; Ebtehaj et al., 2014; Berardi et al., 2016).
Similarly, the Lorenz-63 model, as a chaotic model of atmospheric convection, has been widely used in testing the performance of DA methodologies (Miller et al., 1994; Nakano et al., 2007; Van Leeuwen, 2010; Goodliff et al., 2015; Tandeo et al., 2015; Tamang et al., 2020). Throughout, under controlled experimental settings with foreknown model and observation errors, we run the forward models under systematic errors and compare the results of the EnRDA with 3D-Var for the advection-diffusion dynamics, and with the particle filter and EnKF for the Lorenz-63 system.

Advection-Diffusion Equation

The advection-diffusion equation is a special case of the Navier-Stokes partial differential equations. In its linear form, with constant diffusivity in an incompressible fluid flow, it is expressed for a mass-conserved physical quantity x(s, t) as follows:

∂x(s, t)/∂t = -a^T ∇x(s, t) + ∇ · (D ∇x(s, t)),

where s ∈ R^n represents an n-dimensional spatial domain at time t. In the above expression, a = (a_1, ..., a_n)^T ∈ R^n is the advection velocity vector and D = diag(D_1, ..., D_n) ∈ R^{n×n} represents the diffusivity matrix. Given the initial condition x(s, t = 0), owing to its linearity, the solution at time t can be obtained by convolving the initial condition with a Kronecker delta function δ(s - a t), followed by a convolution with the fundamental Gaussian kernel G(s, t) = (2π)^{-n/2} |Σ|^{-1/2} exp(-(1/2) s^T Σ^{-1} s), where Σ = 2 D t.
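The analytic solution above can be checked numerically in one dimension: for a Dirac initial condition, advection translates the profile by a·t while diffusion spreads it with variance 2Dt. A short Python sketch (a, D, and t are hypothetical toy values, not the experimental settings used below):

```python
import numpy as np

# 1-D analytic solution for a unit Dirac initial condition:
# a translated Gaussian with variance Sigma = 2 D t.
a, D, t = 1.5, 0.2, 4.0
s = np.linspace(-10.0, 20.0, 4001)
sigma2 = 2.0 * D * t                                  # Sigma = 2 D t
x = np.exp(-0.5 * (s - a * t) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

ds = s[1] - s[0]
mass = x.sum() * ds                                   # conserved total mass
mean = (s * x).sum() * ds                             # center of mass = a t
var = ((s - mean) ** 2 * x).sum() * ds                # spread = 2 D t

assert abs(mass - 1.0) < 1e-6                         # mass conservation
assert abs(mean - a * t) < 1e-6                       # advection shift
assert abs(var - sigma2) < 1e-4                       # diffusive spreading
```

The three checks correspond exactly to the convolution argument in the text: mass is conserved, the delta convolution shifts the profile by a·t, and the Gaussian convolution sets the variance to 2Dt.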

The shape of the entire state space is properly preserved and remains closer to the ground truth. As shown, in the 3D-Var, although the analysis state follows the true state reasonably well for the initial time steps, as the system propagates over time under systematic errors, the analysis state deviates further away from the ground truth. It is important to note that the displacement parameter in EnRDA is largely determined by the bias, while the relative weights in 3D-Var are solely based on the background and observation error covariance matrices (Fig. 5b). Observations are not considered to have position biases; however, a systematic representativeness error is imposed by assuming that the sensing system has a lower resolution than the model. To that end, we evolve two Kronecker delta functions, x(s, t) = 800 δ(s_1, s_2) and x(s, t) = 2400 δ(s_1, s_2), with less mass than the true state, for the same time periods of 25 and 35 [t], and then up-scale the field by a factor of two through box averaging.
As shown in Fig. 5, the EnRDA preserves the shape of the state variable well and gradually moves the mass towards the background state as the value of η increases, while the bias remains almost constant and the ubrmse increases from 0.12 to 0.95. The error quality metrics are consistently below their 3D-Var counterparts. The shape of the analysis state for small values of α is not well recovered in 3D-Var due to the position bias. As α increases from 0.25 to 0.75, 3D-Var nudges the analysis state towards the background state and begins to recover the shape. The bias is reduced by more than 30%, from 0.15 to 0.05; however, this occurs at the expense of an almost threefold increase in ubrmse, from 0.3 to 1.1. The reason for the reduction of the bias is that the positive differences between the analysis state and the true state are compensated by their negative differences. However, the ubrmse is quadratic and thus measures the average magnitude of the error irrespective of its sign. We should emphasize that the presented results do not imply that EnRDA is always superior to 3D-Var. Indeed, 3D-Var is a minimum mean squared error estimator and cannot be outperformed by EnRDA in the absence of bias in a state space with a Gaussian distribution.

State-space Characterization
The Lorenz system (Lorenz-63; Lorenz, 1963) is derived through truncation of the Fourier series of the Rayleigh-Bénard convection model. This model can be interpreted as a simplistic local weather system involving only the effects of local shear stress and buoyancy forces. The system is expressed using coupled ordinary differential equations that describe the temporal evolution of three coordinates x, y, and z, representing the rate of convective overturn and the horizontal and vertical temperature variations, as:

dx/dt = σ (y - x),
dy/dt = x (ρ - z) - y,
dz/dt = x y - β z,

where σ represents the Prandtl number, ρ is a normalized Rayleigh number proportional to the difference in temperature gradient through the depth of the fluid, and β denotes a horizontal wave number of the convective motion. It is well established that for parameter values of σ = 10, ρ = 28, and β = 8/3, the system exhibits chaotic behavior, with the phase space revolving around two unstable stationary points located at (√(β(ρ - 1)), √(β(ρ - 1)), ρ - 1) and (-√(β(ρ - 1)), -√(β(ρ - 1)), ρ - 1).
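The Lorenz-63 system above can be integrated with the fourth-order Runge-Kutta scheme used in the experiments below. A self-contained Python sketch with the standard chaotic parameters (the step count and bound check are our illustration choices):

```python
import numpy as np

# Lorenz-63 right-hand side with standard chaotic parameters.
def lorenz63(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = u
    return np.array([sigma * (y - x),
                     x * (rho - z) - y,
                     x * y - beta * z])

# One fourth-order Runge-Kutta step.
def rk4_step(f, u, dt):
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.01
u = np.array([1.508870, -1.531271, 25.46091])   # initial state used in the text
traj = [u]
for _ in range(2000):                            # integrate to t = 20 [t]
    u = rk4_step(lorenz63, u, dt)
    traj.append(u)
traj = np.array(traj)

# The trajectory remains on the bounded attractor around the two fixed points
assert np.all(np.abs(traj) < 100.0)
```

Perturbing the parameters (e.g., to σ = 10.5, ρ = 27, β = 10/3 as in the systematic-error experiments below) yields a model trajectory that drifts away from this truth, which is what the assimilation methods must correct.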

In this subsection, we demonstrate the results of DA in the Lorenz system under systematic error using EnRDA, the particle filter, and the EnKF. Throughout, we use the classic multinomial resampling for the implementation of EnRDA and the particle filter. Apart from the systematic error component, we utilize the standard experimental setting used in numerous DA studies (Miller et al., 1994; Furtado et al., 2008; Van Leeuwen, 2010). The true state is initialized at x_0 = (1.508870, -1.531271, 25.46091) and integrated with a time step of ∆t = 0.01 over a time period of T = 0-20 [t] using the fourth-order Runge-Kutta approximation (Runge, 1895; Kutta, 1901). The observations are obtained at every assimilation interval of 40∆t by assuming an identity observation operator and perturbing the ground truth with Gaussian noise v_t ∼ N(0, σ²_obs Σ_ρ), where σ²_obs = 2 and the correlation matrix Σ_ρ ∈ R^{3×3} is populated with 1 on the diagonal entries, 0.5 on the first sub- and super-diagonals, and 0.25 on the second sub- and super-diagonals.
In order to characterize the distribution of the background state, 100 particles (ensemble members) of the particle filter, EnKF, and EnRDA are generated by corrupting the ground truth at the initial time with a zero-mean Gaussian noise ω_0 ∼ N(0, σ²_0 I_3), where σ²_0 = 2. To introduce systematic errors, the model parameters are set to σ = 10.5, ρ = 27, and β = 10/3. Random errors are also introduced as the system evolves in time by adding a Gaussian noise ω_t ∼ N(0, σ²_b I_3) at every ∆t, with σ²_b = 0.02. Throughout, to draw a robust statistical conclusion about the error statistics, the DA experiments are repeated for 50 independent simulations. As described previously, to properly account for the effects of both bias and ubrmse, the optimal value of the displacement parameter η in EnRDA can be selected based on an offline analysis of the minimum mean squared analysis or forecast error. However, to provide a fair comparison between the EnRDA and other filtering methods, at each assimilation cycle we set η = tr(R)/tr(R + B), assuming that the observation operator is an identity matrix. Note that while the observation error covariance remains constant in time, the background error covariance is obtained from the simulated ensembles by EnRDA and changes dynamically in time. This selection assures that the relative weights assigned to the background state and observations remain at the same order of magnitude among the different methods.

Fig. 6 shows the temporal evolution of the ground truth and the analysis state by the particle filter (first column), EnKF (second column), and EnRDA (third column) over a time period of 0 to 15 [t] for one simulation. As is evident, the particle filter is well capable of capturing the ground truth when the observations lie within the particle spread. However, when the observations lie far apart from the support set of the particles (Fig. 6, dashed box) and the pdfs of the background state and observations become disjoint, the filter becomes degenerate and the analysis state (particle mean) deviates away from the ground truth. It is to be noted that, due to the systematic error, the particles in the z-coordinate lie away from the observations and the trajectory fluctuates around the mean of the ground truth (Fig. 6g, dashed box). As a result, the bias of the particle filter along the z-dimension is markedly lower than that of the EnKF and the EnRDA, while its ubrmse is significantly higher. In contrast, both the EnKF and EnRDA are capable of capturing the true state well even when the ensemble spread and observations are far apart from each other. Although the EnKF does not suffer from the same filter degeneracy problem as the particle filter, in earlier time steps, from 2.5 to 7.5 [t], it struggles to adequately nudge the analysis state towards the ground truth when the ensemble members are far from the observations due to the imposed systematic bias. EnRDA seems to be robust to the propagation of systematic biases in this region and follows the true trajectory well.

[Figure 6. Temporal evolution of the true state x_tr of the Lorenz-63 system, the observations y, and the analysis state x_a for the particle filter (first column), EnKF (second column), and EnRDA (third column), each with 100 particles (ensemble members). The temporal evolution of the particles and ensemble members is shown with solid gray lines. Dashed rectangles mark the windows of time over which the support sets of the observations and the particle spread are disjoint in the particle filter, and over which the EnKF and EnRDA deviate from the ground truth.]

The time evolution of the mean values of the bias and ubrmse for 50 independent simulations, with the same error structure, is color-coded over the phase space in Fig. 7. As is evident, these forecast quality metrics are relatively lower for the EnRDA than for the EnKF and the particle filter throughout the simulation period. Nevertheless, we can see that the improvement compared to the EnKF is modest. In particular, across all dimensions of the problem, the mean bias and ubrmse are decreased in EnRDA by 68 (13) % compared to the particle filter (EnKF).

In this study, we introduced an ensemble data assimilation (DA) methodology over a Riemannian manifold, namely Ensemble Riemannian DA (EnRDA), and illustrated its performance in comparison with a few Euclidean DA techniques for dissipative and chaotic dynamics. We demonstrated that the presented methodology is capable of assimilating information in the probability domain, characterized by the families of distributions with finite second-order moments. The key message is that when the probability distributions of the forecast and observations exhibit non-Gaussian structure and their support sets are disjoint due to the presence of systematic errors, the Wasserstein metric can be leveraged to potentially extend geophysical forecast skills. Nevertheless, future research for a comprehensive comparison with existing filtering and bias correction methodologies is needed to completely characterize the relative pros and cons of the proposed approach, especially when it comes to the ensemble size and the optimal selection of the displacement parameter η.
We explained the role of the regularization and displacement parameters in EnRDA and empirically examined their effects on the optimal joint histogram coupling the background state and observations, and consequently on the analysis state. Nevertheless, future studies are required to characterize closed-form or heuristic expressions to expand our understanding of their impacts on the forecast uncertainty. As was explained earlier, unlike the Euclidean DA methodologies, which assimilate available information using different relative weights across multiple dimensions through the error covariance matrices, a scalar displacement parameter is utilized in the EnRDA that interpolates uniformly between all dimensions of the problem. Future research can be devoted to developing a framework that utilizes a vector representation of the displacement parameters to effectively tackle possible heterogeneity of uncertainty across multiple dimensions.
In its current form, the EnRDA requires the observation operator to be smooth and bijective. This is a limitation when observations of all problem dimensions are not available and propagation of observations to non-observed dimensions is desired. Extending the EnRDA methodology to partially observed systems seems to be an important future research area. This could include performing a rough inversion for the unobserved components of the system offline, or extending the methodology in the direction of particle flows (Van Leeuwen et al., 2019).
Lastly, we should mention that the EnRDA is computationally expensive, as it involves estimation of the coupling through the Wasserstein distance. On a desktop machine with a 3.4 GHz CPU clock rate, it took around 1600 s to complete 50 independent simulations on Lorenz-63 for the EnRDA, compared to 651 (590) s for the particle filter (EnKF) with 100 particles (ensemble members). Since the computational cost is nonlinearly related to the problem dimension, it is expected to grow significantly for large-scale geophysical DA and become a limiting factor. Although the entropic regularization works well for the presented low-dimensional problems, future research is needed to test its efficiency in high-dimensional problems. Constraining the solution of the coupling to a submanifold of probability distributions with a Gaussian mixture structure (Chen et al., 2019b) can be a future research direction for lowering the computational cost.
Code availability. A demo code for EnRDA in the MATLAB programming language can be downloaded at https://github.com/tamangsk/EnRDA.
Author contributions. S.K.T. and A.E. designed the study. S.K.T. implemented the formulation and analyzed the results. P.J.L. and G.L. provided conceptual advice and all authors contributed to the writing.
Competing interests. The authors declare no competing interests.
Acknowledgements. The first and second authors acknowledge grants from the National Aeronautics and Space Administration (NASA) Terrestrial Hydrology Program (THP, 80NSSC18K1528) and the New (Early Career) Investigator Program (NIP, 80NSSC18K0742). The third author acknowledges support from the European Research Council via the Horizon 2020 CUNDA project under number 694509. The fifth author also acknowledges support from the National Science Foundation (NSF, DMS1830418).