Ensemble Riemannian Data Assimilation: Towards High-dimensional Implementation

This paper presents the results of Ensemble Riemannian Data Assimilation for relatively high-dimensional nonlinear dynamical systems, focusing on the chaotic Lorenz-96 model and a two-layer quasi-geostrophic (QG) model of atmospheric circulation. The analysis state in this approach is inferred from a joint distribution that optimally couples the background probability distribution and the likelihood function, enabling formal treatment of systematic biases without any Gaussian assumptions. Despite the risk of the curse of dimensionality in the computation of the coupling distribution, comparisons with the classic implementation of the particle filter and the stochastic ensemble Kalman filter demonstrate that, with the same ensemble size, the presented methodology can improve the predictability of dynamical systems. In particular, under systematic errors, the root mean squared error of the analysis state is reduced by 20% (30%) in the Lorenz-96 (QG) model.


1 Introduction
The science of data assimilation (DA) aims to optimally combine the information content of observations with forecasts of Earth system models (ESMs) to improve the estimation of their initial conditions and thus their predictive capabilities (Kalnay, 2003; Carrassi et al., 2018). Current DA methodologies, either variational (Lorenc, 1986; Zupanski, 1993; Courtier et al., 1994; Rabier et al., 2000; Poterjoy and Zhang, 2014) or filtering (Kalman, 1960; Bishop et al., 2001; Anderson, 2001; Tippett et al., 2003; Janjić et al., 2011; Carrassi and Vannitsem, 2011; Anderson and Lei, 2013; Lei et al., 2018), largely rely on penalization of second-order statistics of the unbiased model and observation errors over the Euclidean space. For example, in three-dimensional variational (3D-Var) DA (Lorenc, 1986; Courtier et al., 1998; Lorenc et al., 2000; Li et al., 2013), a least-squares cost function is formulated, comprising weighted Euclidean distances of the state from the previous model forecast (background state) and from the observations. The solution of this cost function is an analysis state that is a weighted average of the forecasts and observations across the dimensions of the problem, with the weights dictated by prescribed background and observation error covariance matrices. The variants of the Kalman filtering DA methods (Evensen, 1994a; Reichle et al., 2002) follow a similar paradigm. [...] In contrast, the recently introduced Ensemble Riemannian Data Assimilation (EnRDA; Tamang et al., 2021) relies on optimal mass transport over the Wasserstein space to (i) optimally couple the forecast probability distribution and the normalized likelihood function without any parametric assumptions about their shapes and (ii) formally penalize systematic translations between them arising due to potential geophysical biases.
However, the computational complexity of finding an optimal joint coupling between two $m$-dimensional probability distributions, each supported on $d$ points per dimension, using the entropic regularization is $O(d^{2m})$. This might impose a significant limitation on the direct use of EnRDA for high-dimensional geophysical problems, where the problem dimension easily exceeds millions. As will be discussed later, the joint distribution in EnRDA is sampled at $N^2$ support points, with $N$ the number of ensemble members, reducing the computational complexity to $O(N^2)$ at the expense of losing accuracy in the optimal estimation of the joint distribution. Therefore, beyond implementation on a low-dimensional dynamical system such as the 3-dimensional Lorenz-63, the key questions that we aim to answer are as follows: Does the effectiveness of EnRDA remain valid on high-dimensional DA problems where the ensemble size is smaller than the problem dimension? How does EnRDA perform, under systematic errors, in comparison to classic ensemble DA techniques with comparable ensemble size? To answer these questions, we implement EnRDA on the relatively high-dimensional chaotic Lorenz-96 system (Lorenz, 1995) and a two-layer quasi-geostrophic (QG) model of atmospheric circulation (Pedlosky et al., 1987). The results demonstrate that EnRDA can potentially enhance the predictability of high-dimensional geophysical systems when the state variables are not necessarily Gaussian and are corrupted with systematic errors.
The outline of the paper is as follows. Section 2 provides a brief background on optimal mass transport and the Wasserstein distance. A brief review of the EnRDA methodology is presented in Section 3. Section 4 presents different test cases of implementation on the Lorenz-96 and the QG model and documents the performance of the presented approach in comparison with the classic implementation of the standard particle filter with resampling and the Stochastic Ensemble Kalman Filter (SEnKF). A summary and concluding remarks are presented in Section 5. The details of the entropic regularization for EnRDA, and the covariance inflation and localization procedures for the SEnKF, are provided in Appendix A.

2 Background on OMT and the Wasserstein Barycenter
We provide a brief background on the theory of optimal mass transport (OMT) and Wasserstein barycenters. The OMT theory, first put forward by Monge (1781), aims to find the minimum cost of transporting distributed masses of materials from known source points to target points. The theory was later expanded into a tool for comparing probability distributions (Brenier, 1987; Villani, 2003) and has since found applications in data assimilation (Ning et al., 2014; Feyeux et al., 2018; Li et al., 2018; Tamang et al., 2020), subsurface geophysical inverse problems (Chen et al., 2018a; Yong et al., 2019), and comparisons of climate model simulations (Vissio et al., 2020).
Let us consider a discrete source probability distribution $p(\mathbf{x}) = \sum_{i=1}^{M} p_{x_i}\,\delta_{\mathbf{x}_i}$ and a target distribution $p(\mathbf{y}) = \sum_{j=1}^{N} p_{y_j}\,\delta_{\mathbf{y}_j}$ with their respective probability masses $\{\mathbf{p}_x \in \mathbb{R}^M_+ : \sum_i p_{x_i} = 1\}$ and $\{\mathbf{p}_y \in \mathbb{R}^N_+ : \sum_j p_{y_j} = 1\}$ supported on $m$- and $n$-element column vectors $\mathbf{x}_i \in \mathbb{R}^m$ and $\mathbf{y}_j \in \mathbb{R}^n$, respectively. The notation $\mathbf{p}_x \in \mathbb{R}^M_+$ represents probability masses $\mathbf{p}_x$ containing non-negative real numbers supported on $M$ points, whereas $\delta_{\mathbf{x}}$ is the Dirac function at $\mathbf{x}$. In the Monge formulation, the goal is to seek an optimal surjective transportation map $T$, with $T_\#\, p(\mathbf{x}) = p(\mathbf{y})$, that "pushes forward" the source distribution $p(\mathbf{x})$ towards the target distribution $p(\mathbf{y})$ with a minimum transportation cost as follows:

$$\min_{T}\ \sum_{i=1}^{M} c\big(\mathbf{x}_i,\, T(\mathbf{x}_i)\big)\, p_{x_i} \quad \text{subject to} \quad T_\#\, p(\mathbf{x}) = p(\mathbf{y}), \tag{1}$$

where $c(\cdot,\cdot) \in \mathbb{R}_+$ represents the cost of transporting a unit mass from one support point in $\mathbf{x}$ to another in $\mathbf{y}$.

The problem formulation by Monge as expressed in Equation 1, however, is non-convex, and the existence of an optimal transportation map is not guaranteed (Chen et al., 2019b), especially when the number of support points of the target distribution exceeds that of the source distribution ($N > M$) (Peyré et al., 2019). This limitation was overcome by Kantorovich (1942), who introduced a probabilistic formulation of OMT that allows splitting of probability mass from a single source point to multiple target points. The Kantorovich formalism recasts the OMT problem in a linear programming framework that finds an optimal joint distribution or coupling $\mathbf{U}^a \in \mathbb{R}^{M \times N}_+$ between the marginal source and target distributions with the following optimality criterion:

$$\mathbf{U}^a = \underset{\mathbf{U} \in \mathbb{R}^{M \times N}_+}{\operatorname{argmin}}\ \mathrm{tr}\big(\mathbf{C}^{\mathrm T}\mathbf{U}\big) \quad \text{subject to} \quad \mathbf{U}\,\mathbb{1}_N = \mathbf{p}_x, \quad \mathbf{U}^{\mathrm T}\,\mathbb{1}_M = \mathbf{p}_y, \tag{2}$$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix, $(\cdot)^{\mathrm T}$ is the transposition operator, and $\mathbb{1}_M$ represents an $M$-element column vector of ones. In the above formulation, the known $\{\mathbf{C} \in \mathbb{R}^{M \times N}_+ : c_{ij} = \|\mathbf{x}_i - \mathbf{y}_j\|_2^2\}$ denotes the so-called transportation cost matrix, defined based on the $\ell_2$-norm $\|\cdot\|_2$ or the Euclidean distance between the support points of the source and target distributions.
Here, the $(i,j)$th element $u^a_{ij}$ of the optimal joint distribution $\mathbf{U}^a$ represents the amount of mass transported from support point $\mathbf{x}_i$ to $\mathbf{y}_j$. The 2-Wasserstein distance between the marginal probability distributions is then defined as the square root of the optimal transportation cost, $d_W(\mathbf{p}_x, \mathbf{p}_y) = \mathrm{tr}\big(\mathbf{C}^{\mathrm T}\mathbf{U}^a\big)^{1/2}$ (Dobrushin, 1970; Villani, 2008). It should be noted that, due to the linear equality and non-negativity constraints in Equation 2, the family of joint distributions that satisfy these constraints forms a bounded convex polytope (Cuturi and Peyré, 2018), and consequently the optimal joint distribution $\mathbf{U}^a$ is located on one of the extreme points of such a polytope (Peyré et al., 2019).
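As a concrete illustration, the Kantorovich problem in Equation 2 can be solved directly as a linear program for small discrete distributions. The sketch below (Python with scipy, toy data) recovers the optimal coupling and the 2-Wasserstein distance; the support points and masses are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy discrete distributions: M source and N target points in R^2.
M, N = 5, 6
x = rng.normal(size=(M, 2))                 # source support points x_i
y = rng.normal(loc=1.0, size=(N, 2))        # target support points y_j
p_x = np.full(M, 1.0 / M)                   # uniform source masses
p_y = np.full(N, 1.0 / N)                   # uniform target masses

# Transportation cost matrix c_ij = ||x_i - y_j||_2^2.
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

# Kantorovich linear program over u = vec(U):
# minimize <C, U> subject to U 1_N = p_x, U^T 1_M = p_y, U >= 0.
A_eq = np.zeros((M + N, M * N))
for i in range(M):
    A_eq[i, i * N:(i + 1) * N] = 1.0        # row-sum (source marginal) constraints
for j in range(N):
    A_eq[M + j, j::N] = 1.0                 # column-sum (target marginal) constraints
b_eq = np.concatenate([p_x, p_y])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
U_opt = res.x.reshape(M, N)                 # optimal coupling U^a

# 2-Wasserstein distance: square root of tr(C^T U^a) = sum_ij c_ij u_ij.
d_W = np.sqrt(np.sum(C * U_opt))
print("2-Wasserstein distance:", d_W)
```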
Recall that over the Euclidean space the barycenter of a group of points is their (weighted) mean. Analogously, the Wasserstein barycenter of two probability distributions $p_1$ and $p_2$, with an interpolation weight $\eta \in [0, 1]$, is the distribution that minimizes the weighted sum of their squared Wasserstein distances, $\hat{p} = \operatorname{argmin}_{p}\ \eta\, d_W^2(p_1, p) + (1 - \eta)\, d_W^2(p_2, p)$.

3 Ensemble Riemannian Data Assimilation (EnRDA)
In this section, to be self-contained, we provide a brief summary of the EnRDA methodology; more details can be found in Tamang et al. (2021). Let us assume that the evolution of the $i$th ensemble member $\mathbf{x}_i \in \mathbb{R}^m$ of ESM simulations can be represented by the following stochastic dynamical system:

$$\mathbf{x}_i^t = \mathcal{M}\big(\mathbf{x}_i^{t-1}\big) + \boldsymbol{\omega}_i^t, \tag{3}$$

where $\mathcal{M} : \mathbb{R}^m \to \mathbb{R}^m$ is the deterministic nonlinear model operator, evolving the model state in time, and $\boldsymbol{\omega}_i^t \in \mathbb{R}^m$ is a stochastic error term. This dynamical system is observed at time $t$ through an observation equation

$$\mathbf{y}^t = \mathcal{H}\big(\mathbf{x}^t\big) + \boldsymbol{\upsilon}^t, \tag{4}$$

where $\mathcal{H} : \mathbb{R}^m \to \mathbb{R}^n$ maps the state to the observation space and $\boldsymbol{\upsilon}^t \in \mathbb{R}^n$ represents an additive observation error. Note that the error terms are not necessarily drawn from Gaussian distributions but need to have finite second-order moments.
Hereafter, we drop the time superscript for brevity and represent the model (or background) probability distribution as $p(\mathbf{x}) = \sum_{i=1}^{M} p_{x_i}\,\delta_{\mathbf{x}_i}$ with its probability mass vector $\{\mathbf{p}_x \in \mathbb{R}^M_+ : \sum_i p_{x_i} = 1\}$. Furthermore, the normalized likelihood function is represented as $p(\mathbf{y}|\mathbf{x})$, centered at the given observation $\mathbf{y}$, with its probability mass vector $\{\mathbf{p}_{y|x} \in \mathbb{R}^N_+ : \sum_j p_{y|x_j} = 1\}$. The probability distribution of the analysis state $p(\mathbf{x}^a)$ is then defined as the Wasserstein barycenter between the forecast distribution and the normalized likelihood function:

$$p(\mathbf{x}^a) = \underset{p}{\operatorname{argmin}}\ \eta\, d_W^2\big(p(\mathbf{x}),\, p\big) + (1 - \eta)\, d_W^2\big(p(\mathbf{y}|\mathbf{x}),\, p\big), \tag{5}$$

where $\eta \in [0, 1]$ is a displacement parameter, a hyperparameter that controls the relative weight of the background histogram and the likelihood function in the characterization of the analysis state distribution. The optimal value of $\eta$ needs to be determined offline, using reference data, through cross-validation studies. It is important to note that the above formalism requires all dimensions to be observable; dimensions with no observations cannot be updated, which is a limitation of the current formalism compared to Euclidean DA. This limitation is further discussed in Section 5.
To solve the above DA problem, we need to characterize the background distribution and the normalized likelihood function.
Similar to the approach used in the particle filter (Gordon et al., 1993; van Leeuwen, 2010), we suggest approximating them empirically by their ensemble representations. The analysis state distribution is then supported on the pairwise combinations of the background and observation support points,

$$p(\mathbf{x}^a) = \sum_{i=1}^{M} \sum_{j=1}^{N} u^a_{ij}\, \delta_{\mathbf{z}_{ij}}, \tag{6}$$

where $\mathbf{z}_{ij} = \eta\, \mathbf{x}_i + (1 - \eta)\, \mathbf{y}_j$ represents the support points of the analysis distribution and $u^a_{ij}$ are the elements of the joint distribution $\{\mathbf{U}^a \in \mathbb{R}^{M \times N}_+ : \sum_i \sum_j u^a_{ij} = 1\}$. It is important to note that the analysis state histogram, at each assimilation cycle, is supported on at most $M + N - 1$ points, which is the maximum number of non-zero entries in the optimal joint coupling (Peyré et al., 2019). To keep the number of ensemble members constant throughout, $M$ ensemble members are resampled from $p(\mathbf{x}^a)$ using the multinomial resampling scheme (Li et al., 2015).
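To make the update concrete, the following minimal sketch performs one EnRDA analysis step on synthetic ensembles. The data, the compact Sinkhorn helper (derived in Appendix A1), and all parameter values are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sinkhorn(p_x, p_yx, C, gamma, n_iter=200):
    # Compact entropic-OT solver (see Appendix A1 for the derivation).
    V = np.exp(-C / gamma)
    t = np.ones(len(p_yx))
    for _ in range(n_iter):
        s = p_x / (V @ t)
        t = p_yx / (V.T @ s)
    return s[:, None] * V * t[None, :]      # U^a = diag(s) V diag(t)

# Illustrative forecast ensemble (M members) and likelihood samples (N), state dimension m = 3.
M, N, m = 50, 50, 3
X = rng.normal(loc=2.0, size=(M, m))        # background ensemble with a systematic shift
Y = rng.normal(size=(N, m))                 # samples from the normalized likelihood
p_x = np.full(M, 1.0 / M)
p_yx = np.full(N, 1.0 / N)

eta, gamma = 0.44, 10.0                     # illustrative displacement / regularization values
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
U = sinkhorn(p_x, p_yx, C, gamma)           # joint coupling of background and likelihood

# Analysis support points z_ij = eta x_i + (1 - eta) y_j weighted by u_ij (Equation 6),
# followed by multinomial resampling back to M ensemble members.
Z = (eta * X[:, None, :] + (1 - eta) * Y[None, :, :]).reshape(M * N, m)
idx = rng.choice(M * N, size=M, p=U.ravel() / U.sum())
X_analysis = Z[idx]
```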

Computation of the joint distribution in Equation 2 is computationally expensive, as explained previously, and can be prohibitive for high-dimensional geophysical problems. As suggested by Cuturi (2013), to reduce the computational cost, we regularize the cost function in the optimal transportation formulation of EnRDA by a Gibbs-Boltzmann entropy function:

$$\mathbf{U}^a = \underset{\mathbf{U} \in \mathbb{R}^{M \times N}_+}{\operatorname{argmin}}\ \mathrm{tr}\big(\mathbf{C}^{\mathrm T}\mathbf{U}\big) - \gamma\, E(\mathbf{U}), \qquad E(\mathbf{U}) = -\sum_{i,j} u_{ij}\,\big(\log u_{ij} - 1\big), \tag{7}$$

where $\gamma \in \mathbb{R}_+$ is a regularization parameter. The entropic regularization transforms the original OMT formulation into a strictly convex problem, which can then be efficiently solved using Sinkhorn's algorithm (Sinkhorn, 1967). The details of Sinkhorn's algorithm for solving regularized optimal transportation problems are presented in Appendix A1. The regularization parameter $\gamma$ balances the solution between the optimal joint distribution and the one that maximizes the relative entropy. It is evident from Equation 7 that in the limit $\gamma \to 0$, the solution converges to the analysis joint distribution with a minimum morphing cost. As the value of $\gamma$ increases, the convexity of the problem also increases, enabling the deployment of more efficient optimization algorithms than classic solvers of linear programming problems (Dantzig et al., 1955; Orlin, 1993). At the same time, the number of non-zero entries of the joint coupling increases from $M + N - 1$ to $MN$ as $\gamma \to \infty$, which results in a maximum entropy solution that converges to $\mathbf{U}^a \to \mathbf{p}_x\, \mathbf{p}_{y|x}^{\mathrm T}$. For a more comprehensive explanation of EnRDA, one can refer to Tamang et al. (2021).
As an example, we examine here the solution of Equation 5 between a banana-shaped distribution and a bivariate Gaussian distribution, resembling the background distribution $p(\mathbf{x})$ and the normalized likelihood function $p(\mathbf{y}|\mathbf{x})$, respectively, as a function of the displacement parameter $\eta \in [0, 1]$ with regularization parameter $\gamma = 1000$. As seen in Figure 1, for lower values of $\eta$, the analysis state distribution is closer to the observation and its shape resembles the Gaussian distribution. However, as the value of $\eta$ increases, the analysis state distribution moves closer to the background distribution and morphs into a banana-shaped distribution.
Therefore, the analysis state distribution is defined as the one that is sufficiently close to the background distribution and the normalized likelihood function, not only based on their shape but also their central location, depending on the displacement parameter. Thus, unlike the Euclidean barycenter, this approach does not guarantee that the mean or mode of the analysis state probability distribution is a minimum mean-squared error estimate of the initial condition.
It is important to note that in the original OMT formulation, the number of support points required for the optimal joint coupling $\mathbf{U}$ scales exponentially with the problem dimension as $d^m$, where $d$ represents the number of support points in each dimension and $m$ is the number of dimensions, making it potentially restrictive for high-dimensional problems. The presented EnRDA setting bypasses this problem by sampling the joint distribution using only $N$ ensemble members. However, such an approximation might lead to errors in the optimal estimation of the joint coupling that are translated into the analysis state. In the next section, we present results from systems of well-known dynamics to investigate whether EnRDA can lead to a proper approximation of the analysis state under systematic errors in relatively high-dimensional nonlinear problems, when compared to classic ensemble DA approaches with comparable ensemble size.

4 Numerical Experiments

4.1 Lorenz-96
The Lorenz model (Lorenz-96; Lorenz, 1995), which is widely adopted as a testbed for numerous DA experiments (Trevisan and Palatella, 2011; Tang et al., 2014; Shen and Tang, 2015; Lguensat et al., 2017; Tian et al., 2018), offers a simplified representation of the extra-tropical dynamics in the Earth's atmosphere. The model coordinates $\{\mathbf{x} = (x_1, \dots, x_K)^{\mathrm T} \in \mathbb{R}^K\}$ at $K$ dimensions represent the state of an arbitrary atmospheric quantity measured along the Earth's latitudes at $K$ equally spaced longitudinal slices. The model is designed to mimic the continuous-time variation in atmospheric quantities due to interactions between three major components, namely advection, internal dissipation, and external forcing. The model dynamics is represented as follows:

$$\frac{dx_k}{dt} = \big(x_{k+1} - x_{k-2}\big)\, x_{k-1} - x_k + F, \qquad k = 1, \dots, K, \tag{8}$$

where $F \in \mathbb{R}_+$ is a constant external forcing independent of the model state. The Lorenz-96 model has cyclic boundaries, that is, $x_{k-K} = x_{k+K} = x_k$. It is known that for small values of $F < 8/9$, the system approaches a steady-state condition with each coordinate converging to the external forcing, $x_k \to F\ \forall k$, whereas for $F > 8/9$, chaos develops (Lorenz and Emanuel, 1998). For the standard model setup with $F = 8$, the system is known to exhibit highly chaotic behavior with a largest Lyapunov exponent of 1.67 (Brajard et al., 2020).

4.1.1 Experimental Setup, Results and Discussion
We focus on the 40-dimensional Lorenz-96 system (i.e., $K = 40$) and compare EnRDA results with the classic implementation of the particle filter (PF; Gordon et al., 1993; Van Leeuwen, 2009; van Leeuwen, 2010; Poterjoy and Anderson, 2016) and the Stochastic Ensemble Kalman filter (SEnKF; Evensen, 1994b; Houtekamer and Mitchell, 1998; Burgers et al., 1998; Janjić et al., 2011; Anderson, 2016; Van Leeuwen, 2020). Similar to the experimental setting suggested in Lorenz and Emanuel (1998) and Nerger et al. (2012a), we initialize the model by choosing $x_{20} = 8.008$ and $x_k = 8$ for all other model coordinates. In order to avoid any initial transient effect, the model in Equation 8 is integrated for 1000 time steps using the fourth-order Runge-Kutta approximation (Runge, 1895; Kutta, 1901) with a non-dimensional time step of $\Delta t = 0.01$, and the endpoint of the run is utilized as the initial condition for DA experimentation.
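The model integration and spin-up just described can be reproduced in a few lines of Python; a minimal sketch (numpy only) is given below.

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    # dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, with cyclic indexing (Equation 8).
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt, F=8.0):
    # One fourth-order Runge-Kutta step.
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Spin-up as described: K = 40, x_20 = 8.008, all other coordinates 8, 1000 steps of dt = 0.01.
K, dt = 40, 0.01
x = np.full(K, 8.0)
x[19] = 8.008                # x_20 in the paper's 1-based indexing
for _ in range(1000):
    x = rk4_step(x, dt)
# x now serves as the initial condition for the DA experiments.
```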
Similar to the experimental setting suggested in van Leeuwen (2010), we obtain the ground truth by integrating Equation 8 with a time step of $\Delta t$ over the period $T = 0$-$20$ in the absence of any model error. The observations are assumed to be available at each assimilation interval of $10\Delta t$ and deviate from the ground truth by a Gaussian error $\boldsymbol{\upsilon}^t \sim \mathcal{N}(0, \sigma^2_{\mathrm{obs}} \Sigma_\rho)$, with $\sigma^2_{\mathrm{obs}} = 1$ and the correlation matrix $\Sigma_\rho \in \mathbb{R}^{40 \times 40}$ with 1 on the diagonal, 0.5 on the first sub- and super-diagonals, and 0 everywhere else. The observation time step of $10\Delta t$ is equivalent to 12 hours in global ESMs (Lorenz, 1995).
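A minimal sketch of how such correlated observation errors can be drawn is shown below; `x_truth` is a hypothetical stand-in for the true state at an observation time.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observation-error covariance: unit variance, correlation 0.5 on the first off-diagonals.
K, sigma2_obs = 40, 1.0
Sigma_rho = np.eye(K) + 0.5 * (np.eye(K, k=1) + np.eye(K, k=-1))
R = sigma2_obs * Sigma_rho

# Draw a correlated observation error via the Cholesky factor of R.
x_truth = np.full(K, 8.0)            # stand-in for the true state at an observation time
L_chol = np.linalg.cholesky(R)
y = x_truth + L_chol @ rng.standard_normal(K)
```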
To characterize the distribution of the background state for each DA methodology, 50 (5000) ensemble members (particles) for the SEnKF and EnRDA (PF) are generated using model errors $\boldsymbol{\omega}^t \sim \mathcal{N}(0, \sigma^2_t \mathbf{I}_{40})$ with $\sigma^2_t = 0.25$ for $t > 0$ and $\sigma^2_0 = 4$, where throughout $\mathbf{I}_m$ represents an $m \times m$ identity matrix. To alleviate the known degeneracy problem in the PF, a higher number of particles was used. Furthermore, to introduce additional systematic background error, we utilize an erroneous external forcing of $F_m = 6$ instead of the "true" forcing value $F = 8$. To obtain a robust inference, the average values of the error metrics are reported over 50 experiments using different random realizations. As will be elaborated later, we set the EnRDA displacement parameter $\eta = 0.44$, determined through a cross-validation study based on a minimum mean-squared error criterion. This tuning is similar to tuning inflation and localization parameters in a typical EnKF, or tuning length scales in variational DA.

As previously noted, the displacement parameter $\eta$ plays an important role in EnRDA as it controls the shape and position of the analysis state distribution relative to the background distribution and the normalized likelihood function. Currently, there exists no known closed-form solution for optimal approximation of this parameter. Therefore, in this paper, we focus on determining its optimal value through heuristic cross-validation by an offline bias-variance trade-off analysis. Specifically, we quantify the rmse of the EnRDA analysis state for different values of $\eta$ over 50 independent simulations.
The bias and rmse, together with their respective 5th-95th percentile bounds, as functions of the displacement parameter $\eta$, are shown in Figure 4a. As explained earlier, when $\eta$ increases, the analysis distribution moves towards the background distribution. Since the background state is systematically biased due to the erroneous external forcing, the analysis bias increases monotonically with $\eta$, while the rmse exhibits a minimum. Therefore, there exists a form of bias-variance trade-off in the analysis error, which leads to an approximation of an optimal value of $\eta$ based on a minimum rmse criterion. It is important to note that the background uncertainty, and thus the optimal value of $\eta$, varies with the ensemble size, as shown in Figure 4b. The reason is that a larger number of ensemble members reduces the uncertainty in the characterization of the background, but the bias is not affected; to compensate, a larger optimal value of $\eta$ is needed. This optimal value approaches an asymptote as the ensemble size increases and attains its highest value in the limit $M \to \infty$, when the sample moments converge to the biased forecast moments.
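The offline selection of $\eta$ by a minimum-rmse criterion can be organized as a simple grid search. The sketch below is illustrative only: `run_enrda_experiment` is a hypothetical stand-in for a full assimilation run, here replaced by a synthetic curve that merely mimics the bias-variance trade-off, not the authors' code or data.

```python
import numpy as np

def run_enrda_experiment(eta, seed):
    # Hypothetical stand-in for one full EnRDA run returning an analysis rmse.
    # The quadratic-plus-noise shape only mimics the trade-off seen in Figure 4a.
    local_rng = np.random.default_rng(seed)
    return 1.0 + (eta - 0.44) ** 2 + 0.05 * local_rng.standard_normal()

etas = np.linspace(0.0, 1.0, 26)
rmse = np.array([[run_enrda_experiment(eta, s) for s in range(50)] for eta in etas])
eta_opt = etas[np.argmin(rmse.mean(axis=1))]    # minimum-rmse selection of eta
print(f"selected displacement parameter: eta = {eta_opt:.2f}")
```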
One may argue that such a tuning favors EnRDA, since it explicitly accounts for the effects of bias, either in the background or the observations, while there is no bias-correction mechanism in the implementations of the SEnKF and the PF. To make a fairer comparison, we investigate an alternative approach that approximates the displacement parameter solely based on known error statistics. Knowing the Wasserstein distances from the ground truth to both the likelihood function and the forecast distribution enables obtaining an optimal value of $\eta$. Even though such distances are not known in reality, the total Wasserstein distance between the normalized likelihood function and the forecast distribution is known at each assimilation cycle. Therefore, given an estimate of the distance between the ground truth and either the normalized likelihood function or the forecast distribution, one can approximate $\eta$.
It is known that the square of the Wasserstein distance between two equal-mean Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_2)$ is $\mathrm{tr}\big(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2(\boldsymbol{\Sigma}_1^{1/2}\boldsymbol{\Sigma}_2\boldsymbol{\Sigma}_1^{1/2})^{1/2}\big)$ (Chen et al., 2019b). Therefore, under the assumption that only the background state is biased, the square of the Wasserstein distance between the true state $\mathbf{x}^{\mathrm{tr}}$, as a Dirac delta function, and the normalized likelihood function reduces to $\mathrm{tr}(\mathbf{R})$. At the same time, the square of the Wasserstein distance between the normalized likelihood function and the forecast distribution is $\mathrm{tr}(\mathbf{C}^{\mathrm T}\mathbf{U}^a)$. Therefore, we can approximate the displacement parameter as $\eta_a = \mathrm{tr}(\mathbf{R})\,\big(\mathrm{tr}(\mathbf{C}^{\mathrm T}\mathbf{U}^a) + \mathrm{tr}(\mathbf{R})\big)^{-1}$ without any explicit a priori knowledge of the bias.
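Given the quantities available at an assimilation cycle, this bias-blind estimate is a one-liner; the sketch below assumes $\mathbf{R}$, $\mathbf{C}$, and $\mathbf{U}^a$ are already assembled as numpy arrays.

```python
import numpy as np

def displacement_parameter(R, C, U):
    # eta_a = tr(R) (tr(C^T U^a) + tr(R))^{-1}; note tr(C^T U) = sum_ij c_ij u_ij.
    tr_R = np.trace(R)
    return tr_R / (np.sum(C * U) + tr_R)
```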
Comparisons of the rmse values for the studied DA methodologies as a function of ensemble size are shown in Figure 5.

For EnRDA, the displacement parameter is obtained either from the bias-aware cross-validation ($\eta = 0.44$; EnRDA-I) or from the known error covariances as explained above (EnRDA-II). The SEnKF and EnRDA result in smaller error metrics with a much smaller ensemble size than the PF. As seen, EnRDA can perform well even for ensemble sizes as low as 20. Its results quickly stabilize with more than 40 ensemble members and exhibit an improvement over the SEnKF (12-24%) in the presence of bias. The rmse of the SEnKF also stabilizes quickly but remains above the standard deviation of the observation error, indicating that in the presence of bias the lowest possible variance, known as the Cramér-Rao lower bound (Cramér, 1999; Rao et al., 1973), cannot be met. It is also important to note that the higher rmse of the PF compared to the SEnKF and EnRDA is due to filter degeneracy, which is further exacerbated by the presence of systematic errors in model forecasts (Poterjoy and Anderson, 2016). To alleviate this problem, one may investigate methodologies suggested in recent years, including the auxiliary particle filter, where the weights of the particles at each assimilation cycle are defined based on the likelihood function from the next cycle using a pre-model run (Pitt and Shephard, 1999); the backtracking particle filter, in which the analysis state is backtracked to identify the time step when the filter became degenerate (Spiller et al., 2008); as well as sampling from a transition density to pull particles back towards observations (van Leeuwen, 2010).

4.2 Quasi-Geostrophic Model
The multilayered quasi-geostrophic model (QG; Pedlosky et al., 1987) is known as one of the simplest circulation models capable of providing a reasonable representation of the mesoscale variability in geophysical flows. In its simplified form, the QG model describes the conservation of potential vorticity $\{\zeta_k\}_{k=1}^{K}$ in $K$ vertically stacked layers:

$$\frac{\partial \zeta_k}{\partial t} + u_k \frac{\partial \zeta_k}{\partial \lambda} + v_k \frac{\partial \zeta_k}{\partial \phi} = 0, \qquad k = 1, \dots, K, \tag{9}$$

where $u_k = -\partial \Psi_k / \partial \phi$ and $v_k = \partial \Psi_k / \partial \lambda$ represent the zonal and meridional components of the velocity field, obtained from the geostrophic approximation; $\{\Psi_k\}_{k=1}^{K}$ is the streamfunction in the $K$ layers; and $\lambda$ and $\phi$ are the zonal and meridional coordinates, respectively.
For a two-layer QG model ($K = 2$), the potential vorticity at any time step is the sum of the relative vorticity, the planetary vorticity, and the stretching term, given by:

$$\zeta_1 = \nabla^2 \Psi_1 + f + \frac{f_0^2}{g' h_1}\big(\Psi_2 - \Psi_1\big), \qquad \zeta_2 = \nabla^2 \Psi_2 + f + \frac{f_0^2}{g' h_2}\big(\Psi_1 - \Psi_2\big), \tag{10}$$

where $f = f_0 + \beta(\phi - \phi_0)$ is the Coriolis parameter linearly varying with the meridional coordinate $\phi$ ($\beta$-plane approximation), $f_0$ is the Coriolis parameter at mid-basin where $\phi = \phi_0$, $g' = g(\rho_2 - \rho_1)/\rho_2$ is the reduced value of the gravitational acceleration $g$, and $\rho_k$ and $h_k$ are the density and thickness of the $k$th layer, respectively. The QG model has been the subject of numerous experiments to test the performance of DA techniques (Evensen, 1994b; Evensen and Van Leeuwen, 1996; Fisher and Gürol, 2017; Penny et al., 2019; Cotter et al., 2020).
From the initial value of the streamfunction field in each layer, the potential vorticity is obtained using a nine-point second-order finite difference scheme to compute the Laplacian in Equation 10 (a minimal sketch of one such stencil is given below). The model in Equation 9 is then integrated with a time step of $\Delta t = 0.5$ hr using the fourth-order Runge-Kutta approximation to advect and obtain the potential vorticity at internal grid points for the next time step. The streamfunction at the next time step is then calculated from this potential vorticity by solving the associated elliptic inversion problem.

To characterize the distribution of the background state, 50 ensemble members for both the SEnKF and EnRDA are generated using model errors $\boldsymbol{\omega}^t \sim \mathcal{N}(0, \alpha\, \sigma^2_t \mathbf{I}_{m_\lambda \times m_\phi})$ for each layer, with $\sigma^2_0 = 10^8\ \mathrm{m}^4\,\mathrm{s}^{-2}$ and $\sigma^2_t = 5 \times 10^6\ \mathrm{m}^4\,\mathrm{s}^{-2}$ for $t > 0$, where the factor $\alpha \in [0, 1]$ grows linearly from 0 at the northern and southern boundaries to 1 at mid-basin. To introduce systematic errors in the forecast, we utilize a multiplicative error of 0.015% in the QG model by multiplying the potential vorticity obtained from Equation 10 at every $\Delta t$ by a factor of 1.00015. At each assimilation cycle, $N = 500$ samples of the observations are obtained by perturbing the observations with heteroscedastic Gaussian noise with a standard deviation equal to 10% of the mean magnitude of the ground truth.
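For illustration, one common nine-point second-order discretization of the Laplacian is sketched below; the paper does not spell out its exact stencil, so this particular variant is an assumption.

```python
import numpy as np

def laplacian9(psi, h):
    # Nine-point second-order stencil (1 / (6 h^2)) * [[1, 4, 1], [4, -20, 4], [1, 4, 1]];
    # boundary values are left untouched here and would be set by the boundary conditions.
    lap = np.zeros_like(psi)
    lap[1:-1, 1:-1] = (
        4.0 * (psi[:-2, 1:-1] + psi[2:, 1:-1] + psi[1:-1, :-2] + psi[1:-1, 2:])
        + (psi[:-2, :-2] + psi[:-2, 2:] + psi[2:, :-2] + psi[2:, 2:])
        - 20.0 * psi[1:-1, 1:-1]
    ) / (6.0 * h ** 2)
    return lap
```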
In the SEnKF, to alleviate the well-known problem of undersampling (Anderson, 2012) and improve its performance, we utilize covariance inflation (Anderson and Anderson, 1999) and localization (Houtekamer and Mitchell, 2001; Hamill, 2001), as discussed in Appendix A2. For EnRDA, similar to the Lorenz-96 setup (Section 4.1.1), the displacement parameter is set to $\eta = 0.4$ through a cross-validation study based on a minimum rmse criterion, as shown in Table 1.

The results of the DA experiments using the SEnKF and EnRDA at the first assimilation cycle for the bottom layer are shown in Figure 7. It can be seen that, in the SEnKF, the streamfunction values are slightly overestimated, signaling the persistence of bias in the analysis state (Figure 7a). This is further evident as the analysis error field is coherent and structured (Figure 7b). On the other hand, EnRDA (Figure 7d) results in a more incoherent error field with a reduced bias (Figure 7e). The rmse for EnRDA ($0.28 \times 10^6\ \mathrm{m}^2\,\mathrm{s}^{-1}$) is lower than that of the SEnKF ($0.46 \times 10^6\ \mathrm{m}^2\,\mathrm{s}^{-1}$).

We further examined the performance of EnRDA and the SEnKF on the QG model with a $\pm 50\%$ change in the assimilation interval of 12 hr, as shown in Figure 8. To make the comparison fair between different assimilation intervals, which have different numbers of assimilation cycles, and to eliminate the impact of transient behavior, we only report the statistics for the last 15 assimilation steps. With the increase in the assimilation interval, the systematic error grows in the forecast, largely because the multiplicative error is added to the forecast at every time step. Therefore, as expected, the rmse grows monotonically with the assimilation interval and the performance of the DA methodologies degrades. However, EnRDA demonstrates a consistent improvement over the bias-blind implementation of the SEnKF (20-33%) across the range of assimilation intervals. On average, using a cluster with 24 cores and a clock rate of 2.5 GHz, it took around 3.5 hr to complete one independent simulation of the QG model with EnRDA, compared to 2.5 hr for the SEnKF, each with 50 ensemble members.

5 Summary and Concluding Remarks
In this study, we demonstrated that data assimilation over the Wasserstein space through EnRDA can improve the predictability of relatively high-dimensional chaotic systems in the presence of systematic errors, reducing the analysis rmse relative to classic ensemble DA techniques in both the Lorenz-96 and the two-layer QG experiments. One of the major weaknesses of the presented methodology in its current form is that all dimensions of the problem are assumed to be observable. This is an important issue when it comes to the assimilation of sparse data. Future research is needed to address partial observability in DA over the Wasserstein space. A possible direction is through multi-marginal optimal mass transport (Pass, 2015), which could enable coupling different dimensions of the problem and propagating the information content of sparse observations to unobserved dimensions. Moreover, the displacement parameter is currently constant across the dimensions of the problem. Future research is needed to understand how the displacement parameter can be estimated differently depending on the error structure across different dimensions of the state space. Another option is to perform the EnRDA update only in the part of the state space that is directly observed and use the ensemble covariance to update the unobserved part, similar to a SEnKF. We anticipate that expanding the application of the presented methodology to assimilating satellite data into land-atmosphere models could be another promising direction of future research, given that these models are often markedly biased (Dee and Da Silva, 1998; Chepurin et al., 2005; De Lannoy et al., 2007; Lin et al., 2017).

A1 Sinkhorn's Algorithm for Optimal Mass Transport
To solve the regularized optimal mass transport problem in Equation 7, we utilize Sinkhorn's algorithm (Sinkhorn, 1967). To that end, the Lagrangian form of Equation 7, using two Lagrange multipliers $\mathbf{a} \in \mathbb{R}^M$ and $\mathbf{b} \in \mathbb{R}^N$, is obtained as follows:

$$L(\mathbf{U}, \mathbf{a}, \mathbf{b}) = \mathrm{tr}\big(\mathbf{C}^{\mathrm T}\mathbf{U}\big) - \gamma\, E(\mathbf{U}) - \mathbf{a}^{\mathrm T}\big(\mathbf{U}\,\mathbb{1}_N - \mathbf{p}_x\big) - \mathbf{b}^{\mathrm T}\big(\mathbf{U}^{\mathrm T}\mathbb{1}_M - \mathbf{p}_{y|x}\big). \tag{A1}$$

Now, we set the first-order derivative of the Lagrangian in Equation A1 with respect to the $(i,j)$th element of the joint distribution $u_{ij}$ to zero:

$$\frac{\partial L}{\partial u_{ij}} = c_{ij} + \gamma \log u_{ij} - a_i - b_j = 0, \tag{A2}$$

which ultimately leads to $u_{ij} = e^{a_i/\gamma}\, e^{-c_{ij}/\gamma}\, e^{b_j/\gamma}$. This can be rewritten in matrix form as $\mathbf{U} = \mathrm{diag}(\mathbf{s})\, \mathbf{V}\, \mathrm{diag}(\mathbf{t})$, where $\mathbf{V} = e^{-\mathbf{C}/\gamma}$ (element-wise) is the Gibbs kernel of the cost matrix $\mathbf{C}$, and $\mathbf{s} = e^{\mathbf{a}/\gamma} \in \mathbb{R}^M$ and $\mathbf{t} = e^{\mathbf{b}/\gamma} \in \mathbb{R}^N$ are the unknown scaling vectors. The notation $\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{M \times M}$ represents a diagonal matrix with its diagonal entries given by $\mathbf{x} \in \mathbb{R}^M$.
By setting the derivatives of the Lagrangian with respect to the Lagrange multipliers to zero, we recover the two marginal conditions, $\mathbf{p}_x = \mathrm{diag}(\mathbf{s})\, \mathbf{V}\, \mathrm{diag}(\mathbf{t})\, \mathbb{1}_N$ and $\mathbf{p}_{y|x} = \mathrm{diag}(\mathbf{t})\, \mathbf{V}^{\mathrm T}\, \mathrm{diag}(\mathbf{s})\, \mathbb{1}_M$, leading to:

$$\mathbf{s} = \mathbf{p}_x \oslash (\mathbf{V}\,\mathbf{t}), \qquad \mathbf{t} = \mathbf{p}_{y|x} \oslash (\mathbf{V}^{\mathrm T}\mathbf{s}), \tag{A3}$$

where the notation $\mathbf{x} \oslash \mathbf{y}$ represents the Hadamard element-wise division of equal-length vectors. The form presented in Equation A3 is known as the matrix scaling problem (Borobia and Cantó, 1998) and can be efficiently solved iteratively:

$$\mathbf{s}^{(i)} = \mathbf{p}_x \oslash \big(\mathbf{V}\,\mathbf{t}^{(i-1)}\big), \qquad \mathbf{t}^{(i)} = \mathbf{p}_{y|x} \oslash \big(\mathbf{V}^{\mathrm T}\mathbf{s}^{(i)}\big), \tag{A4}$$

where $i$ is the iteration count and the algorithm is initialized with a positive vector $\mathbf{t}^{(0)} = \mathbb{1}_N$. In our implementation, we set the iteration termination criterion as $\|\mathbf{s}^{(i)} - \mathbf{s}^{(i-1)}\|_2 / \|\mathbf{s}^{(i-1)}\|_2 \le 10^{-4}$ or $i > 300$. After convergence of the solution for $\mathbf{s}$ and $\mathbf{t}$, the optimal joint distribution is obtained as $\mathbf{U}^a = \mathrm{diag}(\mathbf{s})\, \mathbf{V}\, \mathrm{diag}(\mathbf{t})$.
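A direct transcription of Equations A3-A4, including the stated termination criterion, is sketched below in Python (numpy); variable names are illustrative, and for very small $\gamma$ a log-domain stabilization would be advisable.

```python
import numpy as np

def sinkhorn(p_x, p_yx, C, gamma, tol=1e-4, max_iter=300):
    """Solve the entropy-regularized OT problem; returns U^a = diag(s) V diag(t)."""
    V = np.exp(-C / gamma)               # Gibbs kernel of the cost matrix
    t = np.ones(C.shape[1])              # initialization t^(0) = 1_N
    s = p_x / (V @ t)                    # s^(1) = p_x ./ (V t^(0))
    for i in range(1, max_iter + 1):
        t = p_yx / (V.T @ s)             # t^(i) = p_{y|x} ./ (V^T s^(i))
        s_new = p_x / (V @ t)            # s^(i+1) = p_x ./ (V t^(i))
        if np.linalg.norm(s_new - s) / np.linalg.norm(s) <= tol:
            s = s_new
            break
        s = s_new
    return s[:, None] * V * t[None, :]   # equivalent to diag(s) V diag(t)
```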

A2 Covariance Inflation and Localization in Ensemble Kalman Filter
The ensemble size in the Stochastic Ensemble Kalman filter (SEnKF), if much smaller than the state dimension, as in the presented case of the quasi-geostrophic model, leads to underestimation of the forecast error covariance matrix and subsequently to filter divergence. To alleviate this problem, a covariance inflation procedure can be implemented by multiplying the forecast error covariance matrix by an inflation factor $\tau > 1$ (Anderson and Anderson, 1999), whose optimal value depends on the ensemble size and other characteristics of the problem at hand.
The covariance localization procedure in the SEnKF further attempts to improve performance by suppressing spurious long-range dependence in the ensemble background covariance through a prespecified, distance-based tapering of the correlation structure of the field. An SEnKF equipped with a tuned localization procedure can be used efficiently in high-dimensional atmospheric and ocean models even with fewer than 100 ensemble members (Anderson, 2012). Covariance localization in an SEnKF is accomplished by modifying the Kalman gain matrix $\mathbf{K}$ through a Hadamard element-wise product of the forecast error covariance matrix $\mathbf{B} \in \mathbb{R}^{m \times m}$ with a distance-based correlation matrix $\boldsymbol{\rho} \in \mathbb{R}^{m \times m}$:

$$\mathbf{K} = (\boldsymbol{\rho} \circ \mathbf{B})\, \mathbf{H}^{\mathrm T} \big(\mathbf{H}\, (\boldsymbol{\rho} \circ \mathbf{B})\, \mathbf{H}^{\mathrm T} + \mathbf{R}\big)^{-1}, \tag{A5}$$

where $\mathbf{X} \circ \mathbf{Y}$ represents the Hadamard element-wise product between equal-size matrices $\mathbf{X}$ and $\mathbf{Y}$.

In our implementation of the SEnKF in the QG model, the inflation factor and the localization length scale were chosen in the ranges $\tau = 1.01$-$1.08$ and $d = 400$-$1800$ km, respectively, depending on the experimental setup, through trial-and-error analysis to minimize the root mean squared error.
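A minimal sketch of these two procedures is given below (numpy). The Gaspari-Cohn taper is a common choice for the distance-based correlation $\boldsymbol{\rho}$, though the paper does not name its specific function, and the grid spacing and tuning values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def gaspari_cohn(z):
    # Fifth-order piecewise-rational correlation function (Gaspari and Cohn, 1999),
    # with z = distance / c; the correlation vanishes for z >= 2.
    z = np.abs(z)
    f = np.zeros_like(z)
    m1 = z <= 1.0
    m2 = (z > 1.0) & (z < 2.0)
    z1, z2 = z[m1], z[m2]
    f[m1] = -0.25 * z1**5 + 0.5 * z1**4 + 0.625 * z1**3 - (5.0 / 3.0) * z1**2 + 1.0
    f[m2] = (z2**5 / 12.0 - 0.5 * z2**4 + 0.625 * z2**3 + (5.0 / 3.0) * z2**2
             - 5.0 * z2 + 4.0 - 2.0 / (3.0 * z2))
    return f

# Illustrative sizes and tuning values (within the ranges quoted above).
m_state, n_ens, tau, d_loc = 100, 50, 1.05, 800.0
X = rng.normal(size=(m_state, n_ens))               # forecast ensemble (state x members)
Xp = X - X.mean(axis=1, keepdims=True)
B = tau * (Xp @ Xp.T) / (n_ens - 1)                 # inflated sample forecast covariance

grid = np.arange(m_state) * 100.0                   # toy 1-D grid with 100 km spacing
dist = np.abs(grid[:, None] - grid[None, :])
rho = gaspari_cohn(dist / (d_loc / 2.0))            # taper; support chosen so rho = 0 at d_loc
B_loc = rho * B                                     # Hadamard (Schur) localization of B
```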