Application of ensemble transform data assimilation methods for parameter estimation in nonlinear problems

Abstract. Over the years, data assimilation methods have been developed to obtain estimates of uncertain model parameters by taking into account a few observations of a model state. However, most of these computationally affordable methods rest on assumptions of Gaussianity, e.g. the ensemble Kalman filter. The ensemble transform particle filter does not rely on the Gaussianity assumption and has proven highly beneficial for initial condition estimation and the estimation of a small number of parameters in chaotic dynamical systems with non-Gaussian distributions. In this paper we employ the Ensemble Transform Particle Smoother (ETPS) and Ensemble Transform Kalman Smoother (ETKS) for parameter estimation in nonlinear problems with 1, 5, and 2500 uncertain parameters, and compare them to importance sampling (IS). We prove that the updated parameters obtained by the ETPS lie within the range of an initial ensemble, which is not the case for the ETKS. We examine the performance of the ETPS and ETKS in a twin experiment setup, where observations of pressure are synthetically created based on the known values of parameters. The numerical experiments demonstrate that the ETKS provides good estimates of the mean parameters but not of the posterior distributions, and as the ensemble size increases the posterior does not improve. The ETPS provides good approximations of the posterior, and as the ensemble size increases the posterior converges to the posterior obtained by IS with a large ensemble. The ETKS is very robust, while the ETPS is very sensitive with respect to the initial ensemble. An increase in the root mean square error after data assimilation in the ETPS for a high-dimensional test problem is resolved by applying distance-based localization, which, however, deteriorates the posterior estimation.


Introduction
If the solution of a partial differential equation (PDE) is highly sensitive to its parameters, accurate estimation of the parameters and their uncertainties is essential to obtain a faithful approximation of the solution. Partial observations of the solution are then used to infer the uncertain parameters by solving a PDE-constrained inverse problem. For instance, one can approach such problems via methods induced by Bayes's formula (Stuart, 2010). More specifically, the posterior probability density of the parameters given the data is then computed on the basis of a prior probability density and a likelihood, which is the conditional probability density associated with the given noisy observations. Well-posedness of an inverse problem and convergence to the true posterior in the limit of observational noise going to zero was proven for different priors and under assumptions on the parameter-to-observation map by Dashti and Stuart (2017), for example.
In order to obtain highly correlated samples, one can employ the optimal transport resampling that lies at the heart of the ensemble transform particle filter (ETPF) proposed by Reich (2013). An optimal transport map between two consecutive probability measures provides a direct sample-to-sample map with maximized sample correlation. Along the lines of an adaptive SMC approach, a probability measure is described via the importance weights, and the deterministic mapping replaces the traditional resampling step. A so-called tempered ensemble transform particle filter (TETPF) was proposed by Ruchi et al. (2019).
Note that this ansatz does not require any distributional assumption on the posterior, and it was shown by Ruchi et al. (2019) that the TETPF provides encouraging results for nonlinear high-dimensional PDE-constrained inverse problems. However, the computational cost of solving an optimal transport problem in each iteration is considerable.

In this work we address two issues arising in the context of the TETPF: (i) the immense computational cost of solving the associated optimal transport problem and (ii) the lack of robustness of the TETPF with respect to high dimensional problems.
More specifically, the performance of the ETPF has been found to be highly dependent on the initial guess. Although tempering prevents an outright failure of the importance sampling step due to a poor initial ensemble selection, the number of required intermediate steps and the efficiency of the ETPF still depend on it. Chustagulprom et al. (2016) suggested that the lack of robustness in high dimensions can be addressed via a hybrid approach that combines a Gaussian approximation with the ETPF.
Furthermore, Acevedo et al. (2017) suggested that the computational complexity of the ETPF can be significantly reduced via a Sinkhorn approximation to the underlying transport problem. Along the lines of Chustagulprom et al. (2016) and de Wiljes et al. (2020), we propose a tempered ensemble transform particle filter with Sinkhorn approximation (TESPF) and a tempered hybrid approach.
The remainder of the manuscript is organised as follows: in Sect. 2, the inverse problem setting is presented. There we describe the tempered ensemble transform particle filter (TETPF) proposed by Ruchi et al. (2019). Furthermore, we introduce the tempered ensemble transform particle filter with Sinkhorn approximation (TESPF), a tempered hybrid approach that combines EKI and TETPF (hybrid EKI-TETPF), and a tempered hybrid approach that combines EKI and TESPF (hybrid EKI-TESPF).
We discuss computational complexities of all the presented techniques and provide corresponding pseudocodes in Appendix A.
In Sect. 3, we apply the adaptive SMC methods to an inverse problem of inferring high dimensional permeability parameters for a steady-state single-phase Darcy flow model. Permeability is parameterized following Ruchi et al. (2019), where one configuration of the parametrization leads to Gaussian posteriors, while the other leads to non-Gaussian posteriors. Finally, we draw conclusions in Sect. 4.
2 Bayesian inverse problem

We assume u ∈ Ũ ⊂ R^n is a random variable that is related to partially observable quantities y ∈ Y ⊂ R^κ by a nonlinear forward operator F : Ũ → Y, namely

y = F(u).
Further, y_obs ∈ Y denotes a noisy observation of y, i.e.,

y_obs = y + η,

where η ∼ N(0, R). The aim is to determine or approximate the posterior measure µ(u) conditioned on the observation y_obs and given a prior measure µ_0(u); this is referred to as a Bayesian inverse problem. The posterior measure is absolutely continuous with respect to the prior, i.e.,

µ(u) ∝ g(u; y_obs) µ_0(u),    (1)

where ∝ is up to a constant of normalisation and g is referred to as the likelihood and depends on the forward operator F. The Gaussian observation noise of the observation y_obs implies

g(u; y_obs) ∝ exp( −(1/2) (y_obs − F(u))^T R^{−1} (y_obs − F(u)) ),    (2)

where ^T denotes the transpose.

Tempered Sequential Monte Carlo
We consider sequential Monte Carlo (SMC) methods that approximate the posterior measure µ(u) via an empirical measure

µ^M(u) = Σ_{i=1}^M w_i δ(u − u_i).    (3)

Here δ is the Dirac function, and the importance weights for the approximation of µ are

w_i = g(u_i; y_obs) / Σ_{j=1}^M g(u_j; y_obs).
An ensemble U = {u 1 , . . . , u M } ⊂Ũ consists of M realizations u i ∈ R n of a random variable u that are independent and identically distributed according to u i ∼ µ 0 .

When an easy-to-sample-from prior µ_0 does not approximate the complex posterior µ well, only a few weights w_i have significant value, resulting in a degenerate approximation of the posterior measure. Potential reasons for this effect are high dimensionality of the uncertain parameter, a large number of observations, or high accuracy of the observations. An existing solution to a degenerate approximation is an iterative approach based on tempering by Del Moral et al. (2006) or annealing by Neal (2001). The underlying idea is to introduce T intermediate artificial measures {µ_t}_{t=0}^T between µ_0 and µ_T = µ. These measures are bridged by introducing T tempering parameters {φ_t}_{t=1}^T that satisfy 0 = φ_0 < φ_1 < ... < φ_T = 1. An intermediate measure µ_t is defined as a probability measure that has density proportional to g(u; y_obs)^{φ_t − φ_{t−1}} with respect to the previous measure µ_{t−1}. Along the lines of Iglesias (2016), the tempering parameter φ_t is chosen such that the effective ensemble size (ESS),

ESS_t(φ) = 1 / Σ_{i=1}^M (w_{t,i})²,  with  w_{t,i} ∝ g(u_{t−1,i}; y_obs)^{φ − φ_{t−1}},  Σ_{i=1}^M w_{t,i} = 1,

does not drop below a certain threshold 1 < M_thresh < M. Then an approximation of the posterior measure µ_t is

µ_t^M(u) = Σ_{i=1}^M w_{t,i} δ(u − u_{t−1,i}).    (4)

A bisection algorithm on the interval (φ_{t−1}, 1] is employed to find φ_t. If ESS_t(1) > M_thresh we set φ_t = 1, which implies that no further tempering is required.
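For concreteness, the adaptive choice of the tempering parameter via bisection on the ESS can be sketched as follows (a minimal Python sketch; the function names and the log-likelihood input array are our own, not taken from the pseudocode in Appendix A):

```python
import numpy as np

def ess(log_g, dphi):
    """Effective sample size of the increment weights w_i ∝ g(u_i)^dphi."""
    lw = dphi * log_g
    lw -= lw.max()                 # stabilise the exponentials
    w = np.exp(lw)
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def next_phi(log_g, phi_prev, M_thresh, tol=1e-10):
    """Bisection on (phi_prev, 1] so that the ESS stays above M_thresh."""
    if ess(log_g, 1.0 - phi_prev) >= M_thresh:
        return 1.0                 # no further tempering required
    lo, hi = phi_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(log_g, mid - phi_prev) >= M_thresh:
            lo = mid
        else:
            hi = mid
    return lo
```

The bisection invariant is that the lower endpoint always satisfies the ESS constraint, so the returned φ_t keeps ESS_t ≥ M_thresh while taking the largest admissible step.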

Mutation
Because the parameters are static, an SMC method requires an ensemble perturbation (mutation) step. We use the ensemble mutation of Cotter et al. (2013) with the target measure µ_t and the reference measure µ_0. The mutation phase is initialized at v_{0,i} = ũ_{t,i}, and at the final inner iteration τ_max we assign u_{t,i} = v_{τ_max,i} for i = 1, ..., M.
For a Gaussian prior we use the preconditioned Crank-Nicolson (pcn-MCMC) method

v_prop,i = m + √(1 − θ²) (v_{τ,i} − m) + θ ξ_{τ,i}.    (5)

Here m is the mean of the Gaussian prior measure µ_0, and {ξ_{τ,i}}_{i=1}^M are drawn from a Gaussian distribution with zero mean and the covariance matrix of the Gaussian prior measure µ_0.
For a uniform prior U[a, b] we use a random walk

v_prop,i = v_{τ,i} + θ ξ_{τ,i}.    (6)

Then the ensemble at the inner iteration τ + 1 is

v_{τ+1,i} = v_prop,i with probability min{1, a_i}, and v_{τ+1,i} = v_{τ,i} otherwise.

Here v_prop,i is from Eq. (5) for the Gaussian measure and from Eq. (6) for the uniform measure, and a_i = g(v_prop,i; y_obs)^{φ_t} / g(v_{τ,i}; y_obs)^{φ_t} is the acceptance ratio. The scalar θ ∈ (0, 1] controls the performance of the Markov chain. Small values of θ lead to high acceptance rates but poor mixing. Roberts and Rosenthal (2001) showed that for high-dimensional problems it is optimal to choose θ such that the acceptance rate is between 20 % and 30 % by the last tempering iteration T. Cotter et al. (2013) proved that under some assumptions this mutation produces a Markov kernel with an invariant measure µ_t.
Computational complexity. In each tempering iteration t the computational complexity of the pcn-MCMC mutation is O(τ_max M C), where C is the computational cost of a forward model F. For the pseudocode of the pcn-MCMC mutation please refer to the Algorithm 1 in Appendix A. Note that the computational complexity is not affected by the length of u, which is a very desirable property in high dimensions, as shown by Cotter et al. (2013) and Hairer et al. (2014).
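A single pcn-MCMC mutation step for a Gaussian prior might look as follows (a sketch under our own naming; the identity prior covariance and the negative log-likelihood callable `neg_log_lik`, i.e. Φ(u) with g = exp(−Φ), are illustrative assumptions):

```python
import numpy as np

def pcn_step(v, m, theta, phi_t, neg_log_lik, rng):
    """One preconditioned Crank-Nicolson step targeting mu_t, prior N(m, I)."""
    xi = rng.standard_normal(v.shape)             # draw from the prior covariance
    v_prop = m + np.sqrt(1.0 - theta ** 2) * (v - m) + theta * xi
    # pCN proposals are prior-reversible, so the acceptance ratio only
    # involves the tempered likelihood g^{phi_t} = exp(-phi_t * Phi)
    log_a = phi_t * (neg_log_lik(v) - neg_log_lik(v_prop))
    if np.log(rng.uniform()) < log_a:
        return v_prop, True
    return v, False
```

In practice θ would be tuned so that the empirical acceptance rate lands in the 20–30 % range quoted above.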

Resampling phase
As we have already mentioned in Sect. 2.1, an adaptive SMC method with importance sampling needs to be supplied with resampling at each tempering iteration t. We consider a resampling method based on the optimal transport mapping proposed by Reich (2013).

Optimal transformation
The origin of optimal transport theory lies in finding an optimal way of redistributing mass, first formulated by Monge (1781). Given a distribution of matter, e.g., a pile of sand, the underlying question is how to reshape the matter into another form such that the work done is minimal. The original problem was later rewritten by Kantorovich (1942) in a statistical framework that allowed it to be tackled. Due to these contributions it was later named the Monge-Kantorovich minimization problem.
Let us consider a scenario where the initial distribution of matter is represented by a probability measure µ on the measurable space Ũ, which has to be moved and rearranged according to a given new distribution ν, defined on the measurable space Ṽ. Then we seek a joint probability measure ω* that is a solution to

inf_ω ∫_{Ũ×Ṽ} c(u, ũ) ω(du, dũ),    (9)

where the infimum is computed over all joint probability measures ω on Ũ × Ṽ with marginals µ and ν, and c(u, ũ) is a transport cost function on Ũ × Ṽ. The joint measures achieving the infimum are called optimal transport plans.
Let µ and ν be two measures on a measurable space (Ω, F) such that µ is the law of a random variable U : Ω → Ũ and ν is the law of a random variable V : Ω → Ṽ. Then a coupling of (µ, ν) consists of a pair (U, V). Note that couplings always exist; an example is the trivial coupling in which the random variables U and V are independent. A coupling is called deterministic if there exists a measurable function Ψ_M : Ũ → Ṽ such that V = Ψ_M(U); Ψ_M is then called a transport map. Unlike general couplings, deterministic couplings do not always exist. On the other hand, there may be infinitely many deterministic couplings.
One famous variant of Eq. (9), where the optimal coupling is known to be a deterministic coupling, is given by

W_2(µ, ν)² = inf_{ω ∈ Π(µ,ν)} ∫_{Ũ×Ṽ} ‖u − ũ‖² ω(du, dũ),    (10)

where Π(µ, ν) denotes the set of measures joining µ and ν (see Villani (2008) for details). The aim of the resampling step is to obtain equally probable samples. Therefore, in the resampling based on optimal transport of Reich (2013), the Monge-Kantorovich minimization problem Eq. (10) is considered for the current posterior measure µ_t^M(u), given by its sample approximation Eq. (4), and a uniform measure (here the weights in the sample approximation are set to 1/M). The discretized objective functional of the associated optimal transport problem is given by

S* = arg min_S Σ_{i,j=1}^M s_ij ‖u_{t,i} − u_{t,j}‖²  subject to  s_ij ≥ 0,  Σ_{j=1}^M s_ij = w_{t,i},  Σ_{i=1}^M s_ij = 1/M,    (11)

where the matrix S describes a joint probability measure under the assumption that the state space is finite.
The equally weighted samples {ũ_{t,j}}_{j=1}^M are then obtained by a deterministic linear transform, i.e.,

ũ_{t,j} = M Σ_{i=1}^M s*_ij u_{t,i},

see Reich (2013).

Computational complexity. The computational complexity of the TETPF is O(T(τ_max M C + M³ log M)), where T is the number of tempering iterations, τ_max is the number of pcn-MCMC inner iterations, and C is the computational cost of a forward model F. For the pseudocode of the TETPF please refer to the Algorithm 4 in Appendix A.
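For small ensembles, the discretized transport problem and the subsequent linear transform can be realised directly as a linear program (a sketch using SciPy's generic LP solver rather than a dedicated EMD solver, which one would use in practice; the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def etpf_resample(U, w):
    """Optimal-transport resampling: U is (M, n), w are importance weights."""
    M, _ = U.shape
    # pairwise squared distances as the transport cost, flattened row-major
    cost = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1).ravel()
    # row-sum constraints: sum_j S_ij = w_i
    A_rows = np.kron(np.eye(M), np.ones(M))
    # column-sum constraints: sum_i S_ij = 1/M
    A_cols = np.kron(np.ones(M), np.eye(M))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([w, np.full(M, 1.0 / M)])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    S = res.x.reshape(M, M)
    return M * S.T @ U          # equally weighted transformed ensemble
```

By the marginal constraints the transform exactly preserves the weighted ensemble mean; with uniform weights the optimal plan is the (scaled) identity and the ensemble is returned unchanged.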

Sinkhorn approximation
As discussed above, solving the optimal transport problem has a computational complexity of O(M³ log M) in every iteration of the tempering procedure. Thus the TETPF becomes very expensive for large M. On the other hand, an increase in the number of samples directly correlates with an improved accuracy of the estimation. In order to allow for as many samples as possible, one needs to reduce the associated computational cost of the optimal transport problem. This can be achieved by replacing the optimal transport distance with a Sinkhorn distance and subsequently exploiting the new structure to elude the immense computational time of the EMD solver, as shown by Cuturi (2013). More precisely, the ansatz is built on the fact that the original transport problem has a natural entropic bound that is obtained for S = (1/M) I_M w^T, where w = [w_1, ..., w_M]^T and I_M = [1, ..., 1]^T ∈ R^M, which constitutes an independent joint probability. Therefore one can consider the problem of finding a matrix S ∈ R^{M×M} that is constrained by an additional lower entropic bound (Sinkhorn distance). This additional constraint can be incorporated via a Lagrange multiplier, which leads to the regularised form

S* = arg min_S Σ_{i,j=1}^M s_ij z_ij − α h(S),  h(S) = −Σ_{i,j=1}^M s_ij log s_ij,    (12)

where α > 0 and S satisfies the marginal constraints. Due to the additional smoothness, the minimum of Eq. (12) is unique and has the form

S* = diag(a) exp(−Z/α) diag(b),

where Z is the matrix with entries z_ij = ‖u_{t−1,i} − u_{t−1,j}‖², and a and b are non-negative vectors determined by employing Sinkhorn's fixpoint iteration described by Sinkhorn (1967). We will refer to this approach as the tempered ensemble Sinkhorn particle filter (TESPF).

Computational complexity. Solving this regularised optimal transport problem rather than the original transport problem given in Eq. (9) reduces the complexity to O(M² C(α)). For the pseudocode of the Sinkhorn adaptation of solving the optimal transport problem please refer to the Algorithm 3 in Appendix A. For the pseudocode of the TESPF please refer to the Algorithm 4 in Appendix A.
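The Sinkhorn fixpoint iteration itself is only a few lines (a sketch; the fixed iteration count and the function name are our own choices):

```python
import numpy as np

def sinkhorn_coupling(w, Z, alpha, n_iter=2000):
    """Entropy-regularised coupling between weights w and the uniform measure.

    Z holds squared distances z_ij = ||u_i - u_j||^2;
    returns S = diag(a) exp(-Z/alpha) diag(b).
    """
    M = len(w)
    K = np.exp(-Z / alpha)                    # Gibbs kernel
    b = np.ones(M)
    for _ in range(n_iter):
        a = w / (K @ b)                       # match row marginals w_i
        b = np.full(M, 1.0 / M) / (K.T @ a)   # match column marginals 1/M
    return a[:, None] * K * b[None, :]
```

Each iteration costs O(M²) matrix–vector products, which is the source of the O(M² C(α)) complexity quoted above; larger α makes the kernel better conditioned and the iteration converge faster, at the price of a more diffuse coupling.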
Ensemble Kalman inversion

For Bayesian inverse problems with Gaussian measures, ensemble Kalman inversion (EKI) is one of the most widely used algorithms.
EKI is an adaptive SMC method that approximates the first two statistical moments of a posterior distribution. For a linear forward model, EKI is optimal in the sense that it minimizes the error in the mean (Blömker et al., 2019). For a nonlinear forward model, EKI still provides a good estimate of the posterior (e.g., Iglesias et al., 2018). Here we consider the EKI method of Iglesias et al. (2018), since it is based on the tempering approach.

The intermediate measures {µ_t}_{t=0}^T are approximated by Gaussian distributed variables with empirical mean m_t and empirical covariance C_t. Given the ensemble {u_{t−1,i}}_{i=1}^M with empirical mean m_{t−1} and empirical covariance C_{t−1}, the ensemble members are updated by

u_{t,i} = u_{t−1,i} + C^{uy}_{t−1} (C^{yy}_{t−1} + ∆_t R)^{−1} (y_{t,i} − F(u_{t−1,i})).    (13)

Here C^{uy}_{t−1} is the empirical cross-covariance between the parameters and the model observations F(u_{t−1,i}), C^{yy}_{t−1} is the empirical covariance of the model observations, y_{t,i} = y_obs + η_{t,i}, and η_{t,i} ∼ N(0, ∆_t R) with ∆_t = (φ_t − φ_{t−1})^{−1} for i = 1, ..., M.
Computational complexity. By implementing a sequential observation update of Whitaker et al. (2008), the computational complexity of solving Eq. (13) can be reduced to O(2Mκn), where n is the parameter space dimension and κ is the observation space dimension. The overall computational complexity of the EKI is O(T(τ_max M C + 2Mκn)), where T is the number of tempering iterations, τ_max is the number of pcn-MCMC inner iterations, and C is the computational cost of a forward model F. For the pseudocode of the EKI method please refer to the Algorithm 5 in Appendix A.
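A minimal ensemble implementation of the update Eq. (13) could read as follows (a sketch; the sequential matrix-free update of Whitaker et al. (2008) used to reach O(2Mκn) is replaced here by a plain dense solve for readability, and the function name is ours):

```python
import numpy as np

def eki_update(U, Y, y_obs, R, Delta_t, rng):
    """One tempered EKI step: U is (M, n) parameters, Y is (M, kappa) outputs."""
    M = U.shape[0]
    dU = U - U.mean(axis=0)
    dY = Y - Y.mean(axis=0)
    C_uy = dU.T @ dY / (M - 1)                       # cross-covariance C^{uy}
    C_yy = dY.T @ dY / (M - 1)                       # output covariance C^{yy}
    gain = C_uy @ np.linalg.inv(C_yy + Delta_t * R)  # Kalman gain
    # perturbed observations y_{t,i} = y_obs + eta_{t,i}
    eta = rng.multivariate_normal(np.zeros(len(y_obs)), Delta_t * R, size=M)
    return U + (y_obs + eta - Y) @ gain.T
```

For a well-informed observation the update contracts the ensemble mean toward the data, which is the behaviour exploited in each tempering iteration.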

Hybrid
Despite the underlying Gaussian assumption, the EKI is remarkably robust in non-linear high-dimensional settings, as opposed to consistent SMC methods such as the TET(S)PF. For many non-linear problems it is desirable to have better uncertainty estimates while maintaining a level of robustness. This can be achieved by factorising the likelihood given by Eq. (2), e.g.,

g(u; y_obs) = g_1(u; y_obs) · g_2(u; y_obs),

where

g_1(u; y_obs) = g(u; y_obs)^{1−β}  and  g_2(u; y_obs) = g(u; y_obs)^{β},  β ∈ [0, 1].

Then it is possible to alternate between methods with complementing properties, such as the EKI and the TET(S)PF updates: e.g., the likelihood g_1 is used for an EKI update, followed by an update with a TET(S)PF on the basis of g_2, so that β = 0 corresponds to the pure EKI and β = 1 to the pure TET(S)PF.
This combination of an approximative Gaussian method and a consistent SMC method has been referred to as a hybrid filter in the data assimilation literature 1 (Chustagulprom et al., 2016; Frei and Künsch, 2013). This ansatz can also be understood as using the EKI as a more elaborate proposal density for the importance sampling step within SMC.
1 Note that the terminology is also used in the context of data assimilation filters combining variational and sequential approaches.
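In a tempering iteration this factorisation amounts to splitting the tempering exponent between the EKI part and the particle-filter part (a sketch of the bookkeeping under the convention that β = 0 is pure EKI and β = 1 is the pure particle filter; function and variable names are ours):

```python
import numpy as np

def split_tempering_increment(log_g, dphi, beta):
    """Split the likelihood increment g^dphi into an EKI part g^{(1-beta)dphi}
    and a particle-filter part g^{beta dphi}."""
    dphi_eki = (1.0 - beta) * dphi          # exponent consumed by the EKI update
    lw = beta * dphi * log_g                # log importance weights for TET(S)PF
    lw -= lw.max()
    w = np.exp(lw)
    return dphi_eki, w / w.sum()
```

For β = 0 the particle-filter weights are uniform (nothing is left for the resampling step), while for β = 1 the EKI step is skipped and the full increment goes into the importance weights.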

Parameterisation of permeability
We consider the following two parameterisations of the permeability function k(x, y):

P1: log permeability over the entire domain D, u(x, y) = log k(x, y);

P2: permeability over a domain D that has a channel, k(x, y) = k_1(x, y) δ_{D_c}(x, y) + k_2(x, y) δ_{D\D_c}(x, y), as by Iglesias et al. (2014).
Here D_c denotes the channel, δ the indicator function, and k_1 = exp(u_1(x, y)) and k_2 = exp(u_2(x, y)) denote the permeabilities inside and outside the channel, respectively. The geometry of the channel is parameterized by five parameters {d_i}_{i=1}^5: amplitude, frequency, angle, initial point, and width, respectively. The lower boundary of the channel is given by y = d_1 sin(d_2 x/6) + tan(d_3) x + d_4.

The upper boundary of the channel is obtained by shifting the lower boundary by the width d_5. These parameters are depicted in Fig. 1.
We assume the log permeability for both P1 and P2 is drawn from a Gaussian distribution µ_0 = N(m, C) with mean m and covariance C. We define C via the Whittle-Matérn correlation function defined by Matérn (1986),

c(x, y) = (‖x − y‖/υ) Υ_1(‖x − y‖/υ),

where υ = 0.5 is the characteristic length scale and Υ_1 is the modified Bessel function of the second kind of order 1; the prefactor 2^{1−ν}/γ(ν) of the general Whittle-Matérn family, with γ the gamma function, equals one for the order ν = 1 used here.
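Assembling such a prior correlation matrix on a set of grid points can be sketched as follows (using the standard order-1 Whittle-Matérn form with the Bessel function mentioned above; function and variable names are ours):

```python
import numpy as np
from scipy.special import kv

def whittle_matern_cov(X, ell):
    """Order-1 Whittle-Matern correlation c(r) = (r/ell) K_1(r/ell), c(0) = 1.

    X: (N, 2) array of grid-point coordinates; ell: characteristic length scale.
    """
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = r / ell
    safe = np.where(s > 0.0, s, 1.0)          # avoid evaluating K_1 at 0
    # the limit s*K_1(s) -> 1 as s -> 0 gives unit variance on the diagonal
    return np.where(s > 0.0, safe * kv(1, safe), 1.0)
```

The resulting matrix is symmetric positive semi-definite, so a Cholesky (or eigen-) factorisation of it can be used to draw the Gaussian prior samples for the log permeability.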
We implement a cell-centered finite difference method to solve the forward model. For P1, the prior for log permeability is a Gaussian distribution with mean 5; the grid dimension is 70, and the uncertain parameter is the log permeability. For P2, the prior for log permeability is a Gaussian distribution with mean 15 outside the channel and with mean 100 inside the channel; the grid dimension is 50. Log permeability inside the channel u_1 = {u_{1,ℓ}}_{ℓ=1}^{N²} and log permeability outside the channel u_2 = {u_{2,ℓ}}_{ℓ=1}^{N²} are defined over the entire 50 × 50 domain. Therefore, for P2 inference the uncertain parameter u = {d, u_1, u_2} has dimension 5005. Moreover, for P2 we use the Metropolis-within-Gibbs methodology following Iglesias et al. (2014) to separate the geometrical parameters and the log permeability parameters within the mutation step, since it allows one to better exploit the structure of the prior.

Observations
Both the true permeability and the initial ensemble are drawn from the same prior distribution, as the prior includes knowledge about geological properties. However, the initial guess is computed on a coarse grid, while the true solution is computed on a fine grid that has twice the resolution of the coarse grid. The synthetic observations are obtained by

y_obs = L(P_true) + η.
An element of L(P_true) is a linear functional of the pressure, namely

l_j(P_true) = Σ_{i=1}^{N_f²} (∆x²/(2πσ²)) exp(−‖X_i − h_j‖²/(2σ²)) P_true(X_i),  j = 1, ..., κ.

Here σ = 0.01, ∆x² is the size of a grid cell centered at X_i = (x_i, y_i), N_f is the resolution of a fine grid, h_j is the location of the observation, and κ is the number of observations. This form of the observation functional and the parameterizations P1 and P2 guarantee the continuity of the forward map from the uncertain parameters to the observations, and thus the existence of the posterior distribution, as shown by Iglesias et al. (2014). The observation noise η is drawn from a normal distribution with zero mean and known covariance matrix R. We choose the observation noise to be 2 % of the L²-norm of the true pressure. Such a small noise makes the data assimilation problem hard to solve, since the likelihood is very peaked and a non-iterative data assimilation approach fails.

To save computational costs, we choose the ESS threshold M_thresh = M/3 for tempering, and the length of the Markov chain τ_max = 20 for mutation.

Metrics
We conduct numerical experiments with ensemble sizes 100 and 500, and 20 simulations with different initial ensemble realizations to check the robustness of the results. We analyze the methods' performance with respect to a pcn-MCMC solution, from here on referred to as the reference. The MCMC solution was obtained by combining 50 independent chains, each of length 10^6, with a 10^5 burn-in period and 10^3 thinning. For log permeability, we compute the RMSE of the mean,

RMSE = ‖ū − u_ref‖_2 / √n,    (19)

where ū is the ensemble mean and u_ref is the reference solution.
For the geometrical parameters d, we compute the Kullback-Leibler divergence

D_KL = Σ_{b=1}^{M_b} p_ref(d_i^b) log( p_ref(d_i^b) / p(d_i^b) ),

where p_ref(d_i) is the reference posterior, p(d_i) is approximated by the weights, and M_b = M/10 is the chosen number of bins.
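The two metrics can be computed as follows (a sketch with our own histogram choices, assuming equally weighted posterior samples; in the weighted case the histogram counts for p would be replaced by sums of importance weights):

```python
import numpy as np

def rmse(u_mean, u_ref):
    """Root mean square error of the ensemble-mean field against the reference."""
    return float(np.sqrt(np.mean((u_mean - u_ref) ** 2)))

def kl_divergence(ref_samples, samples, n_bins):
    """Histogram estimate of D_KL(p_ref || p) on a shared support."""
    lo = min(ref_samples.min(), samples.min())
    hi = max(ref_samples.max(), samples.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(ref_samples, bins=edges)[0].astype(float) + 1e-12
    q = np.histogram(samples, bins=edges)[0].astype(float) + 1e-12
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

The small additive constant keeps empty bins from producing infinite divergence, which matters for the peaked posteriors encountered here.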

Application to P1 inference
For P1, we perform numerical experiments using 36 uniformly distributed observations, which are displayed as circles in Fig. 3(a). We plot box plots of the RMSE given by Eq. (19) over 20 independent simulations in Fig. 2(a) using the Sinkhorn approximation and in Fig. 2(b) using optimal transport. The x-axis shows the hybrid parameter β, whose value 0 corresponds to EKI and 1 to an adaptive SMC method with either Sinkhorn approximation (TESPF) or optimal transport (TETPF). Ensemble size M = 100 is shown in red and M = 500 in green. First, we observe that at the small ensemble size 100 the TESPF outperforms the TETPF. Since the Sinkhorn approximation is a regularization of an optimal transport solution, the TESPF provides a smoother solution than the TETPF, which can be seen in Fig. 3(c) and Fig. 3(f), respectively, where we plot the mean log permeability. Next, we see in Fig. 2 that the hybrid approach decreases the RMSE compared to the TET(S)PF: the smaller β, the smaller the median of the RMSE. EKI gives the smallest error due to the Gaussian parametrization of permeability. The advantage of the hybrid approach is most pronounced at the large ensemble size 500 with optimal transport resampling.
We plot the mean log permeability at ensemble size 100 for the simulation with the smallest RMSE over 20 simulations in Fig. 3(b)-(f) and of the reference in Fig. 3(a). We see that EKI and TETPF(0.2) estimate well not only the large-scale features but also the small-scale features (e.g., the negative mean at the top right corner), unlike TET(S)PF and TESPF(0.2).

Application to P2 inference
For P2, we perform numerical experiments using 9 uniformly distributed observations, which are displayed as circles in Fig. 9(a). First, we display the results obtained by the Sinkhorn approximation. In Fig. 4, we plot the KL divergence of the geometrical parameters of the channel. We see that EKI outperforms any TESPF(·), including the TESPF, for amplitude (a) and width (e). This is due to the Gaussian-like posteriors of these two geometrical parameters displayed in Fig. 6(c) and Fig. 6(o), respectively. Due to the Gaussian-like posteriors, the hybrid approach decreases the RMSE compared to the TESPF: the smaller β, the smaller the median of the RMSE.
For frequency, angle, and initial point, whose KL divergence is displayed in Fig. 4(b), (c), and (d), respectively, the behaviour of the adaptive SMC is nonlinear in terms of β. This is due to the non-Gaussian-like posteriors of these three geometrical parameters. For the hybrid of TESPF and EKI, there exists a β ≠ 0 for which the KL divergence is lowest, though it is inconsistent between the geometrical parameters.
When comparing TESPF(·) to TETPF(·), we observe the same type of behaviour in terms of β: linear for amplitude and width, whose KL divergence is displayed in Fig. 5(a) and (e), respectively, and nonlinear for frequency, angle, and initial point, whose KL divergence is displayed in Fig. 5(b), (c), and (d), respectively. However, the KL divergence is smaller when optimal transport resampling is used instead of the Sinkhorn approximation.
In Fig. 6, we display the posteriors of the geometrical parameters. Now we investigate the adaptive SMC performance for permeability estimation. First, we display the results obtained by the Sinkhorn approximation. We plot box plots over 20 independent simulations of the RMSE given by Eq. (19) for log permeability outside the channel in Fig. 7(a) and inside the channel in Fig. 7(b). Even though the log permeability is Gaussian distributed, for the small ensemble size 100 there exists a β ≠ 0 that gives the lowest RMSE both outside and inside the channel. As the ensemble size increases, the methods' performance becomes equivalent. Next, we compare TESPF(·) to TETPF(·) for the log permeability estimation outside and inside the channel, whose RMSE is displayed in Fig. 8(a) and (b), respectively. We observe the same type of behaviour in terms of β: nonlinear for the small ensemble size 100, and equivalent for the larger ensemble size 500. Furthermore, at the small ensemble size 100 the TESPF outperforms the TETPF, which was also the case for the P1 parameterization (Sect. 3.4).
In Fig. 9, we plot the mean log permeability over the channelized domain at ensemble size 100 for the simulation with the smallest RMSE over 20 simulations: the reference in Fig. 9(a), and TESPF(0.2) (b), TESPF (c), EKI (d), TETPF(0.2) (e), and TETPF (f).
We see that TESPF(0.2) does an excellent job at such a small ensemble size, estimating well the log permeability outside and inside the channel, as well as the parameters of the channel itself.

Conclusions
A Sinkhorn adaptation of the previously proposed TETPF, namely the TESPF, has been introduced and numerically investigated on a parameter estimation problem. The TESPF has a considerably smaller computational complexity than the TETPF, yet has similar accuracy results (see Figs. 6, 7 and 8). In particular, the TESPF outperforms the EKI for non-Gaussian distributed parameters (e.g., initial point and angle in P2). This makes the proposed method a promising option for the high dimensional nonlinear problems one is typically faced with in geophysical applications. Further, to counterbalance potential robustness problems of the TETPF and its Sinkhorn adaptation, a hybrid between EKI and TET(S)PF is proposed and studied by means of the two configurations of the steady-state single-phase Darcy flow model. The combination of the two adaptive SMC methods with complementing properties, i.e., β ∈ (0, 1), is superior to the individual adaptive SMC methods, i.e., β = 0 or 1, for all non-Gaussian distributed parameters, and performs better than the pure TETPF and TESPF for Gaussian distributed parameters in P1. This suggests that a hybrid approach provides all the desirable properties required to obtain robust and highly accurate approximate solutions of nonlinear high dimensional Bayesian inference problems.