
NPG

Nonlinear Processes in Geophysics


Nonlin. Processes Geophys.

1607-7946

Copernicus Publications

Göttingen, Germany

10.5194/npg-24-701-2017

The Onsager–Machlup functional for data assimilation

Nozomi Sugiura

Research and Development Center for Global Change, JAMSTEC, Yokosuka, Japan

Correspondence: Nozomi Sugiura (nsugiura@jamstec.go.jp)

1 December 2017

Volume 24, issue 4, pages 701–712. Dates in the article record: 26 July 2017, 27 July 2017, 26 October 2017, 26 October 2017.

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://npg.copernicus.org/articles/24/701/2017/npg-24-701-2017.html

The full text article is available as a PDF file from https://npg.copernicus.org/articles/24/701/2017/npg-24-701-2017.pdf

When taking the model error into account in data assimilation, one needs to evaluate the prior distribution represented by the Onsager–Machlup functional. Through numerical experiments, this study clarifies how the prior distribution should be incorporated into cost functions for discrete-time estimation problems. Consistent with previous theoretical studies, the divergence of the drift term is essential in weak-constraint 4D-Var (w4D-Var), but it is not necessary in Markov chain Monte Carlo with the Euler scheme. Although the former property may cause difficulties when implementing w4D-Var in large systems, this paper proposes a new technique for estimating the divergence term and its derivative.

Introduction

In traditional weak-constraint 4D-Var settings, a quadratic cost function is defined as the negative logarithm of the probability of each sample path, a form that is suitable for path sampling. The optimisation problem is naively described as finding the most probable path by minimising this quadratic cost function. However, the term “the most probable path” does not make sense in this context, because the paths are not countable. The concern is not about ranking the individual path probabilities, but about seeking the route with the densest path population. To define the optimisation problem properly, one should introduce a macroscopic variable $\phi=\phi(t)$ that represents a smooth curve, and a measure that accounts for how densely the paths are populated in the $\epsilon$-neighbourhood centred at $\phi$, which can be termed “the tube”. The problem is then defined as finding the most probable tube $\phi$, which represents the maximum a posteriori (MAP) estimate of the path distribution. Mathematicians pioneering the theory of stochastic differential equations (SDEs) have been aware of this subtle point since the 1980s, and established the proper form of the cost function as the Onsager–Machlup (OM) functional for the tube.

The aim of this work is to organise existing knowledge about the OM functional into a form that can be used to represent model errors in data assimilation, i.e. numerical evaluation of non-linear smoothing problems.

Throughout this article, we consider non-linear smoothing problems of the form

$$dx_t = f(x_t)\,dt + \sigma\,dw_t,$$
$$x_0 \sim N(x_b, \sigma_b^2 I),$$
$$(\forall m \in M)\quad y_m \mid x_m \sim N(x_m, \sigma_o^2 I),$$

where $t$ is time, $x$ is a $D$-dimensional stochastic process, $w$ is a $D$-dimensional Wiener process, $x_b \in \mathbb{R}^D$ is the background value of the initial condition, $\sigma_b > 0$ is the standard deviation of the background value, $y_m \in \mathbb{R}^D$ is observational data at time $t_m$, $x_m = x_{t_m}$, $t_m = m\,\delta t$, $M$ is the set of observation times, $\sigma_o > 0$ is the standard deviation of the observational data, and $\sigma > 0$ is the noise intensity. Note that there is no need to distinguish the Ito integral from the Stratonovich integral with regard to the discretisation of the SDE, because the noise intensity is a constant.

Before moving on to its applications, we review the concept of the OM functional. To keep the presentation simple, we assume that $D=1$ and $\sigma=1$, and concentrate on the formulation of the prior distribution in the following two sections.

OM functional for path sampling

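The model equation is discretised with the Euler scheme (with the drift term at the previous time) as

$$x_n = x_{n-1} + f(x_{n-1})\,\delta t + \xi_{n-1}, \qquad n = 1, 2, \cdots, N,$$

where $\delta t$ is the time increment, and each $\xi_{n-1}$ obeys $N(0,\delta t)$. This recursion can be considered a non-linear mapping $F_1: \xi \mapsto x$ from the noise vector $\xi = (\xi_0, \xi_1, \cdots, \xi_{N-1})^T$ to the state vector $x = (x_1, x_2, \cdots, x_N)^T$. The inverse of the mapping is linearised as

$$\begin{pmatrix}\delta\xi_0\\ \delta\xi_1\\ \vdots\\ \delta\xi_{N-1}\end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0\\ -1-\delta t\,f'(x_1) & 1 & & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & -1-\delta t\,f'(x_{N-1}) & 1 \end{pmatrix} \begin{pmatrix}\delta x_1\\ \delta x_2\\ \vdots\\ \delta x_N\end{pmatrix},$$

where $f'$ is the derivative of $f$, and the Jacobian is $|DF_1^{-1}| = |d\xi/dx| = 1$.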

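It is also discretised with the trapezoidal scheme (with the drift term at the midpoint) as

$$x_n = x_{n-1} + \frac{f(x_n) + f(x_{n-1})}{2}\,\delta t + \xi_{n-1}, \qquad n = 1, 2, \cdots, N,$$

which defines a mapping $F_2: \xi \mapsto x$. The inverse of the mapping is linearised as

$$\begin{pmatrix}\delta\xi_0\\ \delta\xi_1\\ \vdots\\ \delta\xi_{N-1}\end{pmatrix} = \begin{pmatrix} 1-\frac{\delta t}{2}f'(x_1) & 0 & \cdots & 0 & 0\\ -1-\frac{\delta t}{2}f'(x_1) & 1-\frac{\delta t}{2}f'(x_2) & & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & -1-\frac{\delta t}{2}f'(x_{N-1}) & 1-\frac{\delta t}{2}f'(x_N) \end{pmatrix} \begin{pmatrix}\delta x_1\\ \delta x_2\\ \vdots\\ \delta x_N\end{pmatrix},$$

whose Jacobian is

$$|DF_2^{-1}| = |d\xi/dx| = \prod_{n=1}^{N}\left(1 - \frac{\delta t}{2} f'(x_n)\right) \simeq \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N} f'(x_n)\right).$$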
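The two Jacobians can be checked numerically. The following is a minimal sketch, assuming NumPy and using the drift $f = \tanh$ from Example A purely as a test case; the function name `inverse_map_jacobian` and its `implicit_weight` argument are illustrative and do not come from the paper or its code repository.

```python
import numpy as np

def inverse_map_jacobian(x, dt, fprime, implicit_weight):
    """Matrix d(xi)/d(x) for x = (x_1, ..., x_N).

    implicit_weight = 0.0 gives the Euler scheme,
    implicit_weight = 0.5 gives the trapezoidal scheme.
    """
    n = len(x)
    jac = np.zeros((n, n))
    for k in range(n):
        # derivative of xi_{k} with respect to x_{k+1} (diagonal entry)
        jac[k, k] = 1.0 - implicit_weight * dt * fprime(x[k])
        if k > 0:
            # derivative with respect to the previous state (sub-diagonal)
            jac[k, k - 1] = -1.0 - (1.0 - implicit_weight) * dt * fprime(x[k - 1])
    return jac

def tanh_prime(x):
    return 1.0 / np.cosh(x) ** 2

rng = np.random.default_rng(0)
n_steps, dt = 100, 0.05
x = rng.normal(size=n_steps)  # arbitrary test path, not a model trajectory

det_euler = np.linalg.det(inverse_map_jacobian(x, dt, tanh_prime, 0.0))
det_trap = np.linalg.det(inverse_map_jacobian(x, dt, tanh_prime, 0.5))
det_trap_approx = np.exp(-0.5 * dt * np.sum(tanh_prime(x)))
print(det_euler, det_trap, det_trap_approx)
```

Because both matrices are lower triangular, the determinant is the product of the diagonal entries, which is why the Euler scheme gives exactly 1 and the trapezoidal scheme picks up the $f'$ term.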

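Generally, we can assign a measure $\mu_0$ to a cylinder set $\hat\Omega \equiv \hat\Omega_0 \times \hat\Omega_1 \times \cdots \times \hat\Omega_{N-1}$ in the noise space using a density $g$ as follows:

$$\mu_0(\hat\Omega) = \int_{\hat\Omega_0} d\xi_0 \int_{\hat\Omega_1} d\xi_1 \cdots \int_{\hat\Omega_{N-1}} d\xi_{N-1}\; g(\xi_0, \xi_1, \cdots, \xi_{N-1}) = \int_{\hat\Omega} g(\xi)\,\lambda(d\xi) = \int_{\hat\Omega} \mu_0(d\xi),$$

where $\lambda$ is the Lebesgue measure on $\mathbb{R}^N$. In our case, a small area $d\xi$ in the noise space is equipped with the measure

$$\mu_0(d\xi) = g(\xi)\,\lambda(d\xi), \qquad g(\xi) \equiv \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{1}{2\delta t}\sum_{n=1}^{N}\xi_{n-1}^2\right).$$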

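Suppose we have a cylinder set $\Omega \equiv \Omega_1 \times \Omega_2 \times \cdots \times \Omega_N$ in the state space, where each $\Omega_n \subset \mathbb{R}^1$ is on time slice $t = n\,\delta t$. The mapping $F_1$ (or $F_2$) induces a measure through the change of variables from $\xi$ to $x$ with respect to the measure $\mu_0$ as

$$\mu_i(\Omega) = \int_{\Omega_1} dx_1 \int_{\Omega_2} dx_2 \cdots \int_{\Omega_N} dx_N\; (g \circ F_i^{-1})(x_1, x_2, \cdots, x_N)\,|DF_i^{-1}| = \int_{\Omega}\mu_i(dx), \qquad i = 1, 2.$$

In our case, each mapping assigns the following measure to a small area $dx$ in the corresponding state space:

$$\mu_1(dx) \equiv g(F_1^{-1}(x))\,|DF_1^{-1}|\,\lambda(dx) = \frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right)^2\right]\lambda(dx),$$

$$\mu_2(dx) \equiv g(F_2^{-1}(x))\,|DF_2^{-1}|\,\lambda(dx) = \frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left\{\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac12})\right)^2 + f'(x_n)\right\}\right]\lambda(dx),$$

where $f(x_{n-\frac12}) = \dfrac{f(x_n) + f(x_{n-1})}{2}$.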

Measures μ1 and μ2 represent the occurrence probability of the noise seen from the state space, and thus can be used for path sampling.

The change-of-measure argument (see the appendix on the divergence term in a trapezoidal scheme) or the path integral argument shows that similar forms are available for time-continuous and multi-dimensional processes, except that the term $f'(x_t)$ is replaced by $\mathrm{div} f(x_t)$.

OM functional for mode estimate

If we perform path sampling with a sufficient number of paths, in theory we can find the mean of the distribution by averaging the samples, or the mode of the distribution by organising them into a histogram. Still, in some practical applications, we must find the mode of the distribution efficiently by variational methods; computationally, this approach is much cheaper than path sampling. For that purpose, it is tempting to use a quadratic cost function for the minimisation. However, a simple example shows why maximising the path probability $\mu_1$ does not yield the mode of the distribution. Suppose we have a discrete-time stochastic system in $\mathbb{R}^1$, starting from $x_0 = 0$, and we move forward two time steps,

$$x_1 = x_0 + x_0^2\,\delta t + \xi_0 = \xi_0, \qquad x_2 = x_1 + x_1^2\,\delta t + \xi_1 = \xi_0 + \xi_0^2\,\delta t + \xi_1,$$

where $\xi_0$ and $\xi_1$ obey independent normal distributions $N(0, \delta t)$. This may be seen as a discrete version of $dx_t = x_t^2\,dt + dw_t$. It is easy to notice that the mode of the distribution of $(x_1, x_2)$ is not $(0,0)$, owing to the non-linear term $\xi_0^2\,\delta t$. On the other hand, according to the path probability $\mu_1$,

$$\mu_1(dx_1\,dx_2) \propto \exp\left[-\frac{\delta t}{2}\left\{\left(\frac{x_1 - x_0}{\delta t} - x_0^2\right)^2 + \left(\frac{x_2 - x_1}{\delta t} - x_1^2\right)^2\right\}\right]\lambda(dx_1\,dx_2),$$

the best trajectory is $(x_1, x_2) = (0,0)$, which has no noise: $(\xi_0, \xi_1) = (0,0)$. The path with the highest probability is thus at $(x_1, x_2) = (0,0)$, but this is not the route where the paths are most concentrated.

Motivated by this example, we shall investigate a proper strategy to find the route that maximises the density of paths. In this regard, we ask how densely the paths populate in the small neighbourhood of a curve ϕ=ϕ(t) in the state space.

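Assuming that $f$ and $\phi$ are twice continuously differentiable, we evaluate the density of paths in the $\epsilon$-neighbourhoods around a curve $\phi$ connecting points $\{\phi_n,\ n = 1, 2, \cdots, N\}$ with the following integral, where $v_n = x_n - \phi_n$:

$$\begin{aligned}
I_{\epsilon,\delta t}(\phi) &= \int_{\phi_1-\epsilon}^{\phi_1+\epsilon} dx_1 \int_{\phi_2-\epsilon}^{\phi_2+\epsilon} dx_2 \cdots \int_{\phi_N-\epsilon}^{\phi_N+\epsilon} dx_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right)^2\right]\\
&= \int_{-\epsilon}^{\epsilon} dv_1 \int_{-\epsilon}^{\epsilon} dv_2 \cdots \int_{-\epsilon}^{\epsilon} dv_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t} + \frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2\right]\\
&= \int_{-\epsilon}^{\epsilon} dv_1 \int_{-\epsilon}^{\epsilon} dv_2 \cdots \int_{-\epsilon}^{\epsilon} dv_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t}\right)^2\right]\\
&\quad\times \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left\{\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2 + 2\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)\frac{v_n - v_{n-1}}{\delta t}\right\}\right].
\end{aligned}$$

By regarding $v_n$ in the last expression as being generated according to the probability

$$\frac{1}{(2\pi\delta t)^{N/2}} \exp\left[-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t}\right)^2\right],$$

we can interpret the integration as a weighted ensemble average of a random function, up to a numerical constant. The sequence $v_n$ can be set as a random walk $v_0 = 0$, $v_n = \sum_{k=1}^{n}\xi_k$, where $\xi_k$ are independent normal random variables obeying $N(0, \delta t)$. For simplicity, we instead assume that $\xi_k$ takes the values $\pm\sqrt{\delta t}$ with probability 0.5 each, because Donsker's theorem ensures that the resulting walk has the same probability law as the former when $\delta t$ is sufficiently small. We suppose $\sqrt{\delta t} < \epsilon$ so that no single step of the random walk escapes from the $\epsilon$-neighbourhood. Accordingly, the integral is expressed as the ensemble average with respect to random walks confined in the tube $[0, N\delta t] \times [-\epsilon, \epsilon]$:

$$I_{\epsilon,\delta t}(\phi) \propto E_v\left[e^{-J(\phi, v)} \,\middle|\, (\forall n)\ |v_n| < \epsilon\right],$$

$$J(\phi, v) \equiv \frac{\delta t}{2}\sum_{n=1}^{N}\left\{\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2 + 2\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)\frac{v_n - v_{n-1}}{\delta t}\right\},$$

where $E_v$ denotes the ensemble average over the random walks $v$, each of which follows the route $(v_0, v_1, \cdots, v_N)$ and satisfies $|v_n| < \epsilon$ for all $n$.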

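Because $v_{n-1}$ is small, we can apply the expansion $f(v_{n-1} + \phi_{n-1}) = f(\phi_{n-1}) + f'(\phi_{n-1})\,v_{n-1} + O(v^2)$, where $f'$ is the derivative of $f$. Let us accept that the following average containing the higher-order terms $O(v^2)$ converges (see the appendix on the divergence term for a smooth tube):

$$E_v\left[\exp\left(\sum_{n=1}^{N} O(v^2)\,(v_n - v_{n-1})\right) \,\middle|\, (\forall n)\ |v_n| < \epsilon\right] \xrightarrow[\epsilon \to 0]{} 1.$$

As shown in the same appendix, the remaining terms in the exponent $-J(\phi, v)$ are of order $O(\epsilon)$ or smaller, except for the following one:

$$\begin{aligned}
\sum_{n=1}^{N} f'(\phi_{n-1})\,v_{n-1}\,(v_n - v_{n-1}) &= \sum_{n=1}^{N} f'(\phi_{n-1})\left[\tfrac12(v_{n-1} - v_n) + \tfrac12(v_{n-1} + v_n)\right](v_n - v_{n-1})\\
&= \sum_{n=1}^{N} f'(\phi_{n-1})\,\tfrac12(v_{n-1} - v_n)(v_n - v_{n-1}) + \sum_{n=1}^{N} f'(\phi_{n-1})\,\tfrac12\left(v_n^2 - v_{n-1}^2\right)\\
&= -\frac12\sum_{n=1}^{N} f'(\phi_{n-1})\,\xi_n^2 + \frac12\sum_{n=1}^{N-1}\left[f'(\phi(t_{n-1})) - f'(\phi(t_{n-1} + \delta t))\right]v_n^2 + \frac12 f'(\phi_{N-1})\,v_N^2\\
&= -\frac{\delta t}{2}\sum_{n=1}^{N} f'(\phi_{n-1}) + O(\epsilon^2),
\end{aligned}$$

where the last step uses $\xi_n = \pm\sqrt{\delta t}$, $f'(\phi(t_{n-1})) - f'(\phi(t_{n-1} + \delta t)) = O(\delta t)$, and $v_n^2 < \epsilon^2$.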

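Consequently, we obtain the asymptotic expression for the ensemble average when $\epsilon$ is small and $\delta t < \epsilon^2$:

$$\begin{aligned}
I_{\epsilon,\delta t}(\phi) &\propto E_v\left[\exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left\{\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right)^2 + f'(\phi_{n-1})\right\} + O(\epsilon) + \sum_{n=1}^{N} O(v^2)\,(v_n - v_{n-1})\right) \,\middle|\, (\forall n)\ |v_n| < \epsilon\right]\\
&\to \exp\left(-\frac12\int_0^T\left\{\left(\dot\phi(t) - f(\phi(t))\right)^2 + f'(\phi(t))\right\}dt\right).
\end{aligned}$$

The appendix on the divergence term for a smooth tube shows that a similar form is available for time-continuous and multi-dimensional processes, except that the term $f'(\phi(t))$ is replaced by $\mathrm{div} f(\phi(t))$.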

Importantly, the control variable for the optimisation has changed from x to ϕ.

Probabilistic description of data assimilation

Using the OM functional derived in the two preceding sections as a model error term, we shall develop a probabilistic description of data assimilation.

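Based on the argument in the section on path sampling, the model equation has the transition probability at discrete time steps

$$P(x_n \mid x_{n-1}) \propto \exp\left(-\frac{\delta t}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2\right),$$

called the Euler scheme, which uses the drift $f(x_{n-1})$ at the previous time step. The same section also shows that this transition probability has another expression:

$$P(x_n \mid x_{n-1}) \propto \exp\left(-\frac{\delta t}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac12})\right|^2 - \frac{\delta t}{2}\,\mathrm{div} f(x_n)\right),$$

$$f(x_{n-\frac12}) \equiv \frac{f(x_n) + f(x_{n-1})}{2}, \qquad \mathrm{div} f(x) \equiv \sum_{i=1}^{D}\frac{\partial f_i}{\partial x_i}(x),$$

which can be called the trapezoidal scheme because the integral is evaluated with the drift terms at both ends of each interval. The transition probability leads to the prior probability $P(x \mid x_0)$ of a path $x = \{x_n\}_{0 \le n \le N}$ as follows:

$$\begin{aligned}
P(x \mid x_0) &\propto \exp\left(-\delta t\sum_{n=1}^{N}\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2\right)\\
&\rightleftharpoons \exp\left(-\delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac12})\right|^2 + \frac12\,\mathrm{div} f(x_n)\right\}\right),
\end{aligned}$$

where the “$\rightleftharpoons$” sign indicates that, if $\delta t$ is sufficiently small, the expressions on both sides are compatible.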

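On the other hand, based on the argument in the section on the mode estimate, we can also define the probability $P(U_\phi \mid \phi_0)$ for a smooth tube that represents its neighbouring paths, $U_\phi = \{\omega \mid (\forall n)\ |\phi_n - x_n(\omega)| < \epsilon\}$:

$$P(U_\phi \mid \phi_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac12\,\mathrm{div} f(\phi_{n-1})\right\}\right).$$

The scaling argument for a smooth curve in the appendix on the scaling of the terms allows us to use the drift term $f(\phi_{n-\frac12})$ instead:

$$P(U_\phi \mid \phi_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-\frac12})\right|^2 + \frac12\,\mathrm{div} f(\phi_{n-\frac12})\right\}\right).$$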

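The corresponding posterior probabilities are thus given as follows. For a sample path,

$$P_{\rm path}(x \mid y) \propto \exp\left(-J_{\rm path}(x \mid y)\right),$$

$$\begin{aligned}
J_{\rm path}(x \mid y) &\equiv \frac{1}{2\sigma_b^2}|x_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|x_m - y_m|^2 + \delta t\sum_{n=1}^{N}\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2\\
&\rightleftharpoons \frac{1}{2\sigma_b^2}|x_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|x_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac12})\right|^2 + \frac12\,\mathrm{div} f(x_n)\right\},
\end{aligned}$$

and for a smooth tube,

$$P_{\rm tube}(U_\phi \mid y) \propto P(U_\phi \mid \phi_0)\,P(\phi_0)\,P(y \mid U_\phi) \propto \exp\left(-J_{\rm tube}(\phi \mid y)\right),$$

$$\begin{aligned}
J_{\rm tube}(\phi \mid y) &\equiv \frac{1}{2\sigma_b^2}|\phi_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|\phi_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-\frac12})\right|^2 + \frac12\,\mathrm{div} f(\phi_{n-\frac12})\right\}\\
&\rightleftharpoons \frac{1}{2\sigma_b^2}|\phi_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|\phi_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac12\,\mathrm{div} f(\phi_{n-1})\right\}.
\end{aligned}$$

Note that different pairs of time-discretisation schemes of the OM functional, $\frac{1}{2\sigma^2}\left|\frac{dx}{dt} - f(x)\right|^2 + \frac12\,\mathrm{div}(f)$, are nominated for paths and for tubes in the expressions above.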

Method

Four schemes for OM

In the preceding sections, the prior probability has the form $P(x \mid x_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\widetilde{OM}\right)$, where $\widetilde{OM}$ is the OM functional. As a proof of concept for the arguments in these sections, we test all the cases with conceivable combinations of the timing of the drift term $f(x_t)$ and the presence or absence of the divergence term. Including those shown above, as well as those that are potentially incorrect, the possible candidates for the discretisation schemes of the OM functional are as follows, where the symbol $\psi$ represents either $\phi$ for a smooth curve or $x$ for a sample path.

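Euler scheme (E): $\widetilde{OM}_{\rm E} \equiv \dfrac{1}{2\sigma^2}\left|\dfrac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-1})\right|^2$;

Euler scheme with divergence term (ED): $\widetilde{OM}_{\rm ED} \equiv \dfrac{1}{2\sigma^2}\left|\dfrac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-1})\right|^2 + \dfrac12\,\mathrm{div} f(\psi_{n-1})$;

trapezoidal scheme (T): $\widetilde{OM}_{\rm T} \equiv \dfrac{1}{2\sigma^2}\left|\dfrac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-\frac12})\right|^2$;

trapezoidal scheme with divergence term (TD): $\widetilde{OM}_{\rm TD} \equiv \dfrac{1}{2\sigma^2}\left|\dfrac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-\frac12})\right|^2 + \dfrac12\,\mathrm{div} f(\psi_{n-\frac12})$,

where $f(\psi_{n-\frac12}) = \left(f(\psi_n) + f(\psi_{n-1})\right)/2$.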
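The four candidates can be written as one small function. This is a minimal sketch assuming NumPy; the name `om_term` and the string labels are illustrative, and averaging the divergence over the two end points for TD is an assumption made by analogy with the definition of $f(\psi_{n-\frac12})$, since the text does not spell out $\mathrm{div} f(\psi_{n-\frac12})$.

```python
import numpy as np

def om_term(psi_prev, psi_next, f, div_f, dt, sigma, scheme):
    """One-step discretised OM term for scheme in {"E", "ED", "T", "TD"}."""
    rate = (psi_next - psi_prev) / dt
    if scheme in ("E", "ED"):
        drift = f(psi_prev)
        divergence = div_f(psi_prev)
    elif scheme in ("T", "TD"):
        drift = 0.5 * (f(psi_next) + f(psi_prev))
        # assumption: midpoint divergence taken as the end-point average
        divergence = 0.5 * (div_f(psi_next) + div_f(psi_prev))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    value = np.sum((rate - drift) ** 2) / (2.0 * sigma**2)
    if scheme in ("ED", "TD"):
        value += 0.5 * divergence
    return value

# Example with the drift f = tanh used in Example A (one-dimensional state).
def div_tanh(x):
    return np.sum(1.0 / np.cosh(x) ** 2)

state = np.array([0.0])
terms = {s: om_term(state, state, np.tanh, div_tanh, 0.05, 1.0, s)
         for s in ("E", "ED", "T", "TD")}
print(terms)
```

For a path that stays at the origin, the quadratic part vanishes and only the schemes with the divergence term return a non-zero value, which makes the difference between the candidates easy to see.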

Data assimilation algorithms

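By adopting one of the above schemes for the model error term in the cost function, we can apply a data assimilation algorithm, either Markov chain Monte Carlo (MCMC) or four-dimensional variational data assimilation (4D-Var). Among versions of MCMC, we focus on the Metropolis-adjusted Langevin algorithm (MALA). MALA samples the paths $x^{(k)} = \{x_n(\omega_k)\}_{0 \le n \le N}$ according to the distribution $P_{\rm path}$ by iterating

$$x^{(k+1)} = x^{(k)} - \alpha\nabla J_{\rm path} + \sqrt{2\alpha}\,\xi, \qquad \alpha > 0, \qquad \xi \sim N(0,1)^{D(N+1)}, \qquad \nabla J = \left(\frac{\partial J}{\partial x}\right)^T,$$

with the Metropolis rejection step for adjustment, to obtain an ensemble of sample paths according to the posterior probability, while 4D-Var seeks the centre of the most probable tube $\phi = \{\phi_n\}_{0 \le n \le N}$ by iterating

$$\phi^{(k+1)} = \phi^{(k)} - \alpha\nabla J_{\rm tube}, \qquad \alpha > 0.$$

Note that if the OM functional of type $\widetilde{OM}_{\rm ED}$ is used, the gradient is of the form

$$\begin{aligned}
\nabla_{\phi_n} J_{\rm tube} &= \frac{1}{\sigma_b^2}(\phi_0 - x_b)\,\delta_{0,n} + \sum_{m \in M}\frac{1}{\sigma_o^2}(\phi_m - y_m)\,\delta_{m,n}\\
&\quad + \frac{1}{\sigma^2}\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right) \qquad (n > 0)\\
&\quad + \frac{\delta t}{\sigma^2}\left(-\frac{1}{\delta t} - \frac{\partial f}{\partial\phi_n}(\phi_n)\right)^T\left(\frac{\phi_{n+1} - \phi_n}{\delta t} - f(\phi_n)\right) + \frac{\delta t}{2}\frac{\partial}{\partial\phi_n}\mathrm{div} f(\phi_n) \qquad (n < N),
\end{aligned}$$

where $\left(\frac{\partial f}{\partial\phi_n}(\phi_n)\right)^T$ is an adjoint integration starting from the subsequent term, which is typical in gradient calculations in 4D-Var. In comparison, the term $\frac{\partial}{\partial\phi_n}\mathrm{div} f(\phi_n)$ requires the second derivative of $f$, which is not typical in 4D-Var and could be difficult to implement in high-dimensional systems.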
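A single MALA update can be sketched as follows. This is a minimal illustration assuming NumPy, not the implementation in the paper's repository: `mala_step`, `cost` and `grad` are hypothetical names, and the acceptance ratio is the standard Metropolis–Hastings correction for the Langevin proposal, which the text mentions but does not write out. The demonstration target is a one-dimensional standard normal distribution, chosen only because its moments are known.

```python
import numpy as np

def mala_step(x, cost, grad, alpha, rng):
    """One MALA update targeting the density proportional to exp(-cost(x))."""
    noise = rng.standard_normal(x.shape)
    proposal = x - alpha * grad(x) + np.sqrt(2.0 * alpha) * noise

    def log_q(to, frm):
        # log density (up to a constant) of proposing `to` from `frm`
        diff = to - frm + alpha * grad(frm)
        return -np.sum(diff**2) / (4.0 * alpha)

    log_accept = (cost(x) - cost(proposal)
                  + log_q(x, proposal) - log_q(proposal, x))
    if np.log(rng.uniform()) < log_accept:
        return proposal, True
    return x, False

# Demonstration on J(x) = x^2 / 2, i.e. a standard normal target.
rng = np.random.default_rng(0)
x = np.zeros(1)
samples = []
for _ in range(20000):
    x, _ = mala_step(x, lambda z: 0.5 * np.sum(z**2), lambda z: z, 0.5, rng)
    samples.append(x[0])
samples = np.array(samples)
print(samples.mean(), samples.var())
```

In a path-sampling setting, `x` would hold the whole path $\{x_n\}_{0\le n\le N}$ and `cost` would be $J_{\rm path}$ with one of the schemes above.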

To investigate the applicability of the four candidate schemes listed above, we use them in these algorithms.

The results should be checked against “the correct answer”. The reference solution that approximates the correct answer is provided by a particle smoother (PS), which does not involve the explicit computation of the prior probability. When we have observations only at the end of the assimilation window, the PS algorithm is as follows.

Generate samples of initial and model errors, integrate $M$ copies of the model, and use them to obtain a Monte Carlo approximation of the prior distribution:

$$P(x) \simeq \frac{1}{M}\sum_{m=1}^{M}\prod_{n=0}^{N}\delta\left(x_n - \chi_n^{(m)}\right),$$

where $\chi_n^{(m)}$ is the state of member $m$ at time $n$.
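The ensemble-generation step above can be sketched with an Euler–Maruyama integration. This is a minimal sketch assuming NumPy; the function names are illustrative. The second function, which weights each member by the Gaussian likelihood of an end-of-window observation, is the usual importance-weighting step of a particle method and is included here as an assumption, since only the first step of the algorithm is described in the text.

```python
import numpy as np

def sample_prior_paths(f, x_b, sigma_b, sigma, dt, n_steps, n_members, rng):
    """Integrate n_members noisy copies of dx = f(x) dt + sigma dw."""
    dim = len(x_b)
    paths = np.empty((n_members, n_steps + 1, dim))
    paths[:, 0] = x_b + sigma_b * rng.standard_normal((n_members, dim))
    for n in range(1, n_steps + 1):
        noise = sigma * np.sqrt(dt) * rng.standard_normal((n_members, dim))
        paths[:, n] = paths[:, n - 1] + f(paths[:, n - 1]) * dt + noise
    return paths

def end_point_weights(paths, y, sigma_o):
    """Normalised likelihood weights for an observation at the final time
    (assumed weighting step, not taken from the text)."""
    misfit = np.sum((paths[:, -1] - y) ** 2, axis=1)
    log_w = -misfit / (2.0 * sigma_o**2)
    w = np.exp(log_w - log_w.max())  # subtract the maximum for stability
    return w / w.sum()

# Settings of Example A (hyperbolic model), with a small ensemble.
rng = np.random.default_rng(0)
paths = sample_prior_paths(np.tanh, np.array([0.0]), 0.4, 1.0,
                           0.05, 100, 2000, rng)
weights = end_point_weights(paths, np.array([1.5]), 0.4)
weighted_mean_path = np.einsum("m,mnd->nd", weights, paths)
print(weighted_mean_path[-1])
```

The ensemble size here is far smaller than the particle counts reported for the experiments, so the output is only indicative.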

Results

Example A (hyperbolic model)

In our first example, we solve the non-linear smoothing problem for the hyperbolic model, which is a simple problem with a one-dimensional state space but a non-linear drift term. We want to find the probability distribution of the paths described by

$$dx_t = \tanh(x_t)\,dt + dw_t, \qquad x_{t=0} \sim N(0, 0.16),$$

subject to an observation $y$:

$$y \mid x_{t=5} \sim N(x_{t=5}, 0.16), \qquad y = 1.5.$$

The setting follows . In this case, $\mathrm{div} f(x) = 1/\cosh^2(x)$ imposes a penalty for small $x$. The total time duration $T = 5$ is divided into $N = 100$ segments with $\delta t = 5 \times 10^{-2}$.

The figure of path probability densities for the hyperbolic model shows the probability densities of paths normalised on each time slice, $P_{t=n}(\phi) = \int P(U_\phi \mid y)\,d\phi_{t \ne n}$, derived by MCMC and PS. PS is performed with $5.1 \times 10^{10}$ particles. MCMC with E or TD provides the proper distribution, matching that of PS; this is also clear from the expected paths yielded by these experiments, shown in the figure of expected paths. These schemes correspond to the two path-probability expressions given above. The expected path by ED bends towards a larger $x$, which should be caused by an extra penalty for a smaller $x$. The expected path by T bends towards a smaller $x$, which should be caused by the lack of a penalty for a smaller $x$.

The results of 4D-Var, which represent the MAP estimates, are shown in the figure of most probable tubes. ED and TD provide the proper MAP estimate. These schemes correspond to the two tube-probability expressions given above. The paths by E and T bend towards a smaller $\phi$, which should be caused by the lack of a penalty for a smaller $\phi$.

Probability density of paths derived by MCMC and PS for the hyperbolic model. (a) Reference solution by PS, (b) solution by MCMC with scheme E or TD, (c) solution by MCMC with scheme ED, and (d) solution by MCMC with scheme T.

Expected path derived by MCMC (hyperbolic model).

Most probable tube derived by 4D-Var (hyperbolic model).

Example B (Rössler model)

In our second example, we solve the non-linear smoothing problem for the stochastic Rössler model. We want to find the probability distribution of the paths described by

$$dx_1 = (-x_2 - x_3)\,dt + \sigma\,dw_1, \qquad dx_2 = (x_1 + a x_2)\,dt + \sigma\,dw_2, \qquad dx_3 = (b + x_1 x_3 - c x_3)\,dt + \sigma\,dw_3,$$

$$x_{t=0} \sim N(x_b, 0.04 I),$$

subject to an observation $y$:

$$y \mid x_{t=0.4} \sim N(x_{t=0.4}, 0.04 I),$$

where $(a, b, c) = (0.2, 0.2, 6)$, $\sigma = 2$, $x_b = (2.0659834, -0.2977757, 2.0526298)^T$, and $y = (2.5597086, 0.5412736, 0.6110939)^T$. In this case, $\mathrm{div} f(x) = x_1 + a - c$ imposes a penalty for large $x_1$. The total time duration $T = 0.4$ is divided into $N = 800$ segments with $\delta t = 5 \times 10^{-4}$.
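The divergence quoted above follows directly from the drift. As a quick check, the sketch below (assuming NumPy; the function names are illustrative) compares the closed form $x_1 + a - c$ with a central finite-difference estimate of the trace of the Jacobian at the background value $x_b$.

```python
import numpy as np

A, B, C = 0.2, 0.2, 6.0  # parameters (a, b, c) of Example B

def rossler_drift(x):
    x1, x2, x3 = x
    return np.array([-x2 - x3, x1 + A * x2, B + x1 * x3 - C * x3])

def rossler_divergence(x):
    # d(-x2-x3)/dx1 + d(x1+a*x2)/dx2 + d(b+x1*x3-c*x3)/dx3
    return x[0] + A - C

def numerical_divergence(f, x, h=1e-6):
    """Trace of the Jacobian of f by central differences."""
    total = 0.0
    for i in range(len(x)):
        step = np.zeros(len(x))
        step[i] = h
        total += (f(x + step)[i] - f(x - step)[i]) / (2.0 * h)
    return total

x_b = np.array([2.0659834, -0.2977757, 2.0526298])
print(rossler_divergence(x_b), numerical_divergence(rossler_drift, x_b))
```

At $x_b$ the closed form gives $2.0659834 + 0.2 - 6 = -3.7340166$.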

The results by MCMC and 4D-Var for the Rössler model are shown in the two figures for this example. The state variable $x_1$ is chosen for the vertical axes. PS is performed with $3 \times 10^{12}$ particles. The curve for PS in the figure indicates $\hat\phi = \arg\max_\phi P(U_\phi \mid y)$, where $U_\phi$ represents the tube centred at $\phi$ with radius 0.03.

The MCMC figure shows that, just as for the hyperbolic model, E and TD provide the proper expected path. The 4D-Var figure shows that ED and TD provide the proper MAP estimate.

Expected path derived by MCMC (Rössler model).

Most probable tube derived by 4D-Var (Rössler model).

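Applicable OM schemes.

| Task | Scheme | with div(f) | without div(f) |
| Sampling by MCMC | Euler scheme | | ✓ |
| Sampling by MCMC | trapezoidal scheme | ✓ | |
| MAP estimate by 4D-Var | Euler scheme | ✓ | |
| MAP estimate by 4D-Var | trapezoidal scheme | ✓ | |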

Towards application to large systems

When one computes the cost value $J(x)$, the negative logarithm of the posterior probability, in data assimilation, the value $f(x)$ is explicitly computed by the numerical model while $\mathrm{div} f(x)$ is not. If the dimension $D$ of the state space is large and $f$ is complicated, the algebraic expression of $\mathrm{div} f(x)$ can be difficult to obtain. The gradient of the cost function $\nabla J(x)$ contains the derivative of $f(x)$, which can be implemented as the adjoint model by symbolic differentiation. However, schemes with the divergence term require the calculation of the second derivative of $f(x)$, for which the algebraic expression can be even more difficult to obtain. Still, there may be a way to circumvent this difficulty by utilising Hutchinson's trace estimator (see the appendix on the estimator for the divergence term). It is also clear that the Euler scheme without the divergence term is more convenient for implementing path sampling, because it avoids the cumbersome calculation of the divergence term.

Conclusions

We examined several discretisation schemes of the OM functional, $\frac{1}{2\sigma^2}\left|\frac{dx}{dt} - f(x)\right|^2 + \frac12\,\mathrm{div}(f)$, for the non-linear smoothing problem

$$dx_t = f(x_t)\,dt + \sigma\,dw_t, \qquad x_0 \sim N(x_b, \sigma_b^2 I), \qquad (\forall m \in M)\ y_m \mid x_m \sim N(x_m, \sigma_o^2 I),$$

by matching the answers given by MCMC and 4D-Var with that given by PS, taking the hyperbolic model and the Rössler model as examples. The table of applicable OM schemes lists the discretisation schemes that were found to be applicable, i.e. those expected to converge to the same result as the reference solution. These results are consistent with the literature.

This justifies, for instance, the use of the following cost function for the MAP estimate given by 4D-Var:

$$J = \frac{|\phi_0 - x_b|^2}{2\sigma_b^2} + \sum_{m \in M}\frac{|\phi_m - y_m|^2}{2\sigma_o^2} + \delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac12\,\mathrm{div} f(\phi_{n-1})\right\},$$

where $n$ is the time index, $\delta t$ is the time increment, $x_b$ is the background value, $\sigma_b$ is the standard deviation of the background value, $y$ is the observational data, $\sigma_o$ is the standard deviation of the observational data, and $\sigma$ is the noise intensity. However, the divergence term above should be excluded for the assignment of path probability in MCMC.
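For a one-dimensional state, this cost function and its gradient (the ED gradient given in the section on data assimilation algorithms) fit in a few lines. The sketch below assumes NumPy and uses the hyperbolic model of Example A; the function names, the dictionary `obs` mapping a time index to an observed value, the straight-line first guess, and the step size 0.01 are all illustrative choices rather than settings from the paper.

```python
import numpy as np

def cost_tube_ed(phi, x_b, sigma_b, obs, sigma_o, sigma, dt, f, div_f):
    """4D-Var cost with the Euler scheme plus divergence term (1-D state)."""
    background = (phi[0] - x_b) ** 2 / (2.0 * sigma_b**2)
    observation = sum((phi[m] - y_m) ** 2 / (2.0 * sigma_o**2)
                      for m, y_m in obs.items())
    resid = (phi[1:] - phi[:-1]) / dt - f(phi[:-1])
    model = dt * np.sum(resid**2 / (2.0 * sigma**2) + 0.5 * div_f(phi[:-1]))
    return background + observation + model

def grad_tube_ed(phi, x_b, sigma_b, obs, sigma_o, sigma, dt,
                 f, fprime, div_fprime):
    """Gradient of cost_tube_ed with respect to every phi_n."""
    grad = np.zeros_like(phi)
    grad[0] += (phi[0] - x_b) / sigma_b**2
    for m, y_m in obs.items():
        grad[m] += (phi[m] - y_m) / sigma_o**2
    resid = (phi[1:] - phi[:-1]) / dt - f(phi[:-1])
    grad[1:] += resid / sigma**2                      # terms with n > 0
    grad[:-1] += (dt / sigma**2) * (-1.0 / dt - fprime(phi[:-1])) * resid
    grad[:-1] += 0.5 * dt * div_fprime(phi[:-1])      # terms with n < N
    return grad

def sech2(x):            # tanh'(x), which is also div f for f = tanh
    return 1.0 / np.cosh(x) ** 2

def d_sech2(x):          # derivative of sech^2
    return -2.0 * np.tanh(x) / np.cosh(x) ** 2

n_steps, dt = 100, 0.05
params = dict(x_b=0.0, sigma_b=0.4, obs={n_steps: 1.5},
              sigma_o=0.4, sigma=1.0, dt=dt)
phi = np.linspace(0.0, 1.5, n_steps + 1)   # illustrative first guess
cost_start = cost_tube_ed(phi, f=np.tanh, div_f=sech2, **params)
for _ in range(500):                       # plain gradient descent
    phi = phi - 0.01 * grad_tube_ed(phi, f=np.tanh, fprime=sech2,
                                    div_fprime=d_sech2, **params)
cost_end = cost_tube_ed(phi, f=np.tanh, div_f=sech2, **params)
print(cost_start, cost_end)
```

Dropping the `div_f` and `div_fprime` contributions turns this into scheme E, the form appropriate for assigning path probability in MCMC.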

For application in large systems, the Euler scheme without the divergence term is preferred for path sampling because it does not require cumbersome calculation of the divergence term. In 4D-Var, the divergence term can be incorporated into the cost function by utilising Hutchinson's trace estimator.

The code for data assimilation is available at https://github.com/nozomi-sugiura/OnsagerMachlup/.

Scaling of the terms

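Taylor expansion of the $f(\psi_{n-1})$ term around $\psi_{n-\frac12}$ in scheme E gives

$$\widetilde{OM} \simeq \sum_{n=1}^{N}\delta t\left\{\sigma^{-2}\left(\frac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-\frac12}) - (\psi_n - \psi_{n-1})\frac{\partial f}{\partial x}(\psi_{n-\frac12})\right)^2 + \mathrm{div}(f)\right\} = \sum\delta t\left\{\sigma^{-2}(\text{noise} + \text{shift})^2 + \text{divergence}\right\},$$

$$\text{noise} \equiv \frac{\psi_n - \psi_{n-1}}{\delta t} - f(\psi_{n-\frac12}), \qquad \text{shift} \equiv (\psi_n - \psi_{n-1})\frac{\partial f}{\partial x}(\psi_{n-\frac12}), \qquad \text{divergence} \equiv \mathrm{div}(f),$$

where we assume order-one fluctuations, $\sigma = O(1)$, and the symbol $\psi$ represents either $\phi$ for a smooth curve or $x$ for a sample path. Only the orders of magnitude of the terms matter in what follows.

For a sample path of the stochastic process, the scaling is $\psi_n - \psi_{n-1} = O(\delta t^{\frac12})$, which leads to

$$\widetilde{OM} = \sum\delta t\Big\{\sigma^{-2}\big(\underbrace{\text{noise}^2}_{\delta t^{-1}} + \underbrace{\text{noise}\times\text{shift}}_{1} + \underbrace{\text{shift}^2}_{\delta t}\big) + \underbrace{\text{divergence}}_{1}\Big\}.$$

The shift term induces a Jacobian that coincides with the divergence term in TD.

In the case of a smooth curve, there is no stochastic term, and thus $\psi_n - \psi_{n-1}$ is the product of a bounded function $f(\psi_{n-1})$ and $\delta t$, which results in a value of $O(\delta t)$. This leads to

$$\widetilde{OM} = \sum\delta t\Big\{\sigma^{-2}\big(\underbrace{\text{noise}^2}_{1} + \underbrace{\text{noise}\times\text{shift}}_{\delta t} + \underbrace{\text{shift}^2}_{\delta t^2}\big) + \underbrace{\text{divergence}}_{1}\Big\}.$$

The shift term is negligible, but the divergence term is not.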

Divergence term

Divergence term in a trapezoidal scheme

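Consider two stochastic processes (cf. Sect. 6.3.2 of ):

$$dx_t = f(x_t)\,dt + dw_t, \qquad x(0) = x_0,$$
$$dx_t = dw_t, \qquad x(0) = x_0,$$

where the first process has measure $\mu$ and the second has measure $\mu_0$ (Wiener measure). By the Girsanov theorem, the Radon–Nikodym derivative of $\mu$ with respect to $\mu_0$ is

$$\frac{d\mu}{d\mu_0} = \exp\left(-\int_0^T\left\{\frac12|f(x)|^2\,dt - f(x)\cdot dx\right\}\right).$$

If we define $F(x_T) - F(x_0) = \int_{x_0}^{x_T} f(x)\circ dx$ with the Stratonovich integral, then by Ito's formula,

$$dF = f\cdot dx + \frac12\,\mathrm{div}(f)\,dt.$$

Eliminating $f\cdot dx$ from the Radon–Nikodym derivative using Ito's formula, we obtain

$$\frac{d\mu}{d\mu_0} = \exp\left(-\int_0^T\frac12|f(x)|^2\,dt + F(x_T) - F(x_0) - \frac12\int_0^T\mathrm{div}(f)\,dt\right).$$

Substituting $F(x_T) - F(x_0) = \int_0^T f\circ\frac{dx}{dt}\,dt$,

$$\frac{d\mu}{d\mu_0} = \exp\left(-\int_0^T\frac12|f(x)|^2\,dt + \int_0^T f\circ\frac{dx}{dt}\,dt - \frac12\int_0^T\mathrm{div}(f)\,dt\right).$$

If we write the Wiener measure formally as $\mu_0(dx) = \exp\left(-\frac12\int_0^T\left|\frac{dx}{dt}\right|^2 dt\right)dx$, we get the following from the first form of the Radon–Nikodym derivative,

$$\mu(dx) = \exp\left(-\int_0^T\frac12\left|\frac{dx}{dt} - f(x)\right|^2 dt\right)dx,$$

and the following from the last form,

$$\mu(dx) = \exp\left(-\int_0^T\frac12\left\{\left|\frac{dx}{dt} - f(x)\right|^2 + \mathrm{div}(f)\right\}dt\right)dx,$$

where the integrals should be interpreted in the Ito sense and in the Stratonovich sense, respectively.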

Divergence term for smooth tube

When weight is assigned to smooth tubes, there should always be a divergence term, for the following reason.

Let $x$ be a diffusion process that follows the stochastic differential equation $dx_t = f(x_t)\,dt + dw_t$, where $w$ is a Wiener process. To investigate paths near a smooth curve $\phi$, let us consider the stochastic process $x_t - \phi(t)$:

$$d(x_t - \phi(t)) = \left(f(x_t - \phi(t) + \phi(t)) - \dot\phi(t)\right)dt + dw_t.$$

This means that if a drift $f$ is applied to the Wiener process, and the reference frame is shifted by $\phi$, the process $x_t - \phi(t)$, which has the drift $f(\cdot + \phi) - \dot\phi$, is obtained. The weight relative to the Wiener measure can be calculated by Girsanov's formula as follows:

$$I_\epsilon(\phi) \equiv \frac{P(\|x - \phi\|_T < \epsilon)}{P(\|w\|_T < \epsilon)} = E\left[\exp\left(\int_0^T\left(f(w_t + \phi(t)) - \dot\phi(t)\right)\cdot dw_t - \frac12\int_0^T\left|f(w_t + \phi(t)) - \dot\phi(t)\right|^2 dt\right) \,\middle|\, \|w\|_T < \epsilon\right],$$

where the expectation is taken with respect to the Wiener process $w$ conditioned on $\|w\|_T \equiv \sup_{0<t<T}|w_t| < \epsilon$. We now evaluate the terms containing $w_t$ in the exponent on the right-hand side.

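1. If we assume $\phi$ is a twice continuously differentiable function, then by applying Ito's product rule to $\dot\phi(t)\cdot w_t$, and using $(\forall t)\ |w_t| < \epsilon$,

$$\left|\int_0^T\dot\phi(t)\cdot dw_t\right| = \left|\dot\phi(T)\cdot w_T - \int_0^T w_t\cdot\ddot\phi(t)\,dt\right| \le A_1\epsilon,$$

where $A_1$ is a positive constant independent of $\epsilon$.

2. If we assume $f$ is a twice continuously differentiable function, then by using $(\forall t)\ |w_t| < \epsilon$,

$$\left|\int_0^T f(w_t + \phi(t))\cdot\dot\phi(t)\,dt - \int_0^T f(\phi(t))\cdot\dot\phi(t)\,dt\right| \le A_2\epsilon,$$

where $A_2$ is a positive constant independent of $\epsilon$.

3. In a similar manner as in 1,

$$\left|\int_0^T\left|f(w_t + \phi(t))\right|^2 dt - \int_0^T\left|f(\phi(t))\right|^2 dt\right| \le A_3\epsilon,$$

where $A_3$ is a positive constant independent of $\epsilon$.

4. The evaluation of $\int_0^T f(w_t + \phi(t))\cdot dw_t$ is as follows.

a. By applying Taylor's expansion to $f(w_t + \phi(t))$,

$$\int_0^T f(w_t + \phi(t))\cdot dw_t = \int_0^T f(\phi(t))\cdot dw_t + \int_0^T (w_t\cdot\nabla)f(\phi(t))\cdot dw_t + \int_0^T O(w^2)\cdot dw_t.$$

b. By applying Ito's product rule to $w_t\cdot f(\phi(t))$, and using $(\forall t)\ |w_t| < \epsilon$,

$$\int_0^T f(\phi(t))\cdot dw_t = w_T\cdot f(\phi(T)) - \int_0^T\sum_{i,j} w_t^i\,\frac{\partial f_i}{\partial x_j}(\phi(t))\,\dot\phi_j(t)\,dt = O(\epsilon).$$

c. Regarding the second term on the right-hand side of the expansion in a, we see that

$$\begin{aligned}
\int_0^T (w_t\cdot\nabla)f(\phi(t))\cdot dw_t + \frac12\int_0^T\nabla\cdot f(\phi(t))\,dt &= \int_0^T\sum_{i,j}\frac{\partial f_i}{\partial x_j}(\phi(t))\,w_t^j\,dw_t^i + \frac12\int_0^T\sum_{i,j}\delta_{ij}\frac{\partial f_i}{\partial x_j}(\phi(t))\,dt\\
&= \int_0^T\sum_{i,j}\frac{\partial f_i}{\partial x_j}(\phi(t))\left(w_t^j\,dw_t^i + \frac12\delta_{ij}\,dt\right)\\
&= \int_0^T\sum_{i,j}\frac{\partial f_i}{\partial x_j}(\phi(t))\,d\zeta_t^{ji},
\end{aligned}$$

where $\zeta_t^{ji} = \int_0^t w_s^j\circ dw_s^i$ (Stratonovich integral).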

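By applying evaluations (1)–(4) to the expression for $I_\epsilon(\phi)$, we obtain

$$\begin{aligned}
I_\epsilon(\phi) &= \exp\left(-\frac12\int_0^T\left|f(\phi(t)) - \dot\phi(t)\right|^2 dt - \frac12\int_0^T\nabla\cdot f(\phi(t))\,dt\right)\\
&\quad\times E\left[\exp\left(O(\epsilon) + O(\epsilon^2) + \int_0^T\sum_{i,j}\frac{\partial f_i}{\partial x_j}(\phi(t))\,d\zeta_t^{ji} + \int_0^T O(|w|^2)\cdot dw_t\right) \,\middle|\, \|w\|_T < \epsilon\right].
\end{aligned}$$

On pages 450–451 in , it is shown that

$$E\left[\exp\left(c\int_0^T\sum_{i,j}\frac{\partial f_i}{\partial x_j}(\phi(t))\,d\zeta_t^{ji}\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} 1 \quad (\forall c),$$

$$E\left[\exp\left(c\int_0^T O(|w|^2)\cdot dw_t\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} 1 \quad (\forall c),$$

and it is obvious that

$$E\left[\exp\left(c\,O(\epsilon) + c\,O(\epsilon^2)\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} 1 \quad (\forall c).$$

They also showed that if $E\left[\exp(c\,a_j) \mid \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} 1\ (\forall c)$ for $j = 1, 2, \cdots, J$, then

$$E\left[\exp\left(\sum_{j=1}^{J} a_j\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} 1.$$

By applying this to the three limits above, we deduce that

$$I_\epsilon(\phi) \xrightarrow[\epsilon\to0]{} \exp\left(-\frac12\int_0^T\left|f(\phi(t)) - \dot\phi(t)\right|^2 dt - \frac12\int_0^T\nabla\cdot f(\phi(t))\,dt\right).$$

From evaluation (4), we also have that

$$E\left[\exp\left(\int_0^T f(w_t + \phi(t))\cdot dw_t\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} \exp\left(-\frac12\int_0^T\mathrm{div} f(\phi(t))\,dt\right).$$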

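The last limit serves as an evaluation formula for the divergence term along $\phi$ by ensemble calculation if we interpret the expectation as an ensemble average:

$$\ln E\left[\exp\left(\int_0^T f(w_t + \phi(t))\cdot dw_t\right) \,\middle|\, \|w\|_T < \epsilon\right] \xrightarrow[\epsilon\to0]{} -\frac12\int_0^T\mathrm{div} f(\phi(t))\,dt.$$

The ensemble can be generated by using a Wiener process limited to the small area $\|w\|_T < \epsilon$. Taking the derivative of this formula with respect to $\phi_i(t)$, we also obtain the formula for evaluating the derivative of the divergence term along $\phi$:

$$\frac{E\left[\nabla f(\phi + w)\cdot dw\,\exp\left(\int_0^T f(\phi + w)\cdot dw\right) \,\middle|\, \|w\|_T < \epsilon\right]}{E\left[\exp\left(\int_0^T f(\phi + w)\cdot dw\right) \,\middle|\, \|w\|_T < \epsilon\right]} \xrightarrow[\epsilon\to0]{} -\frac12\nabla(\mathrm{div} f)\,dt,$$

where $(\nabla f(\phi + w), dw) = \sum_j\frac{\partial f_j(\phi + w)}{\partial\phi_i}dw^j$ can be calculated using the adjoint model $\nabla f(\phi + w)$. Although these two evaluation formulas illustrate the meaning of the divergence term, they seem too expensive to be used in the 4D-Var iterations.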

Estimator for the divergence term

The cost functions $J_{\rm tube}$ above contain the derivative of the drift term $f(x)$ through the divergence, and thus the gradient of that term contains the second derivative of $f(x)$, whose algebraic form is difficult to obtain in high-dimensional systems. Here, we propose an alternative form using Hutchinson's trace estimator, which approximates the trace of a matrix $A$ through the identity $E[\xi^T A\,\xi] = \mathrm{tr}(A)$, where $\xi$ is a stochastic vector whose components are independent, identically distributed stochastic variables that take the values $\pm1$ with probability 0.5 each.
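The estimator can be tried on a small system before it is embedded in a cost function. The following is a minimal sketch assuming NumPy; `hutchinson_divergence` is an illustrative name, the directional derivative is taken by the finite difference $b^{-1}\left(f(x + b\xi) - f(x)\right)$ with a small $b$, and the Rössler drift of Example B serves only as a test case because its divergence $x_1 + a - c$ is known in closed form.

```python
import numpy as np

def hutchinson_divergence(f, x, b, n_samples, rng):
    """Estimate div f(x) = tr(df/dx) with random +/-1 probe vectors."""
    fx = f(x)
    total = 0.0
    for _ in range(n_samples):
        xi = rng.choice([-1.0, 1.0], size=x.shape)
        # xi^T (df/dx) xi approximated by a one-sided finite difference
        total += xi @ (f(x + b * xi) - fx) / b
    return total / n_samples

A, B, C = 0.2, 0.2, 6.0  # Rossler parameters (a, b, c) of Example B

def rossler_drift(x):
    x1, x2, x3 = x
    return np.array([-x2 - x3, x1 + A * x2, B + x1 * x3 - C * x3])

rng = np.random.default_rng(0)
x_b = np.array([2.0659834, -0.2977757, 2.0526298])
estimate = hutchinson_divergence(rossler_drift, x_b, 1e-3, 4000, rng)
exact = x_b[0] + A - C
print(estimate, exact)
```

Each probe needs only one extra evaluation of $f$, which is what makes the approach attractive when an algebraic expression for the divergence is unavailable; the price is sampling noise that decreases with the number of probes.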

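A realisation of the cost function is given as

$$\hat J_{\rm tube}(\phi \mid y) = \frac{1}{2\sigma_b^2}|\phi_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|\phi_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left\{\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac12\,\xi_{n-1}^T\,b^{-1}\left(f(\phi_{n-1} + b\,\xi_{n-1}) - f(\phi_{n-1})\right)\right\},$$

where $b$ is a small number. Note that $\hat J_{\rm tube}(\phi \mid y)$ is a stochastic variable that satisfies $E\left[\hat J_{\rm tube}(\phi \mid y)\right] = J_{\rm tube}(\phi \mid y)$. If the adjoint of $f$ is at hand, the gradient of the stochastic cost function is given as

$$\begin{aligned}
\nabla_{\phi_n}\hat J_{\rm tube}(\phi \mid y) &= \frac{1}{\sigma_b^2}(\phi_0 - x_b)\,\delta_{0,n} + \sum_{m \in M}\frac{1}{\sigma_o^2}(\phi_m - y_m)\,\delta_{m,n}\\
&\quad + \frac{1}{\sigma^2}\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right) \qquad (n > 0)\\
&\quad + \frac{\delta t}{\sigma^2}\left(-\frac{1}{\delta t} - \frac{\partial f}{\partial\phi_n}(\phi_n)\right)^T\left(\frac{\phi_{n+1} - \phi_n}{\delta t} - f(\phi_n)\right) \qquad (n < N)\\
&\quad + \frac{\delta t}{2}\left\{\left(\frac{\partial f}{\partial\phi_n}(\phi_n + b\,\xi_n)\right)^T b^{-1}\xi_n - \left(\frac{\partial f}{\partial\phi_n}(\phi_n)\right)^T b^{-1}\xi_n\right\} \qquad (n < N).
\end{aligned}$$

Iterations similar to the 4D-Var iteration, $\phi^{(k+1)} = \phi^{(k)} - \alpha\nabla\hat J_{\rm tube}$, will work.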

The author declares that she has no conflict of interest.

Acknowledgements

The author is grateful to the referees for their comments, which helped improve the readability of the paper. This work was partly supported by MEXT KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas JP15H05819. All the numerical simulations were performed on the JAMSTEC SC supercomputer system.

Edited by: Zoltan Toth
Reviewed by: two anonymous referees

References

Apte, A., Hairer, M., Stuart, A. M., and Voss, J.: Sampling the posterior: An approach to non-Gaussian data assimilation, Phys. D, 230, 50–64, doi:10.1016/j.physd.2006.06.009, 2007.

Cotter, S. L., Roberts, G. O., Stuart, A., and White, D.: MCMC methods for functions: modifying old algorithms to make them faster, Stat. Sci., 28, 424–446, 2013.

Daum, F.: Exact finite-dimensional nonlinear filters, IEEE T. Automat. Contr., 31, 616–622, 1986.

Doucet, A., Godsill, S., and Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering, Stat. Comput., 10, 197–208, 2000.

Dutra, D. A., Teixeira, B. O. S., and Aguirre, L. A.: Maximum a posteriori state path estimation: Discretization limits and their interpretation, Automatica, 50, 1360–1368, 2014.

Giering, R. and Kaminski, T.: Recipes for adjoint code construction, ACM Transactions on Mathematical Software (TOMS), 24, 437–474, 1998.

Hutchinson, M. F.: A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines, Commun. Stat. Simulat., 19, 433–450, 1990.

Ikeda, N. and Watanabe, S.: Stochastic Differential Equations and Diffusion Processes, vol. 24 of North-Holland Mathematical Library, chap. VI.9, North-Holland, 1981.

Law, K., Stuart, A., and Zygalakis, K.: Data Assimilation, Springer, 2015.

Malsom, P. J. and Pinski, F. J.: Role of Ito's lemma in sampling pinned diffusion paths in the continuous-time limit, Phys. Rev. E, 94, 042131, doi:10.1103/PhysRevE.94.042131, 2016.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.: Equation of state calculations by fast computing machines, J. Chem. Phys., 21, 1087–1092, 1953.

Onsager, L. and Machlup, S.: Fluctuations and irreversible processes, Phys. Rev., 91, 1505, doi:10.1103/PhysRev.91.1505, 1953.

Roberts, G. O. and Rosenthal, J. S.: Optimal scaling of discrete approximations to Langevin diffusions, J. R. Stat. Soc. B, 60, 255–268, 1998.

Rössler, O.: An equation for continuous chaos, Phys. Lett. A, 57, 397–398, doi:10.1016/0375-9601(76)90101-8, 1976.

Stuart, A. M., Voss, J., and Wilberg, P.: Conditional Path Sampling of SDEs and the Langevin MCMC Method, Commun. Math. Sci., 2, 685–697, 2004.

Trémolet, Y.: Accounting for an imperfect model in 4D-Var, Q. J. Roy. Meteor. Soc., 132, 2483–2504, doi:10.1256/qj.05.224, 2006.

Zeitouni, O.: On the Onsager–Machlup functional of diffusion processes around non C2 curves, Ann. Probab., 17, 1037–1054, 1989.

Zinn-Justin, J.: Quantum Field Theory and Critical Phenomena, chap. 4.6, Oxford University Press, 4th Edn., 2002.

Zupanski, D.: A general weak constraint applicable to operational 4DVAR data assimilation systems, Mon. Weather Rev., 125, 2274–2292, 1997.
