Introduction
In traditional weak-constraint 4D-Var settings, a quadratic cost
function is defined as the negative logarithm of the probability for each
sample path, which is suitable for path sampling. The optimisation problem is naively
described as finding the most probable path by minimising the quadratic cost
function. However, the term “the most probable path” does not make sense in
this context, because the paths are not countable. One should note that the
concern is not about ranking the individual path probabilities, but about
seeking the route with the densest path population. To define the
optimisation problem properly, one should introduce a macroscopic variable
ϕ=ϕ(t) that represents a smooth curve, and introduce a measure that
accounts for how densely the paths are populated in the
ϵ-neighbourhood centred at ϕ, which can be termed “the tube”.
Then the problem is defined as finding the most probable tube ϕ, which
represents the maximum a posteriori (MAP) estimate of the path distribution.
Mathematicians pioneering the theory of stochastic differential equations
(SDEs) have been
aware of this subtle point since the 1980s, and established the proper form
of the cost function as the Onsager–Machlup (OM) functional
for the tube.
The aim of this work is to organise existing knowledge about the OM
functional into a form that can be used to represent model errors in data
assimilation, i.e. numerical evaluation of non-linear smoothing problems.
Throughout this article, we consider non-linear smoothing problems of the
form
$$
dx_t = f(x_t)\,dt + \sigma\,dw_t, \qquad x_0 \sim N(x_b, \sigma_b^2 I), \qquad (\forall m \in M)\;\; y_m \mid x_m \sim N(x_m, \sigma_o^2 I),
$$
where $t$ is time, $x$ is a $D$-dimensional stochastic process, $w$
is a $D$-dimensional Wiener process, $x_b \in \mathbb{R}^D$ is the
background value of the initial condition, $\sigma_b > 0$ is the
standard deviation of the background value, $y_m \in \mathbb{R}^D$ is
observational data at time $t_m$, $x_m = x_{t_m}$, $t_m = m\,\delta t$, $M$ is the
set of observation times, $\sigma_o > 0$ is the standard deviation of
the observational data, and $\sigma > 0$ is the noise intensity. Note that
there is no need to distinguish the Itô integral from the Stratonovich
integral with regard to the discretisation of the SDE, because the noise
intensity is a constant.
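The smoothing problem above can be made concrete with a small simulation. The following is a minimal sketch, assuming a scalar state ($D=1$), a hypothetical drift (here `math.tanh`), and illustrative parameter values; the function names `simulate_path` and `observe` are introduced only for this illustration.

```python
import math
import random

def simulate_path(f, x0, sigma, dt, n_steps, rng):
    """Euler-Maruyama steps: x_n = x_{n-1} + f(x_{n-1}) dt + sigma * sqrt(dt) * N(0, 1)."""
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x + f(x) * dt + noise)
    return path

def observe(path, obs_steps, sigma_o, rng):
    """Draw y_m ~ N(x_m, sigma_o^2) at each observation index m in M."""
    return {m: path[m] + sigma_o * rng.gauss(0.0, 1.0) for m in obs_steps}

rng = random.Random(0)                      # fixed seed, for reproducibility only
x_b, sigma_b = 0.0, 0.4                     # assumed background mean and std
x0 = x_b + sigma_b * rng.gauss(0.0, 1.0)    # x_0 ~ N(x_b, sigma_b^2)
path = simulate_path(math.tanh, x0, sigma=1.0, dt=0.05, n_steps=100, rng=rng)
obs = observe(path, obs_steps=[100], sigma_o=0.4, rng=rng)
```

Because the noise intensity is constant, the same update serves for both the Itô and the Stratonovich interpretation.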
Before moving on to its applications, we review the concept of the OM
functional. To keep the presentation simple, we assume that $D=1$ and $\sigma=1$, and concentrate on the formulation of the prior distribution in the
following two sections.
OM functional for path sampling
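The model equation is discretised with the Euler scheme (with the
drift term at the previous time) as
$$
x_n = x_{n-1} + f(x_{n-1})\,\delta t + \xi_{n-1}, \qquad n = 1, 2, \cdots, N,
$$
where $\delta t$ is the time increment, and each $\xi_{n-1}$ obeys
$N(0,\delta t)$. This discretisation can be considered a
non-linear mapping $F_1: \xi \mapsto x$ from the noise vector
$\xi = (\xi_0, \xi_1, \cdots, \xi_{N-1})^T$ to the state vector
$x = (x_1, x_2, \cdots, x_N)^T$. The inverse of the mapping is linearised as
$$
\begin{pmatrix} \delta\xi_0 \\ \delta\xi_1 \\ \vdots \\ \delta\xi_{N-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0 & 0 \\
-1-\delta t\,f'(x_1) & 1 & & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & -1-\delta t\,f'(x_{N-1}) & 1
\end{pmatrix}
\begin{pmatrix} \delta x_1 \\ \delta x_2 \\ \vdots \\ \delta x_N \end{pmatrix},
$$
where $f'$ is the derivative of $f$,
and the Jacobian is $|DF_1^{-1}| = |d\xi/dx| = 1$.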
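The model is also discretised with the trapezoidal scheme (with the drift term at
the midpoint) as
$$
x_n = x_{n-1} + \frac{f(x_n) + f(x_{n-1})}{2}\,\delta t + \xi_{n-1}, \qquad n = 1, 2, \cdots, N,
$$
which defines a mapping $F_2: \xi \mapsto x$. The inverse of the mapping is
linearised as
$$
\begin{pmatrix} \delta\xi_0 \\ \delta\xi_1 \\ \vdots \\ \delta\xi_{N-1} \end{pmatrix}
=
\begin{pmatrix}
1-\frac{\delta t}{2}f'(x_1) & 0 & \cdots & 0 & 0 \\
-1-\frac{\delta t}{2}f'(x_1) & 1-\frac{\delta t}{2}f'(x_2) & & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & -1-\frac{\delta t}{2}f'(x_{N-1}) & 1-\frac{\delta t}{2}f'(x_N)
\end{pmatrix}
\begin{pmatrix} \delta x_1 \\ \delta x_2 \\ \vdots \\ \delta x_N \end{pmatrix},
$$
whose Jacobian is
$$
|DF_2^{-1}| = \left|\frac{d\xi}{dx}\right| = \prod_{n=1}^{N}\left(1-\frac{\delta t}{2}f'(x_n)\right) \approx \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N} f'(x_n)\right).
$$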
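The two inverse mappings and the trapezoidal Jacobian can be sketched numerically. The code below is an illustration only: the drift `math.tanh`, the test path, and all function names are assumptions made for the example, and the state is scalar.

```python
import math

def noise_from_path_euler(f, x, dt):
    """Invert F1: xi_{n-1} = x_n - x_{n-1} - f(x_{n-1}) dt, with x = [x_0, ..., x_N]."""
    return [x[n] - x[n - 1] - f(x[n - 1]) * dt for n in range(1, len(x))]

def noise_from_path_trapezoidal(f, x, dt):
    """Invert F2: xi_{n-1} = x_n - x_{n-1} - 0.5 (f(x_n) + f(x_{n-1})) dt."""
    return [x[n] - x[n - 1] - 0.5 * (f(x[n]) + f(x[n - 1])) * dt
            for n in range(1, len(x))]

def jacobian_trapezoidal(df, x, dt):
    """Determinant of the lower-triangular d(xi)/d(x): the product of its diagonal."""
    det = 1.0
    for n in range(1, len(x)):
        det *= 1.0 - 0.5 * dt * df(x[n])
    return det

def jacobian_trapezoidal_exp(df, x, dt):
    """Small-dt approximation exp(-(dt/2) sum_n f'(x_n))."""
    return math.exp(-0.5 * dt * sum(df(x[n]) for n in range(1, len(x))))

f = math.tanh                                   # assumed drift
df = lambda s: 1.0 / math.cosh(s) ** 2          # its derivative
dt = 0.05
x = [0.3 * math.sin(0.1 * n) for n in range(101)]   # arbitrary smooth test path
exact = jacobian_trapezoidal(df, x, dt)
approx = jacobian_trapezoidal_exp(df, x, dt)
```

For the Euler mapping the matrix has unit diagonal, so no Jacobian factor is needed; for the trapezoidal mapping the product and its exponential approximation agree increasingly well as $\delta t$ decreases.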
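Generally, we can assign a measure $\mu_0$ to a cylinder set
$\hat\Omega \equiv \hat\Omega_0 \times \hat\Omega_1 \times \cdots \times \hat\Omega_{N-1}$ in the noise space using a density $g$ as follows:
$$
\mu_0(\hat\Omega) = \int_{\hat\Omega_0} d\xi_0 \int_{\hat\Omega_1} d\xi_1 \cdots \int_{\hat\Omega_{N-1}} d\xi_{N-1}\; g(\xi_0, \xi_1, \cdots, \xi_{N-1}) = \int_{\hat\Omega} g(\xi)\,\lambda(d\xi) = \int_{\hat\Omega} \mu_0(d\xi),
$$
where $\lambda$ is the Lebesgue measure on $\mathbb{R}^N$. In our case, we
can see that a small area $d\xi$ in the noise space is equipped with a
measure:
$$
\mu_0(d\xi) = g(\xi)\,\lambda(d\xi), \qquad g(\xi) \equiv \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{1}{2\delta t}\sum_{n=1}^{N}\xi_{n-1}^2\right).
$$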
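Suppose we have a cylinder set $\Omega \equiv \Omega_1 \times \Omega_2 \times \cdots \times \Omega_N$ in the state space, where each $\Omega_n \subset \mathbb{R}^1$ is on time slice $t = n\,\delta t$. Now, the mapping $F_1$ (or
$F_2$) induces a measure through the change of variables from $\xi$ to $x$
with respect to the measure $\mu_0$ as
$$
\mu_i(\Omega) = \int_{\Omega_1} dx_1 \int_{\Omega_2} dx_2 \cdots \int_{\Omega_N} dx_N\; (g \circ F_i^{-1})(x_1, x_2, \cdots, x_N)\,\left|DF_i^{-1}\right| = \int_{\Omega} \mu_i(dx), \qquad i = 1, 2.
$$
In our case, each mapping assigns the following measure to a small area $dx$
in the corresponding state space:
$$
\begin{aligned}
\mu_1(dx) &\equiv g(F_1^{-1}(x))\left|DF_1^{-1}\right|\lambda(dx)
= \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right)^2\right)\lambda(dx), \\
\mu_2(dx) &\equiv g(F_2^{-1}(x))\left|DF_2^{-1}\right|\lambda(dx)
= \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left[\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac{1}{2}})\right)^2 + f'(x_n)\right]\right)\lambda(dx),
\end{aligned}
$$
where $f(x_{n-\frac{1}{2}}) = \frac{f(x_n) + f(x_{n-1})}{2}$.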
Measures μ1 and μ2 represent the occurrence probability of the
noise seen from the state space, and thus can be used for path sampling.
The change-of-measure argument (see the Appendix) or the path
integral argument shows that similar forms
are available for time-continuous and multi-dimensional processes, except
that the term $f'(x_t)$ is generalised to $\mathrm{div} f(x_t)$.
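The negative logarithms of the densities of $\mu_1$ and $\mu_2$ (dropping the common normalisation constant) are the quantities that a path sampler would evaluate. A minimal scalar sketch follows; the function names `cost_euler` and `cost_trapezoidal` and the sample path are assumptions made for illustration.

```python
import math
import random

def cost_euler(f, x, dt):
    """-log density of mu_1 up to a constant: (dt/2) sum ((x_n - x_{n-1})/dt - f(x_{n-1}))^2."""
    return 0.5 * dt * sum(((x[n] - x[n - 1]) / dt - f(x[n - 1])) ** 2
                          for n in range(1, len(x)))

def cost_trapezoidal(f, df, x, dt):
    """-log density of mu_2 up to a constant; the Jacobian contributes the f'(x_n) term."""
    total = 0.0
    for n in range(1, len(x)):
        f_mid = 0.5 * (f(x[n]) + f(x[n - 1]))
        total += ((x[n] - x[n - 1]) / dt - f_mid) ** 2 + df(x[n])
    return 0.5 * dt * total

# Evaluate both costs on one rough sample path generated with the Euler scheme.
rng = random.Random(0)
dt, n_steps = 1.0e-3, 1000
f = math.tanh                                  # assumed drift
df = lambda s: 1.0 / math.cosh(s) ** 2
x = [0.0]
for _ in range(n_steps):
    x.append(x[-1] + f(x[-1]) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0))
c1 = cost_euler(f, x, dt)
c2 = cost_trapezoidal(f, df, x, dt)
```

On a rough path of this kind, the two values are expected to approach each other as $\delta t$ decreases, because the Jacobian term compensates for the change in the drift evaluation point. Without the $f'(x_n)$ term, the trapezoidal cost would be biased.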
OM functional for mode estimate
If we perform path sampling with a sufficient number of paths, in theory we
can find the mean of distribution by averaging the samples, or the mode of
distribution by organising them into a histogram. Still, in some practical
applications, we must efficiently find the mode of distribution by
variational methods; computationally, this approach is much cheaper than path
sampling. For that purpose, we are tempted to use a quadratic cost function
for the minimisation. However, a simple example illustrates why
maximising the path probability $\mu_1$ does not yield the mode of the
distribution. Suppose we have a discrete-time stochastic system in
$\mathbb{R}^1$, starting from $x_0 = 0$, and we move forward two time steps,
$$
x_1 = x_0 + x_0^2\,\delta t + \xi_0 = \xi_0, \qquad x_2 = x_1 + x_1^2\,\delta t + \xi_1 = \xi_0 + \xi_0^2\,\delta t + \xi_1,
$$
where $\xi_0$ and $\xi_1$ obey independent normal distributions
$N(0,\delta t)$. It may be seen as a discrete version of
$dx_t = x_t^2\,dt + dw_t$. It is easy to notice that the mode of the distribution
of $(x_1, x_2)$ is not $(0,0)$ owing to the non-linear term $\xi_0^2\,\delta t$.
On the other hand, according to the path probability,
$$
\mu_1(dx_1 dx_2) \propto \exp\left(-\frac{\delta t}{2}\left[\left(\frac{x_1 - x_0}{\delta t} - x_0^2\right)^2 + \left(\frac{x_2 - x_1}{\delta t} - x_1^2\right)^2\right]\right)\lambda(dx_1 dx_2),
$$
the best trajectory is $(x_1, x_2) = (0,0)$, which has no noise:
$(\xi_0, \xi_1) = (0,0)$. The path with the highest probability is thus at
$(x_1, x_2) = (0,0)$, but this is not the route where the paths are most
concentrated.
Motivated by this example, we shall investigate a proper strategy to find the
route that maximises the density of paths. In this regard, we ask how densely
the paths populate in the small neighbourhood of a curve ϕ=ϕ(t) in
the state space.
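Assuming that $f$ and $\phi$ are twice continuously differentiable, we
evaluate the density of paths in the $\epsilon$-neighbourhoods around a curve
$\phi$ connecting points $\{\phi_n,\ n = 1, 2, \cdots, N\}$ with the following
integral, where the change of variables $v_n = x_n - \phi_n$ is used:
$$
\begin{aligned}
I_{\epsilon,\delta t}(\phi)
&= \int_{\phi_1-\epsilon}^{\phi_1+\epsilon} dx_1 \int_{\phi_2-\epsilon}^{\phi_2+\epsilon} dx_2 \cdots \int_{\phi_N-\epsilon}^{\phi_N+\epsilon} dx_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right)^2\right) \\
&= \int_{-\epsilon}^{\epsilon} dv_1 \int_{-\epsilon}^{\epsilon} dv_2 \cdots \int_{-\epsilon}^{\epsilon} dv_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t} + \frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2\right) \\
&= \int_{-\epsilon}^{\epsilon} dv_1 \int_{-\epsilon}^{\epsilon} dv_2 \cdots \int_{-\epsilon}^{\epsilon} dv_N\; \frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t}\right)^2\right) \\
&\quad \times \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left[\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2 + 2\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)\frac{v_n - v_{n-1}}{\delta t}\right]\right).
\end{aligned}
$$
By regarding $v_n$ in this integral as being generated according to
the probability density $\frac{1}{(2\pi\delta t)^{N/2}} \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left(\frac{v_n - v_{n-1}}{\delta t}\right)^2\right)$, we can interpret the integration as a weighted ensemble averaging of a
random function up to a numerical constant. The sequence $v_n$ can be set as
a random walk $v_0 = 0$, $v_n = \sum_{k=1}^{n}\xi_k$, where $\xi_k$ are independent
normal random variables obeying $N(0,\delta t)$. For simplicity, we
rather assume that $\xi_k$ takes values $\pm\sqrt{\delta t}$ with probability 0.5
for either one, because Donsker's theorem ensures it has the same
probability law as the former when $\delta t$ is sufficiently small. We
suppose $\sqrt{\delta t} < \epsilon$ so that no step of the random walk escapes
from the $\epsilon$-neighbourhood. Accordingly, the integral is expressed as
the ensemble average with respect to random walks confined in the tube
$[0, N\delta t] \times [-\epsilon, \epsilon]$:
$$
\begin{aligned}
I_{\epsilon,\delta t}(\phi) &\propto E_v\left[\, e^{-J(\phi,v)} \;\middle|\; (\forall n)\ |v_n| < \epsilon \right], \\
J(\phi,v) &\equiv \frac{\delta t}{2}\sum_{n=1}^{N}\left[\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)^2 + 2\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(v_{n-1} + \phi_{n-1})\right)\frac{v_n - v_{n-1}}{\delta t}\right],
\end{aligned}
$$
where $E_v$ denotes the ensemble averaging of the random walks
denoted by $v$, each of which follows the route $(v_0, v_1, \cdots, v_N)$ and
satisfies $|v_n| < \epsilon$ for all $n$.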
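Because $v_{n-1}$ is small, we can apply the expansion
$$
f(v_{n-1} + \phi_{n-1}) = f(\phi_{n-1}) + f'(\phi_{n-1})\,v_{n-1} + O(v^2),
$$
where $f'$ is the derivative of $f$. Let us accept that the following average
containing the higher-order terms $O(v^2)$ converges:
$$
E_v\left[\, e^{\sum_{n=1}^{N} O(v^2)\,(v_n - v_{n-1})} \;\middle|\; (\forall n)\ |v_n| < \epsilon \right] \xrightarrow[\epsilon \to 0]{} 1.
$$
As shown in the Appendix, the remaining terms in the exponent
$-J(\phi,v)$ are at most $O(\epsilon)$, except for the following one:
$$
\begin{aligned}
\sum_{n=1}^{N} f'(\phi_{n-1})\,v_{n-1}\,(v_n - v_{n-1})
&= \sum_{n=1}^{N} f'(\phi_{n-1})\left[\tfrac{1}{2}(v_{n-1} - v_n) + \tfrac{1}{2}(v_{n-1} + v_n)\right](v_n - v_{n-1}) \\
&= \sum_{n=1}^{N} f'(\phi_{n-1})\,\tfrac{1}{2}(v_{n-1} - v_n)(v_n - v_{n-1}) + \sum_{n=1}^{N} f'(\phi_{n-1})\,\tfrac{1}{2}\left(v_n^2 - v_{n-1}^2\right) \\
&= -\frac{1}{2}\sum_{n=1}^{N} f'(\phi_{n-1})\,\xi_n^2 + \frac{1}{2}\sum_{n=1}^{N-1}\left[f'(\phi(t_{n-1})) - f'(\phi(t_{n-1} + \delta t))\right]v_n^2 + \frac{1}{2}f'(\phi_{N-1})\,v_N^2 \\
&= -\frac{\delta t}{2}\sum_{n=1}^{N} f'(\phi_{n-1}) + O(\epsilon^2),
\end{aligned}
$$
where the last step uses $\xi_n = v_n - v_{n-1} = \pm\sqrt{\delta t}$, $f'(\phi(t_{n-1})) - f'(\phi(t_{n-1} + \delta t)) = O(\delta t)$, and $v_n^2 < \epsilon^2$.
Consequently, we obtain the asymptotic expression for the ensemble average
when $\epsilon$ is small and $\delta t < \epsilon^2$:
$$
\begin{aligned}
I_{\epsilon,\delta t}(\phi) &\propto E_v\left[\, \exp\left(-\frac{\delta t}{2}\sum_{n=1}^{N}\left[\left(\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right)^2 + f'(\phi_{n-1})\right] + O(\epsilon) + \sum_{n=1}^{N} O(v^2)\,(v_n - v_{n-1})\right) \;\middle|\; (\forall n)\ |v_n| < \epsilon \right] \\
&\to \exp\left(-\frac{1}{2}\int_0^T \left[\left(\dot\phi(t) - f(\phi(t))\right)^2 + f'(\phi(t))\right] dt\right).
\end{aligned}
$$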
The Appendix shows that a similar form is available for
time-continuous and multi-dimensional processes, except that the term
$f'(\phi(t))$ is generalised to $\mathrm{div} f(\phi(t))$.
Importantly, the control variable for the optimisation has changed from x
to ϕ.
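The summation-by-parts step that isolates the divergence term is an exact algebraic identity for any sequence with $v_0 = 0$, so it can be checked numerically. The sketch below assumes the drift $f = \tanh$ and the curve $\phi(t) = \sin t$ purely for illustration; `cross_term` and `decomposition` are hypothetical helper names. The walk is not confined to a tube here, since the identity itself does not require it.

```python
import math
import random

def cross_term(a, v):
    """sum_{n=1}^{N} a_{n-1} v_{n-1} (v_n - v_{n-1}), with a_n = f'(phi_n)."""
    return sum(a[n - 1] * v[n - 1] * (v[n] - v[n - 1]) for n in range(1, len(v)))

def decomposition(a, v):
    """Three pieces of the summation by parts (valid when v_0 = 0)."""
    n_steps = len(v) - 1
    steps = -0.5 * sum(a[n - 1] * (v[n] - v[n - 1]) ** 2 for n in range(1, n_steps + 1))
    interior = 0.5 * sum((a[n - 1] - a[n]) * v[n] ** 2 for n in range(1, n_steps))
    boundary = 0.5 * a[n_steps - 1] * v[n_steps] ** 2
    return steps, interior, boundary

dt, n_steps = 1.0e-4, 1000
rng = random.Random(1)
# a_n = f'(phi(t_n)) for the assumed f = tanh and phi(t) = sin(t)
a = [1.0 / math.cosh(math.sin(n * dt)) ** 2 for n in range(n_steps)]
v = [0.0]
for _ in range(n_steps):
    v.append(v[-1] + rng.choice((-1.0, 1.0)) * math.sqrt(dt))  # steps of +/- sqrt(dt)
steps, interior, boundary = decomposition(a, v)
```

With steps of $\pm\sqrt{\delta t}$, the first piece equals $-\frac{\delta t}{2}\sum_n f'(\phi_{n-1})$ regardless of the realisation of the walk, which is why the divergence term survives the ensemble average.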
Probabilistic description of data assimilation
Using the OM functional derived in the preceding two sections as a
model error term, we shall develop a probabilistic description of data
assimilation.
Following the derivation in Sect. 2.3 of , we can assign
each path a posterior probability
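$$
P(x|y) \propto P(x)\,P(y|x) = P(x|x_0)\,P(x_0)\,P(y|x) = \prod_{n=1}^{N} P(x_n|x_{n-1})\; P(x_0) \prod_{m \in M} P(y_m|x_m).
$$
According to the model equation, the prior probability for the initial
condition is given as
$$
P(x_0) \propto \exp\left(-\frac{|x_0 - x_b|^2}{2\sigma_b^2}\right),
$$
where $|x_0 - x_b|^2$ represents the squared Euclidean norm
$\sum_{i=1}^{D}(x_0^i - x_b^i)^2$. According to the same model equation,
the likelihood of the state $x_m$, given observation $y_m$, is
$$
P(y_m|x_m) \propto \exp\left(-\frac{|y_m - x_m|^2}{2\sigma_o^2}\right).
$$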
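Based on the argument in the section on path sampling, the model equation has the
transition probability at discrete time steps
$$
P(x_n|x_{n-1}) \propto \exp\left(-\frac{\delta t}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2\right),
$$
called the Euler scheme, which uses the drift $f(x_{n-1})$ at the previous
time step. The same section also shows that this transition probability
has another expression:
$$
P(x_n|x_{n-1}) \propto \exp\left(-\frac{\delta t}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac{1}{2}})\right|^2 - \frac{\delta t}{2}\,\mathrm{div} f(x_n)\right), \qquad f(x_{n-\frac{1}{2}}) \equiv \frac{f(x_n) + f(x_{n-1})}{2}, \qquad \mathrm{div} f(x) \equiv \sum_{i=1}^{D}\frac{\partial f^i}{\partial x^i}(x),
$$
which can be called the trapezoidal scheme because the integral is evaluated
with the drift terms at both ends of each interval. The transition
probability leads to the prior probability $P(x|x_0)$ of a path
$x = \{x_n\}_{0 \le n \le N}$ as follows:
$$
P(x|x_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2\right) \rightleftharpoons \exp\left(-\delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac{1}{2}})\right|^2 + \frac{1}{2}\,\mathrm{div} f(x_n)\right]\right),
$$
where the “$\rightleftharpoons$” sign indicates that, if $\delta t$ is
sufficiently small, the expressions on both sides are compatible.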
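On the other hand, based on the argument in the section on the mode estimate, we can also
define the probability $P(U_\phi|\phi_0)$ for a smooth tube that represents
its neighbouring paths $U_\phi = \{\omega \mid (\forall n)\ |\phi_n - x_n(\omega)| < \epsilon\}$:
$$
P(U_\phi|\phi_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac{1}{2}\,\mathrm{div} f(\phi_{n-1})\right]\right).
$$
The scaling argument for a smooth curve (see the Appendix) allows us to
use the drift term $f(\phi_{n-\frac{1}{2}})$ instead:
$$
P(U_\phi|\phi_0) \propto \exp\left(-\delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-\frac{1}{2}})\right|^2 + \frac{1}{2}\,\mathrm{div} f(\phi_{n-\frac{1}{2}})\right]\right).
$$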
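The corresponding posterior probabilities are thus given as follows:
$$
\begin{aligned}
P_{\mathrm{path}}(x|y) &\propto \exp\left(-J_{\mathrm{path}}(x|y)\right), \\
J_{\mathrm{path}}(x|y) &\equiv \frac{1}{2\sigma_b^2}|x_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|x_m - y_m|^2 + \delta t\sum_{n=1}^{N}\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-1})\right|^2 \\
&\rightleftharpoons \frac{1}{2\sigma_b^2}|x_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|x_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{x_n - x_{n-1}}{\delta t} - f(x_{n-\frac{1}{2}})\right|^2 + \frac{1}{2}\,\mathrm{div} f(x_n)\right]
\end{aligned}
$$
for a sample path, and
$$
\begin{aligned}
P_{\mathrm{tube}}(U_\phi|y) &\propto P(U_\phi|\phi_0)\,P(\phi_0)\,P(y|U_\phi) \propto \exp\left(-J_{\mathrm{tube}}(\phi|y)\right), \\
J_{\mathrm{tube}}(\phi|y) &\equiv \frac{1}{2\sigma_b^2}|\phi_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|\phi_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-\frac{1}{2}})\right|^2 + \frac{1}{2}\,\mathrm{div} f(\phi_{n-\frac{1}{2}})\right] \\
&\rightleftharpoons \frac{1}{2\sigma_b^2}|\phi_0 - x_b|^2 + \sum_{m \in M}\frac{1}{2\sigma_o^2}|\phi_m - y_m|^2 + \delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac{1}{2}\,\mathrm{div} f(\phi_{n-1})\right],
\end{aligned}
$$
for a smooth tube. Note that different pairs of time-discretisation schemes
of the OM functional, $\frac{1}{2\sigma^2}\left|\frac{dx}{dt} - f(x)\right|^2 + \frac{1}{2}\,\mathrm{div}(f)$, are candidates for paths and for tubes in
the four expressions above.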
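The two cost functions differ only in whether the divergence term accompanies the Euler discretisation. The following is a minimal scalar sketch of the Euler forms of $J_{\mathrm{path}}$ and $J_{\mathrm{tube}}$; the helper names, the drift $f=\tanh$, and the parameter values are assumptions made for this illustration.

```python
import math

def model_error(f, z, dt, sigma, div_f=None):
    """dt * sum_n [ |(z_n - z_{n-1})/dt - f(z_{n-1})|^2 / (2 sigma^2) (+ div f(z_{n-1}) / 2) ]."""
    total = 0.0
    for n in range(1, len(z)):
        residual = (z[n] - z[n - 1]) / dt - f(z[n - 1])
        total += residual ** 2 / (2.0 * sigma ** 2)
        if div_f is not None:
            total += 0.5 * div_f(z[n - 1])   # divergence term, used for tubes only
    return dt * total

def background_and_observation(z, x_b, sigma_b, obs, sigma_o):
    """Background term at n = 0 plus observation terms at the indices m in obs."""
    cost = (z[0] - x_b) ** 2 / (2.0 * sigma_b ** 2)
    cost += sum((z[m] - y) ** 2 for m, y in obs.items()) / (2.0 * sigma_o ** 2)
    return cost

def j_path(f, x, dt, sigma, x_b, sigma_b, obs, sigma_o):
    """Euler scheme without divergence: cost assigned to a sample path."""
    return (background_and_observation(x, x_b, sigma_b, obs, sigma_o)
            + model_error(f, x, dt, sigma))

def j_tube(f, div_f, phi, dt, sigma, x_b, sigma_b, obs, sigma_o):
    """Euler scheme with divergence: cost assigned to a smooth tube."""
    return (background_and_observation(phi, x_b, sigma_b, obs, sigma_o)
            + model_error(f, phi, dt, sigma, div_f))

f = math.tanh
div_f = lambda s: 1.0 / math.cosh(s) ** 2
phi = [0.0] * 101                            # a curve resting at the origin
settings = dict(dt=0.05, sigma=1.0, x_b=0.0, sigma_b=0.4, obs={100: 1.5}, sigma_o=0.4)
cost_path = j_path(f, phi, **settings)
cost_tube = j_tube(f, div_f, phi, **settings)
```

For this resting curve the two costs differ by $\frac{\delta t}{2}\sum_n \mathrm{div} f(0) = 2.5$, which is the extra penalty the tube formulation places on regions where nearby paths diverge.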
Results
Example A (hyperbolic model)
In our first example, we solve the non-linear smoothing problem for the
hyperbolic model, which is a simple problem with
one-dimensional state space, but which has a non-linear drift term. We want
to find the probability distribution of the paths described by
$$
dx_t = \tanh(x_t)\,dt + dw_t, \qquad x_{t=0} \sim N(0, 0.16),
$$
subject to an observation $y$:
$$
y \mid x_{t=5} \sim N(x_{t=5}, 0.16), \qquad y = 1.5.
$$
The setting follows . In this case,
$\mathrm{div} f(x) = 1/\cosh^2(x)$ imposes a penalty for small $x$. The
total time duration $T = 5$ is divided into $N = 100$ segments with
$\delta t = 5 \times 10^{-2}$.
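The setting of Example A can be written down directly. The sketch below checks the divergence formula against a central finite difference and tabulates the penalty $\frac{1}{2}\mathrm{div} f(x)$ at a few points; the helper `central_difference` and the chosen sample points are assumptions made for illustration.

```python
import math

def f(x):
    """Drift of the hyperbolic model."""
    return math.tanh(x)

def div_f(x):
    """Divergence of the drift (its derivative, since D = 1)."""
    return 1.0 / math.cosh(x) ** 2

def central_difference(g, x, h=1.0e-5):
    """Numerical derivative, used only to cross-check div_f."""
    return (g(x + h) - g(x - h)) / (2.0 * h)

T, N = 5.0, 100
dt = T / N                       # time increment, 0.05
sigma_b2 = sigma_o2 = 0.16       # background and observation variances
y = 1.5                          # observation at t = 5

# Penalty 0.5 * div f(x) per unit time at a few sample points
penalties = {x: 0.5 * div_f(x) for x in (0.0, 1.0, 2.0)}
fd_error = max(abs(central_difference(f, x) - div_f(x)) for x in penalties)
```

The penalty is largest at $x = 0$ and decays as $|x|$ grows, which is the sense in which the divergence term penalises small $x$.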
Figure shows the probability densities of paths normalised on each
time slice, $P_{t=n}(\phi) = \int P(U_\phi|y)\,d\phi_{t \neq n}$, derived by
MCMC and PS. PS is performed with $5.1 \times 10^{10}$ particles. It is clear
that MCMC with E or TD provides the proper distribution matched with that of
PS; this is also clear from the expected paths yielded by these experiments,
as shown in Fig. . These schemes correspond to candidates in
Eqs. () and (). The expected path by ED bends towards a
larger x, which should be caused by an extra penalty for a larger x. The
expected path by T bends towards a smaller x, which should be caused by the
lack of a penalty for a larger x.
The results of 4D-Var, which represents the MAP estimates, are shown in
Fig. . ED and TD provide the proper MAP estimate. These schemes
correspond to candidates in Eqs. () and (). The
expected paths by E and T bend towards a smaller ϕ, which should be
caused by the lack of a penalty for a larger ϕ.
Probability density of paths derived by MCMC and PS for the hyperbolic model. (a) Reference solution by
PS, (b) solution by MCMC with scheme E or TD, (c) solution by MCMC with scheme ED, and (d) solution by MCMC with
scheme T.
Expected path derived by MCMC (hyperbolic model).
Most probable tube derived by 4D-Var (hyperbolic model).
Example B (Rössler model)
In our second example, we solve the non-linear smoothing problem for the
stochastic Rössler model. We want to find the
probability distribution of the paths described by
$$
\begin{aligned}
dx_1 &= (-x_2 - x_3)\,dt + \sigma\,dw_1, \\
dx_2 &= (x_1 + a x_2)\,dt + \sigma\,dw_2, \\
dx_3 &= (b + x_1 x_3 - c x_3)\,dt + \sigma\,dw_3,
\end{aligned}
\qquad x_{t=0} \sim N(x_b, 0.04 I),
$$
subject to an observation $y$:
$$
y \mid x_{t=0.4} \sim N(x_{t=0.4}, 0.04 I),
$$
where $(a, b, c) = (0.2, 0.2, 6)$, $\sigma = 2$, $x_b = (2.0659834, -0.2977757, 2.0526298)^T$, and $y = (2.5597086, 0.5412736, 0.6110939)^T$. In
this case, $\mathrm{div} f(x) = x_1 + a - c$ imposes a penalty for large $x_1$.
The total time duration $T = 0.4$ is divided into $N = 800$ segments with
$\delta t = 5 \times 10^{-4}$.
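The Rössler drift and its divergence are short enough to verify numerically. In this sketch the function names and the finite-difference check are assumptions introduced for illustration; the parameter values are those of Example B.

```python
def rossler_drift(x, a=0.2, b=0.2, c=6.0):
    """Drift f(x) of the Rössler model for x = (x1, x2, x3)."""
    x1, x2, x3 = x
    return [-x2 - x3, x1 + a * x2, b + x1 * x3 - c * x3]

def rossler_divergence(x, a=0.2, c=6.0):
    """Analytic divergence: 0 + a + (x1 - c) = x1 + a - c."""
    return x[0] + a - c

def numerical_divergence(drift, x, h=1.0e-6):
    """Sum of central differences of the i-th drift component along the i-th axis."""
    total = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        total += (drift(xp)[i] - drift(xm)[i]) / (2.0 * h)
    return total

x_b = [2.0659834, -0.2977757, 2.0526298]   # background value from the text
T, N = 0.4, 800
dt = T / N                                  # 5e-4
analytic = rossler_divergence(x_b)
numeric = numerical_divergence(rossler_drift, x_b)
```

Since the divergence depends on the state only through $x_1$, the tube cost penalises curves that linger at large $x_1$.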
The results by MCMC and 4D-Var for the Rössler model are shown in
Figs. and , respectively. The state variable x1
is chosen for the vertical axes. PS is performed with $3 \times 10^{12}$
particles. The curve for PS in the most-probable-tube figure indicates $\hat\phi = \arg\max_\phi P(U_\phi|y)$, where $U_\phi$ represents the tube centred at
$\phi$ with radius 0.03.
Figure shows that, just as for the hyperbolic model, E and TD
provide the proper expected path. Figure shows that ED and TD
provide the proper MAP estimate.
Expected path derived by MCMC (Rössler model).
Most probable tube derived by 4D-Var (Rössler model).
Applicable OM schemes.

                                               with div(f)   without div(f)
Sampling by MCMC         Euler scheme                              ✓
                         trapezoidal scheme         ✓
MAP estimate by 4D-Var   Euler scheme               ✓
                         trapezoidal scheme         ✓
Towards application to large systems
When one computes the cost value J(x), the negative logarithm of the
posterior probability, in data assimilation, the value f(x) is explicitly
computed by the numerical model while divf(x) is not. If the
dimension D of the state space is large, and f is complicated, the
algebraic expression of divf(x) can be difficult to obtain. The
gradient of the cost function $\nabla J(x)$ contains the derivative of
$f(x)$, which can be implemented as the adjoint model by symbolic
differentiation. However, schemes with the
divergence term require the calculation of the second derivative of $f(x)$,
for which the algebraic expression can be even more difficult to obtain.
Still, there may be a way to circumvent this difficulty by utilising
Hutchinson's trace estimator (see the Appendix). It is also clear that the Euler scheme without the divergence
term is more convenient for implementing path sampling, because it does not
require cumbersome calculation of the divergence term.
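Hutchinson's trace estimator approximates $\mathrm{div} f(x) = \mathrm{tr}\,(\partial f/\partial x)$ by averaging $z^T (\partial f/\partial x)\, z$ over random probe vectors $z$ with independent $\pm 1$ entries, so that only Jacobian-vector products are needed. The sketch below is an illustration under stated assumptions, not the implementation used in this work: the Jacobian-vector product is approximated by a central finite difference of the drift (an adjoint or tangent-linear model would play this role in practice), and the Rössler drift serves as a small test case.

```python
import random

def rossler_drift(x, a=0.2, b=0.2, c=6.0):
    """Test drift; its exact divergence is x1 + a - c."""
    x1, x2, x3 = x
    return [-x2 - x3, x1 + a * x2, b + x1 * x3 - c * x3]

def jvp(drift, x, z, h=1.0e-6):
    """Finite-difference Jacobian-vector product (df/dx) z (assumed stand-in for a tangent-linear model)."""
    xp = [xi + h * zi for xi, zi in zip(x, z)]
    xm = [xi - h * zi for xi, zi in zip(x, z)]
    return [(p - m) / (2.0 * h) for p, m in zip(drift(xp), drift(xm))]

def hutchinson_divergence(drift, x, n_probes, rng):
    """Average of z^T (df/dx) z over Rademacher probes z; unbiased for the trace."""
    total = 0.0
    for _ in range(n_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in x]
        total += sum(zi * ji for zi, ji in zip(z, jvp(drift, x, z)))
    return total / n_probes

rng = random.Random(0)
x = [2.0659834, -0.2977757, 2.0526298]
estimate = hutchinson_divergence(rossler_drift, x, n_probes=4000, rng=rng)
exact = x[0] + 0.2 - 6.0
```

The cost per probe is one Jacobian-vector product regardless of the dimension $D$, which is what makes the approach attractive for large systems. The estimate carries sampling noise that decreases with the number of probes.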
Conclusions
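We examined several discretisation schemes of the OM functional,
$\frac{1}{2\sigma^2}\left|\frac{dx}{dt} - f(x)\right|^2 + \frac{1}{2}\,\mathrm{div}(f)$, for the non-linear smoothing problem
$$
dx_t = f(x_t)\,dt + \sigma\,dw_t, \qquad x_0 \sim N(x_b, \sigma_b^2 I), \qquad (\forall m \in M)\;\; y_m \mid x_m \sim N(x_m, \sigma_o^2 I),
$$
by matching the answers given by MCMC and 4D-Var with that given by PS,
taking the hyperbolic model and the Rössler model as examples.
The table of applicable OM schemes lists the discretisation schemes which were found to be
applicable, i.e. those expected to converge to the same result as the
reference solution. These results are consistent with the literature.
This justifies, for instance, the use of the following cost function for the
MAP estimate given by 4D-Var:
$$
J = \frac{|\phi_0 - x_b|^2}{2\sigma_b^2} + \sum_{m \in M}\frac{|\phi_m - y_m|^2}{2\sigma_o^2} + \delta t\sum_{n=1}^{N}\left[\frac{1}{2\sigma^2}\left|\frac{\phi_n - \phi_{n-1}}{\delta t} - f(\phi_{n-1})\right|^2 + \frac{1}{2}\,\mathrm{div} f(\phi_{n-1})\right],
$$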
where n is the time index, δt is the time increment, xb
is the background value, σb is the standard deviation of the
background value, y is the observational data, σo is the
standard deviation of the observational data, and σ is the noise
intensity. However, the divergence term above should be excluded for the
assignment of path probability in MCMC.
For application in large systems, the Euler scheme without the divergence
term is preferred for path sampling because it does not require cumbersome
calculation of the divergence term. In 4D-Var, the divergence term can be
incorporated into the cost function by utilising Hutchinson's trace
estimator.