Recent progress in machine learning has shown how to forecast and, to some extent, learn the dynamics of a model from its output, resorting in particular to neural networks and deep learning techniques. We will show how the same goal can be directly achieved using data assimilation techniques without leveraging machine learning software libraries, with a view to high-dimensional models. The dynamics of a model are learned from its observation, and an ordinary differential equation (ODE) representation of this model is inferred using a recursive nonlinear regression. Because the method is embedded in a Bayesian data assimilation framework, it can learn from partial and noisy observations of a state trajectory of the physical model. Moreover, a space-wise local representation of the ODE system is introduced and is key to coping with high-dimensional models.

It has recently been suggested that neural network architectures could be interpreted as dynamical systems. Reciprocally, we show that our ODE representations are reminiscent of deep learning architectures. Furthermore, numerical analysis considerations of stability shed light on the assets and limitations of the method.

The method is illustrated on several chaotic discrete and continuous models of various dimensions, with or without noisy observations, with the goal of identifying or improving the model dynamics, building a surrogate or reduced model, or producing forecasts solely from observations of the physical model.

Data assimilation aims at estimating the state of a physical system from its observation and a numerical dynamical model for it.
It has been successfully applied to numerical weather and ocean prediction, where it often consists in estimating the initial
conditions for the state trajectory of chaotic geofluids

Model errors can take many forms, and accounting for them depends on the chosen data assimilation scheme.
A first class of solutions relies on parametrising model error by, for instance, transforming the problem into a physical parameter estimation problem

These approaches essentially seek to correct, calibrate, or improve an existing model using observations. Hence, they all primarily make use of data assimilation techniques.

An alternative is to renounce physically based numerical models of the phenomenon of interest and instead to use only observations of that system. Given the huge datasets required, this may seem a far-reaching goal for operational weather and ocean forecasting systems, but recent progress in data-driven methods and convincing applications to geophysical problems of small to intermediate complexity are strong incentives to investigate this bolder approach. Ultimately, the prospect of setting numerical models aside has a strong practical appeal, even though such a perspective may generate intense debates.

For instance, forecasting of a physical system can be done by looking up past situations and patterns using the techniques of analogues, which can be combined with present observations using data assimilation

Data-driven techniques that seek to represent the model in a more explicit manner, and therefore with a greater interpretability, may use specific classes of nonlinear regression
as advocated by

From this point on, the physical system under scrutiny will be called the

Importantly, we will not require any machine learning software tool, since the adjoint of the model resolvent can be derived with modest effort.
As opposed to the contributions mentioned in the previous subsections, we embed the technique in a data assimilation framework.
From a data assimilation standpoint, the technique can be seen as meant to deal with model error (with or without some prior on the model)
and it naturally accommodates partial and noisy observations. Moreover, we will build representations of the dynamics that are invariant under spatial translation (homogeneous) and/or local (i.e. the flow rate of a variable

In Sect.

In Sect.

In Sect.

Our surrogate model is chosen to be represented by an ODE system as described by Eq. (

In the absence of any particular symmetry, we choose this map to list all the monomials up to second order built
on

As a result, the regressors are compactly defined by

Higher-order regressors, as well as regressors of different functional forms, could be included as in
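
As a concrete sketch, the bias, linear, and bilinear monomials for a state vector of dimension n can be enumerated as follows (the helper name `quadratic_regressors` is illustrative; the paper's exact ordering convention is given in the Appendix):

```python
import itertools

import numpy as np

def quadratic_regressors(x):
    """Return all monomials of x up to second order: a constant (bias),
    the n linear terms x_i, and the n(n+1)/2 bilinear terms x_i * x_j
    with i <= j."""
    n = len(x)
    bias = [1.0]
    linear = [x[i] for i in range(n)]
    bilinear = [x[i] * x[j]
                for i, j in itertools.combinations_with_replacement(range(n), 2)]
    return np.array(bias + linear + bilinear)

x = np.array([1.0, 2.0, 3.0])
r = quadratic_regressors(x)
# There are 1 + n + n(n+1)/2 regressors, i.e. 10 for n = 3.
assert r.size == 10
```

The flow rate of each component is then a linear combination of these regressors, which is what makes the inference a (recursive) linear-in-the-coefficients regression.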

At least two useful simplifications for the ODEs could be exploited if the state

First, we use a locality assumption based on the physical locality of the system: all multivariate monomials in the ODEs have variables

Let us take the example of a one-dimensional extended space as those used in Sect.

The row

In Appendix

Note that this locality assumption is hardly restrictive. Indeed, owing to the absence of long-range instantaneous interactions (which are precluded in geophysical fluids),
longer-range correlations between state variables can be generated by small stencils in the definition of

Furthermore, a symmetry hypothesis could optionally be used by assuming translational invariance of the ODEs, called

Let us enumerate its coefficients in the case of the L96 model with

Note that while both constraints, locality and homogeneity, apply to the ODEs, they do not apply to the states per se. For instance, ODEs for discretised homogeneous two-dimensional turbulence satisfy both constraints and yet generate non-uniform flows.
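
A back-of-the-envelope count illustrates the savings from both constraints (a sketch with an assumed stencil convention of half-width L, i.e. 2L + 1 sites; the paper's exact stencil definition may differ):

```python
def n_local_coeffs(L):
    """Number of quadratic-monomial coefficients for one flow-rate
    component when the variables are restricted to a stencil of
    half-width L, i.e. s = 2L + 1 sites: bias + linear + bilinear."""
    s = 2 * L + 1
    return 1 + s + s * (s + 1) // 2

# With homogeneity, this count is shared by every site, so it is
# independent of the model dimension n; without locality and
# homogeneity, the count would grow as O(n^2) per component.
assert n_local_coeffs(1) == 10   # s = 3: 1 + 3 + 6
```

This is why the number of control variables stays fixed as the spatial dimension of the system grows, which is the key to scalability.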

For realistic geofluids, the forcing fields (solar irradiance, bathymetry, boundary conditions, friction, etc.) are heterogeneous, so that the homogeneity assumption should be dropped. Nonetheless, the fluid dynamics part of the model would remain homogeneous. As a result, a hybrid approach could be enforced.

The reference model will be observed at time steps

We define intermediate state vectors in between

Representation of the data assimilation system as a hidden Markov chain model and of the model resolvents

The operator

In the following,

We consider a sequence of observation vectors

With the goal of identifying a model or building a surrogate of the reference one, we are interested in estimating the probability density function (pdf)

In the case where the reference model is fully and directly observed, i.e.

The data assimilation system is represented in Fig.

To efficiently minimise the cost function Eq. (

The computations of the gradients and the required adjoints are developed in Appendix

In this section, we discuss the prior pdf

The goal is either to reconstruct an ODE for the reference model, characterised by the coefficients

The prior pdf

The success of the optimisation of

This says that, in the absence of a strong prior

As for the sensitivity on

This stability issue can be somewhat alleviated by normalising the observations

Moreover, instabilities can be significantly mitigated by replacing the monomials with smoothed or truncated ones:

This latter change in variables is the one implemented for all numerical applications described in Sect.
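
As an illustration, one generic saturating change of variables is a tanh bounding of each state variable before the monomials are formed (this is only one possible choice and is not claimed to be the paper's exact transformation):

```python
import numpy as np

def saturate(x, alpha=10.0):
    """Smoothly truncate x to the interval [-alpha, alpha].
    Near the origin saturate(x) ~ x, so the surrogate dynamics are
    essentially unchanged where the training data live, while runaway
    trajectories can no longer feed back quadratically."""
    return alpha * np.tanh(x / alpha)

x = np.array([0.1, 5.0, 1e6])
y = saturate(x)
assert np.all(np.abs(y) <= 10.0)   # bounded
assert abs(y[0] - 0.1) < 1e-4      # near-identity close to the origin
```

Because the saturation is smooth, the adjoint computations carry through with an extra chain-rule factor, at negligible cost.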

It has recently been advocated that residual deep learning architectures
of neural networks can roughly be interpreted as dynamical systems
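
The analogy is simplest for a single explicit Euler step: the update x ← x + δt f(x) has exactly the x + F(x) form of a residual block, and composing N such steps stacks N "layers". A minimal sketch (illustrative names):

```python
import numpy as np

def euler_step(x, f, dt):
    """One explicit Euler step: the 'residual block' x + F(x)."""
    return x + dt * f(x)

def resolvent(x, f, dt, n_steps):
    """Compose n_steps Euler 'layers', i.e. a crude model resolvent."""
    for _ in range(n_steps):
        x = euler_step(x, f, dt)
    return x

# Linear test problem dx/dt = -x: the 10-step composition should
# approach exp(-1) as the step is refined.
x1 = resolvent(np.array([1.0]), lambda x: -x, 0.1, 10)
assert abs(x1[0] - np.exp(-1.0)) < 0.03
```

Higher-order Runge–Kutta schemes simply make each "layer" a richer composition of evaluations of f, with shared weights across layers.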

By contrast, we have started here from a pure dynamical system standpoint and proposed to use data assimilation techniques.
In order to explore complex model resolvents, applied to each interval

Following this analogy, the analysis step where

Because of our complete control of the backpropagation, we hope for a gain in efficiency. However, our method does not have the flexibility that deep learning derives from established tools. For instance, the addition of extra parameters, adaptive batch normalisation, and dropout are not readily available in our approach without further considerations.

Convolutional layers play the role of localisation in neural architectures. In our approach this role is played by the locality assumption and its stencil prescription. Recall that a tight stencil does not prevent longer-range correlations, which are built up through the integration scheme and its compositions. This is similar to stacking several convolutional layers to capture the multiple scales of the reference model that the neural network is meant to learn.

Finally, we note that, as opposed to most practical deep learning strategies with a huge amount of weights to estimate, we have reduced the number of control variables (i.e.

In this section, we shall consider four low-order chaotic models defined on a physical one-dimensional space, except for L63, which is

The L63 model is defined by the ODEs:
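
For reference, the standard Lorenz-1963 system, with the canonical parameters $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$ (assumed here), reads:

```latex
\frac{\mathrm{d}x}{\mathrm{d}t} = \sigma\,(y - x), \qquad
\frac{\mathrm{d}y}{\mathrm{d}t} = x\,(\rho - z) - y, \qquad
\frac{\mathrm{d}z}{\mathrm{d}t} = x\,y - \beta\,z .
```
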

The Lyapunov time is defined as the inverse of the first Lyapunov exponent, i.e. the typical time over which the error grows by a factor

The L96 model is defined by ODEs over a periodic domain of variables indexed by
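
For reference, the standard Lorenz-1996 flow rate over a periodic domain of $N$ variables, with the canonical forcing $F = 8$ (assumed here), is:

```latex
\frac{\mathrm{d}x_n}{\mathrm{d}t} = (x_{n+1} - x_{n-2})\,x_{n-1} - x_n + F,
\qquad n = 1, \ldots, N \quad (\text{indices taken modulo } N).
```
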

The KS model, as defined by the PDE:
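
For reference, the Kuramoto–Sivashinsky equation in its most common one-dimensional form (the convention assumed here) is:

```latex
\partial_t u + u\,\partial_x u + \partial_x^2 u + \partial_x^4 u = 0 .
```
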

The two-scale Lorenz model

This model is of interest because the variable

The numerical experiments consist of three main steps. First, the truth is generated, i.e. a trajectory of the reference model is computed. The reference model equations are assumed to be unknown, but the trajectory is observed through Eq. (

Next, estimators of the ODE model and state trajectory

Finally, we can make forecasts using the tentative optimal ODE model

The integration time step of the truth (reference model) is

The integration time step of the surrogate model within the training time window

The three steps of the numerical experiments are depicted in Fig.

Schematic of the three steps of the experiments, with the associated time steps (see main text). The beginning of the forecast window may or may not coincide with the end of the training window. The lengths of the segments

In the first couple of experiments, we consider a densely observed

We choose the qualifier

Let us first experiment with the L63 model, using an RK4 integration scheme, with
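
A classical fourth-order Runge–Kutta (RK4) step, of the kind used to build the resolvents in these experiments, can be sketched as follows (a generic implementation with canonical L63 parameters, not the paper's code):

```python
import numpy as np

def rk4_step(f, x, dt):
    """One classical RK4 step for the autonomous ODE dx/dt = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def l63(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Lorenz-1963 flow rate with the canonical parameters (assumed)."""
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

x = np.array([1.0, 1.0, 1.0])
for _ in range(1000):          # integrate 10 time units with dt = 0.01
    x = rk4_step(l63, x, 0.01)
assert np.all(np.isfinite(x))  # the trajectory stays on the attractor
```

In the surrogate model, the flow rate `l63` is replaced by the linear combination of monomial regressors whose coefficients are the control variables.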

A similar experiment is carried out with the L96 model, using an RK4 integration scheme, with

To compute the RMSE as a function of the forecast lead time, we average over
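
The averaging can be sketched as follows (one common convention, with the squared error averaged over independent runs and state variables at each lead time; the paper's exact averaging may differ):

```python
import numpy as np

def rmse_vs_lead_time(truth, forecast):
    """truth, forecast: arrays of shape (n_runs, n_times, n_vars).
    Returns the RMSE at each lead time, averaged over runs and
    state variables."""
    err2 = (forecast - truth) ** 2
    return np.sqrt(err2.mean(axis=(0, 2)))

# Sanity check: a unit offset everywhere yields an RMSE of 1 at all leads.
truth = np.zeros((4, 5, 3))
forecast = np.ones((4, 5, 3))
assert np.allclose(rmse_vs_lead_time(truth, forecast), 1.0)
```
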

In this second couple of experiments, we consider again a densely observed reference model with noiseless observations.
The reference model trajectory is generated by the L96 model (

As opposed to the reference model, in these non-identifiable model experiments, the surrogate model is based on the RK2 scheme, with

Average RMSE of the surrogate model (L96 with an RK2 structure) compared to the reference model (L96 with an RK4 integration scheme) as a function of the forecast lead time (in Lyapunov time unit) for an increasing number of compositions.

Figure

Density plot of the L96 reference and surrogate model trajectories, as well as their difference trajectory, as a function of the forecast lead time (in Lyapunov time unit). The observations are noiseless and dense; the model is not identifiable.

Next, the reference model trajectory is generated by the KS model (

Figure

Density plot of the KS reference and surrogate model trajectories, as well as their difference trajectory, as a function of the forecast lead time (in Lyapunov time unit). The observations are noiseless and dense; the model is not identifiable.

To check whether the PDE of the KS model could be retrieved in spite of the differences in the method of integrations and representations, we have computed a Taylor expansion of all monomials in the surrogate ODE flow rate up to order

Coefficients of the surrogate PDE model (blue) resulting from the expansion of the surrogate ODEs and compared to the reference PDE's coefficients (orange).

We come back to the L96 model, which is densely observed but with noisy observations generated
using independent and identically distributed normal noise. The surrogate model is based on an RK4 scheme

Figure

Average RMSE of the surrogate model (L96 with an RK4 structure) compared to the reference model (L96 with an RK4 integration scheme) as a function of the forecast lead time (in Lyapunov time unit) for a range of observation error standard deviations

Even though, in this configuration, the model is identifiable, the reference value

This is confirmed by Fig.

Gap between the surrogate (L96 with an RK4 structure) and the (identifiable) reference dynamics (L96 with an RK4 integration scheme) as a function of the observation error standard deviation

Using the same setup, we have also reduced the number of observations. The observations of grid point values are regularly spaced and shifted by one grid cell at each observation time step. The initial

If the observations are noiseless, the reference model is easily retrieved to a high precision down to a density of

We would like to point out that in the case of noiseless observations, the performance depends little on the length of the training window, beyond a relatively short length, typically

Figure

L96 is the reference model, which is fully observed without noise: plot of the

Figure

L96 is the reference model, which is fully observed with observation error standard deviation

Density plot of the L05III reference and surrogate model trajectories, as well as their difference trajectory,
as a function of the forecast lead time (in Lyapunov time unit). Panel

In this experiment, we consider the L05III model. With the locality and the homogeneity assumptions, the computational cost typically scales linearly with the size of the system, and we actually consider the

Figure

The emergence of error, i.e. the divergence from the reference, appears as long, dark stripes on the density plot of the difference (near-zero difference values appear white or light-coloured).
We argue that these stripes result from the emergence of sub-scale perturbations that are not properly represented by the surrogate model.
Reciprocally, there are long-lasting stripes of low error not yet impacted by sub-scale perturbations.
As expected, and similarly to the L96 model, the perturbations are transported eastward, as shown by the upward tilt of the stripes in Fig.

We have proposed to infer the dynamics of a reference model from its observation using Bayesian data assimilation, which is a new and original scope for data assimilation. Over a given training time window, the control variables are the state trajectory and the coefficients of an ODE representation for the surrogate model. We have chosen the surrogate model to be the composition of an explicit integration scheme (typically Runge–Kutta) applied to this ODE representation. Time invariance, space homogeneity, and locality of the dynamics can be enforced, making the method suitable for high-dimensional systems. The cost function of the data assimilation problem is minimised using the adjoint of the surrogate resolvent, which is explicitly derived. Analogies between the surrogate resolvent and a deep neural network have been discussed, as well as the impact of stability issues of the reference and surrogate dynamics.

The method has been applied to densely observed systems with noiseless observations and identifiable reference models, yielding a perfect reconstruction close to machine precision (L63 and L96 models). It has also been applied to densely or partially observed, identifiable or non-identifiable models, with or without noise in the observations (L96 and KS models). For moderate noise and sufficiently dense observations, the method is successful in the sense that the forecast is accurate beyond several Lyapunov times. The method has also been used as a way to infer a reduced model for a multi-scale observed system (L05III model). The reduced model was successful in emulating the slow dynamics of the reference model but could not properly account for the impact of the fast unresolved-scale dynamics on the slow ones. A subgrid parametrisation would be required or would have to be inferred.

Two potential obstacles have been left aside on purpose but should later be addressed. First, the model error statistics have not been estimated. This could be achieved using, for instance, an empirical Bayesian analysis built on an ensemble-based stochastic expectation maximisation technique. This is an especially interesting problem since the potential discrepancy between the reference and the surrogate dynamics is in general non-trivial. Second, we have used relatively short training time windows. Numerically efficient training on longer windows will likely require the use of advanced weak-constraint variational optimisation techniques.

In this paper, only autonomous dynamics have been considered. We could at least partially extend the method to non-autonomous systems by keeping a static part for the pure dynamics and considering time-dependent forcing fields. We have not numerically explored non-homogeneous dynamics, but we have shown how to learn from them using non-homogeneous local representations.

A promising yet challenging path would be to consider implicit or semi-implicit schemes following for instance the idea in

If observations keep coming after the training time window, then one can perform data assimilation using the ODE surrogate model of the reference model. This data assimilation scheme could focus on state estimation only, or it could continue to update the ODE surrogate model for the forecast.

No datasets were used in this article.

In this Appendix, we show how to parametrise

For the bias sector, we have

For the linear sector, we have

Finally, for the bilinear sector, we have

We observe that these indices

It will be useful in the following to consider the variation of each

We first consider the situation when the observation interval corresponds to one
integration time step of the surrogate model, i.e.

Second, let us look at the gradient of

We now consider a resolvent which is the composition of

Second, we look at the gradient with respect to

All of these results, Eqs. (B10), (

As an alternative to the explicit computation of the gradients of Eq. (
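
Whichever route is taken, a finite-difference check is a cheap way to validate adjoint-derived gradients (a generic sketch, not tied to the paper's cost function):

```python
import numpy as np

def fd_gradient(cost, p, eps=1e-6):
    """Central finite-difference gradient of a scalar cost function,
    useful to validate an adjoint-derived gradient component-wise."""
    g = np.zeros_like(p)
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = eps
        g[i] = (cost(p + dp) - cost(p - dp)) / (2.0 * eps)
    return g

# Quadratic toy cost J(p) = 0.5 * ||p||^2, whose exact gradient is p.
cost = lambda p: 0.5 * np.dot(p, p)
p = np.array([1.0, -2.0, 0.5])
assert np.allclose(fd_gradient(cost, p), p, atol=1e-6)
```

In practice, one compares such a finite-difference gradient with the adjoint-derived one on a few randomly chosen components of the control vector.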

MB first developed the theory, implemented, ran, and interpreted the numerical experiments, and wrote the original version of the manuscript. All authors discussed the theory, interpreted the results, and edited the manuscript. The four authors approved the manuscript for publication.

The authors declare that they have no conflict of interest.

The authors are grateful to two anonymous reviewers and Olivier Talagrand acting as editor for their comments and suggestions. Marc Bocquet is thankful to Said Ouala and Ronan Fablet for enlightening discussions. CEREA and LOCEAN are members of the Institut Pierre-Simon Laplace (IPSL).

This research has been supported by the Norwegian Research Council (project REDDA (grant no. 250711)).

This paper was edited by Olivier Talagrand and reviewed by two anonymous referees.