Neural networks are able to approximate chaotic dynamical systems when provided with training data that cover all relevant regions of the system's phase space. However, many practical applications diverge from this idealized scenario. Here, we investigate the ability of feed-forward neural networks to (1) learn the behavior of dynamical systems from incomplete training data and (2) learn the influence of an external forcing on the dynamics. Climate science is a real-world example where these questions may be relevant: it is concerned with a non-stationary chaotic system subject to external forcing and whose behavior is known only through comparatively short data series. Our analysis is performed on the Lorenz63 and Lorenz95 models. We show that for the Lorenz63 system, neural networks trained on data covering only part of the system's phase space struggle to make skillful short-term forecasts in the regions excluded from the training. Additionally, when making long series of consecutive forecasts, the networks struggle to reproduce trajectories exploring regions beyond those seen in the training data, except for cases where only small parts are left out during training. We find this is due to the neural network learning a localized mapping for each region of phase space in the training data rather than a global mapping. This manifests itself in that parts of the networks learn only particular parts of the phase space. In contrast, for the Lorenz95 system the networks succeed in generalizing to new parts of the phase space not seen in the training data. We also find that the networks are able to learn the influence of an external forcing, but only when given relatively large ranges of the forcing in the training. These results point to potential limitations of feed-forward neural networks in generalizing a system's behavior given limited initial information. Much attention must therefore be given to designing appropriate train-test splits for real-world applications.

Neural networks are a series of interconnected – potentially nonlinear – functions whose mutual relations are “learned” by the network by training on data. One of their many applications is forecasting the time evolution of dynamical systems. In this context, the neural networks are trained on long time series issued from the dynamical system of interest and can then in principle be used to forecast the system's evolution from new initial conditions. Examples of applications include classical physical systems like the double pendulum

In recent years, neural networks have enjoyed growing attention in climate science. Applications include parameterization schemes in numerical weather prediction and climate models

In this paper, we focus specifically on the widely used feed-forward neural networks and address two open questions related to their use for approximating the dynamics of chaotic systems.

Can neural networks infer system behavior in regions of the phase space not included in the training dataset?

Can neural networks “learn” the influence of an external forcing driving slow changes in the system they are trained on?

Both the questions we raise are of direct relevance to climate applications. Our knowledge of the high-frequency evolution of the climate system issues from comparatively short time series, which only explore a small subset of the possible states of the system. This is particularly true for the ocean, which has much longer characteristic timescales than the atmosphere, and for applications to paleoclimatic variability. Moreover, the accelerating anthropogenic forcing will likely lead to significant changes in the climate's future evolution. The two points we raise are therefore crucial in the context of using neural networks for weather forecasting and for emulating climate models. They could be reformulated in more practical terms, such as whether neural networks have the potential to reproduce unprecedented states of the climate system. Similarly, could they learn the influence of unprecedented greenhouse-gas concentrations on the dynamics of the climate system, given a past record of the system subjected to varying greenhouse-gas levels?

The question of generalization is a central aspect in machine learning and is a well-studied topic for neural networks (e.g.,

The bulk of the literature on the above topics has focused on image recognition and related fields, and the extent to which these results may apply to dynamical systems is unclear. To the authors' knowledge, the generalization properties of neural networks applied to dynamical systems, and specifically to Lorenz systems, are yet to be studied in detail.

Question (1) we framed above relates to whether the network
learns a “global” function mapping the state vector

Even though mathematically equivalent, the latter would imply that different parts of the network are responsible for different regions of the phase space. For some applications this may be irrelevant, as long as the network forecasts work. However, it has major implications for how the network generalizes to regions of the phase space that are not covered in the training data.

Neural networks can tend to overfit – meaning they work very well on the training data but do not generalize and therefore do not work on new data. Therefore, they are usually tested on data not used for the training. Given a dataset, it is not trivial to decide how to split the data into training and test sets. For data without autocorrelation, a random split on a sample-by-sample basis may be suitable. For autocorrelated time series, it is common to split the data into continuous blocks (e.g., using the first 80 % of a time series for training and the last 20 % for evaluation). In a real-world application to the atmosphere, one could train the network on the first years of available observations and then test on the remaining available years (e.g.,

Here, we consider the opposite situation, namely a scenario where the training data cover only part of the system's phase space. We know from the definition of the Lorenz63 and Lorenz95 models that the underlying equations are invariant across the phase space. If the network can truly learn the system's dynamics, and thus successfully approximate the underlying equations, then it should be able to provide useful information concerning the system's behavior in those regions of the phase space not included in the training data. More generally, for a long series of successive forecasts the network should thus be able to reconstruct the full attractor. However, should the network instead learn a set of functions each applicable locally, then one would expect the network to fail in regions not explored during the training. In a climate science context, this would for example be relevant for the ocean. The latter's long characteristic timescales imply that observational datasets may cover only part of the phase space. It is also relevant in forecasting extreme events in the atmosphere.

Question (2) relates to how well a network can learn the influence of a slowly varying variable (the “forcing” in a general sense) on the evolution of the fast-varying variables (the system state). The influence of the slowly varying forcing on the short-term dynamics is potentially very small compared to the typical variability of the systems, making the task of learning simultaneously the dynamics and the influence of the external forcing challenging, even when the forcing is provided as additional input to the network.

The Lorenz63 model

We use

The Lorenz95 model (

For the Lorenz63 model we use fully connected networks with ReLu activation functions in the hidden layers and a linear output layer. The main configuration used in this study was determined via a tuning procedure (Appendix A). It consists of two hidden layers with 128 neurons each. The network takes as input all three Lorenz63 variables and as outputs all three variables one time step later. The training is done with the adam optimizer

For the Lorenz95 model, we use a convolutional network that works on the periodic domain. Convolutional networks have already successfully been used on gridded data from simplified general circulation models in

In most of our experiments, the neural networks are trained by minimizing errors of single-step (and thus short-term) forecasts. Therefore, they may not always reproduce a stable system when making a long series of consecutive forecasts – a known issue when applying neural networks to chaotic systems (e.g.,

This approach is somewhat problematic when training the network on specific regions of the phase space. In principle, we could apply exactly the same procedure to compare the densities of the reconstructed attractor and of the training data. However, for incomplete training data – for example, only one wing of the butterfly – then a perfect reconstruction of the full attractor would fail this test, since the training data include no information beyond the one wing. If the neural network were to learn a “wrong” attractor, namely one that only covers regions close to the wing included in the training, this network would pass the test and be selected, even though it clearly has undesirable characteristics. An alternative approach is to compare the reconstructed attractor with the full attractor. This solves the aforementioned problems, yet is flawed in terms of information availability at time of training. In a real-world setting, we would not know what the full attractor of a complex system – for example our atmosphere – looks like. Nonetheless, in our idealized setting this approach allows us to verify whether the network learns regional or global dynamics. We will hereafter term it the “density-full approach”.

We first verify that our networks can successfully reproduce the Lorenz63 attractor given training data from across the system's phase space. We train 10 networks on a long Lorenz63 simulation (

We next consider the question of training on incomplete data. We take a somewhat drastic approach and we select data that explore only limited regions of the phase space. This selection is done by “cutting out” contiguous regions of the phase space. Since the training is done on data pairs (time steps

Figure

Truncated sets of Lorenz63 training data

Figure

The above suggests that these roughly 20 neurons correspond to the localized mapping part of the network we had speculated about earlier, and deactivating them forces the network to fall back to its global mapping, which we have seen is poor. This test was repeated for different network architectures (different number of hidden layers, and different hidden layer sizes). In all we tested 20 different architectures (smallest: 1 hidden layer with 8 neurons; largest: 8 hidden layers with 128 neurons each). The result for eight of these architectures (ranging from shallow networks with narrow layers to deep networks with wide layers) are shown in Fig.

Boxplots showing the distribution of neuron activations per neuron for hidden layer 1

As Fig.

We next attempt to use neural networks trained on incomplete data to reconstruct the full attractor. We already showed in Sect.

Reconstruction of the Lorenz63 system with neural networks trained on truncated data.

In the Lorenz95 system, which in our setup has a dimensionality of 40, it is harder to define reasonable regions of phase space to be excluded from training than in the three-dimensional Lorenz63 system. A logical step to tackle this problem would be to use a method like principal component analysis to reduce the dimensionality of the system before partitioning its phase space. However, the leading principal component of the Lorenz95 system can only explain 8 % of the variance (not shown), meaning that it is not possible to reduce the system to a small number of principal components while still capturing most of its variance. A different approach is to look at Poincaré sections. These are two-dimensional projections of the phase space spanned by two variables, often used in the analysis of dynamical systems. While this approach seems intuitive, it is problematic in our context. If we define a region of the phase space to leave out of the training (by defining a region spanned by two variables), we can cut out all states of the model run that fall within these regions. However, if there were identical states to these, but shifted one or more grid points, then these states would not be excluded. The symmetry of the system (which also translates to the symmetry in the circular convolutional network architecture used), implies that the network can forecast states excluded from the training data without learning any extrapolation, as long as (near-)identical but shifted states are seen while training. Indeed, due to the circular convolution, original and shifted states are equivalent for the network. Based on these considerations, we use another method to define Poincaré sections of the Lorenz95 system. We first transform the system states to spectral space with a fast Fourier transform (FFT). We then compute the amplitude of each wavenumber (absolute value of the complex wavenumber coefficients), thus removing all information about the position of the waves. We next find the pair of wavenumbers whose amplitudes have least correlation and define a Poincaré section based on these.

Since the Lorenz95 model is very cheap to run, we can also – in analogy to the Lorenz63 experiment – define a phase-space region by setting a certain range for all 40 variables. Due to the low density of data points in such a high-dimensional space, this would exclude only very few points from our standard

To implement the first method we “cut out” squares of the spectral Poincaré section and train a network on the rest of the data. We then use the network to forecast the whole attractor on a test set, and we compare it to the skill of the same network trained on the whole attractor (which has good forecast skill; see Appendix B and Fig. B1). Each training is done 10 times, and the forecast errors averaged over these 10 realizations. The results are shown in Fig.

Networks trained only on part of the Lorenz95 phase space. Short-term forecast errors of the network trained on full training set

For the second method, we remove all training points that lie within the range

As an external “forcing” scenario we consider a gradual linear increase in the

The results are shown in Fig.

Short-term forecast errors for “forcing” experiments with Lorenz63. Networks are trained on a run with

We next consider a variable forcing scenario for the Lorenz95 system. The setup is analogous to the Lorenz63 forcing experiment, but here we change

The results are shown in Fig.

Short-term forecast errors and network attractor reconstructions for forcing experiments with Lorenz95. Networks are trained on a run with linearly increasing

In this study, we explored how well feed-forward neural networks can (1) generalize the behavior of a chaotic dynamical system to its full phase space when trained only on part of said phase space and (2) learn the influence of a slow external forcing on a chaotic dynamical system. Both points are of direct relevance to the application of neural networks in climate science. The climate system is highly chaotic, our observational data likely include only a small portion of the possible states of the system and we are subjecting the system to a slowly varying forcing by emitting large amounts of greenhouse gases. To address these points, we used two highly idealized representations of atmospheric processes, namely the Lorenz63 and Lorenz95 models. We used feed-forward neural network architectures that are shown to work well on these systems when trained on the full phase space and without external forcing.

For the first point we raise, we showed that networks trained on only part of the Lorenz63 attractor are largely unable to reproduce trajectories outside the regions they were trained on. When making short-term forecasts initialized from points in these unknown phase-space regions, the trajectories of the network forecasts point back towards the region included in the training. This makes the forecasts so poor as to be practically useless. Similar issues arise when running a large number of iterated forecasts, so as to reproduce a long trajectory of the system using the neural networks. Again, the network trajectories do not explore regions of the phase space that were not included in the training. The only exceptions are cases where very small regions are excluded from the training data (and determining what the limiting size is of “very small” remains an open question). This implies that using neural networks for emulating climate models, as proposed in

For Lorenz63, we interpret our results as indicating that the neural networks do not learn to approximate the equations underlying the dynamics of the system – which would be akin to a “global mapping” – but rather develop a “regionalized view” of the system, whereby specific neurons contribute to the forecasts in specific regions of the phase space. Thus, when parts of the phase space are left out, the regionalized mapping fails to produce sensible estimates of the system's behavior beyond the regions it has already seen. We confirmed this by inspecting the activations of individual neurons in the trained networks and showed that parts of the network are responsible for specific regions of the phase space. This is similar to findings in the context of image recognition and generation, where different parts of neural networks have been shown to represent different objects/concepts

As a caveat, we note that our experiments, which remove a large contiguous region of phase space from the training data, are more penalizing than what may be expected in a typical climate simulation. It is likely that the regions of the phase space explored by the climate system during the satellite era are more representative of the hypothetical climate attractor than a single wing of the butterfly is for the Lorenz63 system. Indeed, removing a wing is more akin to removing a season from a training set – for example, asking a network to simulate a seasonal cycle without ever being trained on winter data – than having a training set which does not include some rare extreme events – which presumably live in sparsely populated regions of the phase space which need not be contiguous.

An additional challenge in this context that became obvious during the design of our experiments is the choice of criteria to judge successful attractor reconstruction after training. As discussed in the methods section, in order to reconstruct the attractor of a chaotic dynamical system with neural networks, it is not enough to minimize the error of short-term forecasts. Instead, one also needs to judge the trained network on its performance for long series of iterated forecasts, and in particular on whether the resulting trajectories resemble those of the original dynamical system. When the training data only cover part of the phase space, this raises the issue of information availability, as in real-world applications it would not be a valid approach to compare the reconstructed attractor with the full attractor.

All our main experiments were done with feed-forward neural network architectures that forecast the following state of the system. We repeated some of our experiments with networks that forecasted the systems' tendency instead. These were better in producing short-term forecasts in new regions of phase space but had even more trouble in producing stable trajectories outside the training space. While feed-forward architectures are widely used, there are many other architectures available that potentially do not suffer from the issues we found (for example, recurrent architectures, echo-state networks and the related reservoir computers).

To address the second question we raised, we simulated an external forcing on the Lorenz63 and Lorenz95 systems by slowly changing model parameters. We then trained neural networks both with and without the changing model parameters as additional input. Given simulations that span a large enough range of forcing regimes, the networks that use the forcing as input are indeed able to capture at least part of the influence of the forcing, and extrapolate it to some extent to new forcing regimes. The networks again perform better on the Lorenz95 than the Lorenz63 system. This indicates that the idea of emulating climate-change projections with neural networks might not be entirely unrealistic. However, it would be very hard to know beforehand the range of forcing regimes one would need in the training period. Additionally, the networks trained with forcing as an input still perform worse than networks directly trained on the target forcing. Therefore, it may be unwise to apply an architecture that in principle works reasonably on past atmospheric data (like the one proposed by

More generally, our experiments were performed on highly idealized systems, and it is hard to estimate the extent to which they may generalize to more complex systems such as atmospheric general circulation models or even global climate models. Nonetheless,

We hope that this study can provide a starting point for further discussion on the potentials and limitations of neural networks in the context of chaotic dynamical systems. Future studies could expand to more realistic systems (e.g., general circulation atmospheric models) and explore neural network architectures beyond the feed-forward networks used here (e.g., recurrent architectures) and the influence of noisy training data. Additionally, it would be interesting to extend the analysis to the two-level version of the Lorenz95 model, which would allow us to also compare the networks to “truncated” versions of the model. Finally, a more mathematically rigorous approach – as opposed to the empirical approach used here – might shed interesting new light on the topic.

The code used for this study is available in the accompanying Zenodo repository (

The use of neural networks requires a large number of somewhat arbitrary choices to be made before the training of the network even begins. The first step is to select a specific network architecture and choose the so-called hyperparameters. As basic architecture here we chose fully connected layers. Next, we performed an exhaustive grid search over network configurations and hyperparameters. The learning rate was varied from 0.00003 to 0.003, the number of hidden layers from 1 to 10, and the size of the hidden layers from 4 to 128. The activation function was fixed to the rectified linear unit (“ReLu”). A mini-batch size of 32 was used. The training data were normalized to zero mean and unit variance. The tuning was done with a Lorenz63 run with standard parameters, a time step of 0.01 and

For the Lorenz95 model we chose as basic architecture stacked convolution layers, which wrap around the circular domain. The grid search was done over the following parameters: the learning rate was varied from 0.00001 to 0.003; the kernel size of the convolution layers (the “stencil” the convolution operations uses) from 3 to 9; the number of convolution layers from 1 to 9; and the depth of each convolution layer from 32 to 128. Furthermore, both sigmoid and “ReLu” activation functions were tested. A mini-batch size of 32 was used. The training data were normalized to zero mean and unit variance.

The tuning was done with a Lorenz95 run with

The network architecture trained on a time step of 0.1 made the best forecasts over lead times of up to

Evaluation of network architecture for the Lorenz95 system without

Same as Fig. 2d–f but for networks forecasting the tendency instead of the following state. Note the different color scale in

Same as Fig. 3b, e but showing all neurons without specific ordering.

Both authors developed the ideas underlying this study. SS designed the study, implemented the software, performed the analysis and drafted the manuscript. Both authors helped in interpreting the results and improving the manuscript.

The authors declare that they have no conflict of interest.

SS was funded by the Dept. of Meteorology of Stockholm University. The computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at the High Performance Computing Center North (HPC2N) and National Supercomputer Centre (NSC).

This research has been supported by Vetenskapsrådet (grant no. 2016-03724).The article processing charges for this open-access publication were covered by Stockholm University.

This paper was edited by Stefano Pierini and reviewed by three anonymous referees.