In recent years, several climate subsystems have been identified that may undergo a relatively rapid transition compared to the changes in their forcing. Such transitions are rare events in general, and simulating long-enough trajectories in order to gather sufficient data to determine transition statistics would be too expensive. In contrast, rare event algorithms like TAMS (trajectory-adaptive multilevel splitting) encourage the transition while keeping track of the model statistics. However, this algorithm relies on a score function whose choice is crucial to its efficiency. The optimal score function, called the committor function, is in practice very difficult to compute. In this paper, we compare different data-based methods (analog Markov chains, neural networks, reservoir computing, dynamical Galerkin approximation) for estimating the committor from trajectory data. We apply these methods to two models of the Atlantic Ocean circulation featuring very different dynamical behavior. We compare the methods with respect to two measures evaluating how close the estimate is to the true committor, and with respect to computational time. We find that all methods are able to extract information from the data in order to provide a good estimate of the committor. Analog Markov chains provide a very reliable estimate of the true committor in simple models but prove less robust when applied to systems with a more complex phase space. Neural network methods clearly stand out by their relatively low testing time, and their training time scales more favorably with the complexity of the model than that of the other methods. In particular, feedforward neural networks consistently achieve the best performance when trained with enough data, making this method promising for committor estimation in sophisticated climate models.

Global warming may lead to the destabilization of certain subsystems of the climate system, called tipping elements

In the case of the AMOC, the salt-advection feedback is known to cause a bistable regime in conceptual ocean models

The collapse of the AMOC is thought to be a rare event, but because of its high impact, it is important to compute the probability of its occurrence in the 21st century. The theoretical framework of large deviation theory

A good alternative is to use splitting, or cloning, algorithms such as trajectory-adaptive multilevel splitting (TAMS)

The main limitation of this kind of algorithm is that it heavily relies on its score function; a poorly chosen score function may negate its time-saving benefit.
Fortunately, in the case of TAMS, the optimal score function is known: it is the committor function

When it is assumed that the underlying dynamics can be described by an overdamped Langevin equation, the backward Kolmogorov equation may be simplified. The committor can then be parametrized using, for instance, feedforward neural networks

The contribution of the present paper is to compare the capabilities and performance of these different committor estimation methods by applying them to two different conceptual, low-dimensional ocean models. The objective is to assess their strengths and weaknesses and to determine which one is best suited for committor estimation within TAMS for high-dimensional models. The structure of the paper is as follows. In Sect. 2, we briefly present both ocean models for which we estimate the committor. We also explain the methods that will be compared, detail the choices made for their implementation, and outline our comparison protocol. Results on the performance of the different committor estimation methods are presented in Sect. 3. In Sect. 4, we discuss possible ways of optimization and future lines of improvement.

Sketch of the AMOC box model

The AMOC box model used here was presented in

The flows between the boxes are represented by three main quantities.
Firstly, the volume transport

The model is subject to two forced freshwater fluxes: a constant symmetric forcing,

The double-gyre model is a well studied model of the wind-driven ocean circulation in a rectangular
basin of size

Consider two sets

One can directly sample the committor function via a Monte Carlo method. Suppose we have determined
a trajectory
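This brute-force estimate can be sketched as follows; the dynamics (a one-dimensional double-well SDE), the region indicators, and the sample counts below are illustrative stand-ins, not the ocean models of this paper.

```python
import numpy as np

def committor_mc(x0, step, in_A, in_B, n_samples=200, max_steps=10_000, rng=None):
    """Monte Carlo committor estimate at x0: the fraction of trajectories
    started at x0 that reach B before A. Trajectories hitting neither set
    within max_steps are counted as not reaching B. `step`, `in_A`, `in_B`
    are model-specific placeholders, not names from the paper."""
    rng = rng if rng is not None else np.random.default_rng(0)
    hits = 0
    for _ in range(n_samples):
        x = np.array(x0, dtype=float)
        for _ in range(max_steps):
            x = step(x, rng)
            if in_B(x):
                hits += 1
                break
            if in_A(x):
                break
    return hits / n_samples

# Toy stand-in: overdamped dynamics in a double-well potential V(x) = (x^2 - 1)^2.
def step(x, rng, dt=1e-2, eps=0.5):
    drift = -4.0 * x * (x**2 - 1.0)  # -V'(x)
    return x + drift * dt + np.sqrt(2 * eps * dt) * rng.normal(size=x.shape)

in_A = lambda x: x[0] < -1.0  # stands in for the "on" state's zone
in_B = lambda x: x[0] > 1.0   # stands in for the "off" state's zone

q0 = committor_mc([0.0], step, in_A, in_B, n_samples=100)
```

Started at the saddle of a symmetric potential, the estimate is close to 0.5; its accuracy improves as the square root of the number of samples, which is what makes this direct approach expensive.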

The analog method was first proposed in

Let

This set of analogs is thus a subset of the original trajectory:

AMC thus returns an estimate of the committor on every point of the input trajectory. Since all the information only comes from the transition matrix computed from that trajectory, AMC does not require any pre-training (in contrast to the machine learning methods described below) and could in theory be applied directly to any trajectory computed in TAMS. However, restarting the whole process from scratch for each of the hundreds of trajectories simulated during TAMS may be computationally expensive.
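Once a transition matrix between analogs has been built, the committor of the resulting Markov chain solves a linear system with absorbing boundary conditions on the two sets. A minimal sketch on a hypothetical five-state chain (the matrix is illustrative, not taken from the paper):

```python
import numpy as np

# States 0 and 4 play the roles of A and B (absorbing); the committor q
# satisfies q = P q on the interior, with q = 0 on A and q = 1 on B.
P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],   # A: absorbing
    [0.3, 0.4, 0.3, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.3, 0.0],
    [0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 1.0],   # B: absorbing
])
A, B = [0], [4]
interior = [i for i in range(len(P)) if i not in A + B]

# Restricting to the interior: (I - P_II) q_I = P_IB 1.
P_II = P[np.ix_(interior, interior)]
b = P[np.ix_(interior, B)].sum(axis=1)
q = np.zeros(len(P))
q[B] = 1.0
q[interior] = np.linalg.solve(np.eye(len(interior)) - P_II, b)
```

For this symmetric chain the solution is q = (0, 0.25, 0.5, 0.75, 1), increasing monotonically from A to B as a committor should.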

In order to estimate the committor at any other point of the phase space (not belonging to the train trajectory), the AMC method has to be combined with another method. As suggested in

Suppose an estimate

When applying AMC, as explained in

FFNNs are frequently used to perform a classification task: each data sample is labeled (often with a binary label) as belonging to a class, and the FFNN must learn to separate the classes. Here, however, to estimate a probability, an extra layer must be added at the end of the network.
First, all data samples must be labeled in both the train and test set. Following the same convention as in Sect. 2.3, two classes are used: “leading to a state in

The FFNN itself consists of several hidden layers of densely connected neurons, preceded by an input layer and followed by an output layer. Our baseline architecture contains three hidden layers of respectively

The loss function used to train the FFNN is the cross-entropy loss. It is well suited to assess a distance between the true committor and the data-based estimation. Moreover, it is closely related to the measures we are using to evaluate the performance of the different methods (see Sect.
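For binary labels, the cross-entropy loss reduces to the following form; the arrays below are illustrative, not data from the models.

```python
import numpy as np

def cross_entropy(q_hat, y, eps=1e-12):
    """Binary cross-entropy between predicted committor values q_hat
    (probability of reaching B before A) and labels y (1 if the sample's
    trajectory reached B first, else 0)."""
    q = np.clip(q_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

y = np.array([0, 1, 1, 0])
loss_good = cross_entropy(np.array([0.1, 0.9, 0.8, 0.2]), y)  # confident, correct
loss_bad  = cross_entropy(np.array([0.9, 0.1, 0.2, 0.8]), y)  # confident, wrong
```

The loss rewards assigning high probability to the outcome that actually occurred, which is why it is a natural training target for a probability-valued output.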

Unlike the training of AMC, training an FFNN involves randomness (e.g., shuffling of the train dataset). To ensure robustness of the results, we use

The AMC method is easy to optimize, as it involves a single hyperparameter. However, for the FFNN, there are many
more parameters that can be varied. We choose the following setup and hyperparameters:

Each layer of the neural network is initialized according to the

The optimization algorithm is the stochastic gradient descent method.

We use a learning rate scheduler, with the plateau algorithm: if the loss function is not improved for
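A plateau schedule of this kind can be sketched as a minimal stand-alone class (library versions, e.g. PyTorch's ReduceLROnPlateau, add thresholds and cooldown periods; the parameter values here are illustrative):

```python
class PlateauScheduler:
    """Minimal plateau schedule (sketch): if the monitored loss has not
    improved for `patience` consecutive checks, multiply the learning
    rate by `factor`."""
    def __init__(self, lr, patience=5, factor=0.5):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float('inf')
        self.bad = 0

    def step(self, loss):
        if loss < self.best:          # improvement: reset the counter
            self.best, self.bad = loss, 0
        else:                         # stagnation: count, then decay lr
            self.bad += 1
            if self.bad >= self.patience:
                self.lr *= self.factor
                self.bad = 0
        return self.lr

sched = PlateauScheduler(0.1, patience=2, factor=0.5)
for _ in range(3):                    # three checks without improvement
    lr = sched.step(1.0)
```

After two stagnant checks following the initial one, the learning rate is halved to 0.05.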

Each learning lasts

Reservoir computing was first introduced by

A classical reservoir computer consists of three main elements: an input layer matrix

Recently,

Consider a trajectory

Appendix C explains how the feature vector
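The feature construction in this flavor of reservoir computing can be sketched as follows, assuming time-delayed states augmented with their quadratic products and a ridge-regression readout; all names and the toy fitting task are illustrative, not the paper's configuration:

```python
import numpy as np

def feature_vectors(X, k=2):
    """Concatenate k time-delayed copies of the state, then append a
    constant and all unique quadratic products of the linear part."""
    n, d = X.shape
    lin = np.hstack([X[i:n - k + 1 + i] for i in range(k)])       # delays
    quad = np.einsum('ti,tj->tij', lin, lin)
    iu = np.triu_indices(lin.shape[1])
    quad = quad[:, iu[0], iu[1]]                                   # unique products
    return np.hstack([np.ones((lin.shape[0], 1)), lin, quad])

def ridge_readout(F, target, alpha=1e-6):
    """Linear readout by ridge regression: W = (F^T F + a I)^{-1} F^T y."""
    A = F.T @ F + alpha * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ target)

# Toy usage: fit a quadratic function of a 1-D trajectory.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] ** 2)[1:]               # target aligned with the k=2 features
F = feature_vectors(X, k=2)
W = ridge_readout(F, y)
pred = F @ W
```

Because the target is itself one of the quadratic features, the readout recovers it almost exactly; the whole "training" is a single least-squares solve, which is what makes this approach so much cheaper than backpropagation.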

The dynamical Galerkin approximation (DGA) method as implemented here is based
on

The first step is to homogenize the boundary conditions in Eq. (

In practice, the modes are computed from a training set. Then, they are extended on the trajectories where the committor is to be estimated using an approximation formula provided by

We are not only looking for the method that best estimates the committor function but also for the most time-efficient one. We will therefore use several measures to compare them: the logarithm score, the difference score, and the computation time. In this section, we give more details about the first two.

Let

The logarithm score is defined

Let

For better interpretability,

The difference score is simply the squared difference between the estimated committor (called

The major drawback of using the difference score is that it is in general not computable. Indeed, in the general case, we do not know the true committor. In this paper, however, thanks to the low dimensionality of our example models, we can determine the true committor with a Monte Carlo method. We can use this score here in the comparison of the different methods, but in more complex settings we will have to rely entirely on the normalized logarithm score.
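Under the assumption that the difference score is averaged over the test points, it can be computed as below; the committor values are illustrative:

```python
import numpy as np

def difference_score(q_hat, q_true):
    """Mean squared difference between an estimate and the true committor
    (computable here only because the models are low-dimensional enough
    for a Monte Carlo ground truth)."""
    return np.mean((q_hat - q_true) ** 2)

q_true = np.array([0.05, 0.30, 0.50, 0.70, 0.95])
close  = np.array([0.10, 0.25, 0.55, 0.65, 0.90])   # uniformly off by 0.05
far    = np.array([0.50, 0.60, 0.10, 0.30, 0.40])
s_close = difference_score(close, q_true)
s_far   = difference_score(far, q_true)
```

A uniformly close estimate scores 0.0025 here, far lower than the poor one; unlike the logarithm score, this measure needs q_true itself, not just the observed outcomes.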

We will now compare the different methods used to estimate the committor on both ocean models. Evaluating the measures of the committor estimate and the computation time for each method enables us to determine which one performs best and seems the most promising for future applications, on high-dimensional models in particular.
All results presented below were computed on a Mac M1 CPU using Python

As detailed in

The expressions of

We are interested in the probability that the AMOC collapses, so the committor function is here defined as the probability that a trajectory reaches the off state's zone before the on state's zone. Here, the subspace

In the following study of this model, all trajectories will be computed with

To be able to compare different methods for the committor estimation, we need to train them and to test them with different trajectories. We thus need to create a training and a test dataset.
For simplicity and consistency, we set a standard length for all test trajectories of both models:

When it comes to training methods for the committor estimation problem, what really matters is to have reactive trajectories; that is, trajectories going through both on and off states. Consequently, it makes sense to count the length of trajectories in terms of the number of transitions

One of our goals is to estimate the committor function using as little data as possible. It is thus interesting to study how the performance of each method scales with the amount of training data. To do so, we generate several training sets, having an increasing number of transitions

In the case of the AMOC model, generating a trajectory with

The performance of each method when applied to the AMOC model will be presented in the next two sections, but first, we specify some implementation details.

The training of AMC, DGA, and RC does not involve randomness; hence we perform it once only on the entire training set containing

In the case of RC, the parameters

For optimal results, the different methods are applied to different sets of variables, which indicates that they capture different features of the phase space:

Comparison of the performance and computation time of all four methods on the AMOC box model.

The normalized logarithm score for each method is presented in Fig.

AMC is fairly easy to optimize, since it relies on a single parameter: the number of analogs

DGA has an even better score than AMC for

However, the most important feature of the normalized logarithm score of DGA is that it decreases as

The performance of the FFNN (orange) is exactly as expected, since it is well known that machine learning methods perform poorly when trained with too little data. Here, FFNN yields a normalized logarithm score of

The other machine learning technique, RC (purple), also has a low score when trained with insufficient data, and its score increases with the size of the training set. However, for

The difference score for these methods is shown in Fig.

AMC clearly performs better on average than FFNN and RC for

Finally, once again, the score of DGA decreases as

However, the ranking of AMC, FFNN, and RC by these scores may not be very meaningful, since the largest difference between their average scores is smaller than the error bars of FFNN, the narrowest of all.

Figure

We provide below an evaluation of the scaling of the training time of our implementation of each method. We call

For AMC, building the KD tree is at worst
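For reference, the brute-force alternative to the KD tree computes all N distances per query; a sketch (a KD tree, e.g. scipy.spatial.cKDTree, reduces the average query cost in low dimension):

```python
import numpy as np

def analogs_brute(points, query, K):
    """Brute-force K-nearest-analog search: O(N d) work per query, since
    every squared distance is computed; argpartition avoids a full sort."""
    d2 = np.sum((points - query) ** 2, axis=1)
    return np.argpartition(d2, K)[:K]   # indices of the K closest points

rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))        # illustrative 3-D "trajectory"
idx = analogs_brute(pts, pts[42], K=5)
```

Querying with a point of the set itself must return that point among its own analogs (its distance is zero), which is a quick sanity check on any analog-search implementation.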

Training DGA consists in two steps: computing a diffusion map kernel matrix

The training time of FFNN mostly depends on the architecture of the neural network. The only dependence on

For RC, the training time heavily depends on the size of the reservoir, which does not depend on

To test AMC, we need to build a KD tree using the training set, which is about

For the testing of DGA, we need to extend the trained modes

Once again for the FFNN, the testing time only depends on the network's architecture, and the test samples are given in sequential order, hence the testing time scales as

Testing the RC scales as

For each value of

As for RC (purple), it is interesting to see how fast the training of this machine learning method is. The training time is the main drawback of the FFNN because of the weight-updating algorithm run after each batch and the large number of weights. In the case of RC, the training mostly amounts to computing the nonlinear features that are extracted from the data at every time step. This can be achieved in a single NumPy operation for the whole training set, which is thus very well optimized. As a result, RC scales much better with an increasing

DGA is the method whose training time has the worst scaling with

The different methods can be separated into two groups when looking at their testing times (Fig.

Once again, the testing time of DGA scales with the size of the training set and quickly blows up as well. Indeed, the modes used during the test phase are of the size of the training set, which makes the testing phase increasingly expensive. For

In the second group, for

For

Although of lower dimension than the AMOC model, the double-gyre model exhibits a phase space with a more complex structure. Figure

In the following study of this model, all trajectories will be computed with

The generation procedure of the train and test datasets used for studying the double-gyre model is similar to that used for the AMOC model.
The standard length of all trajectories is

In the case of the double-gyre model, generating the training sets is simpler than for the AMOC model, as we do not need to pay attention to different collapsed states. We can just simulate a single very long trajectory containing

Once again, we will compare in the following sections the performance of each method against the size of the training set. AMC, DGA, and RC were applied on the full training sets. We performed

However, for this model, the implementation of AMC is slightly different than for the AMOC model. As is done in

For the double-gyre model, all methods are applied on the full set of variables:

Comparison of the performance and computation time of all four methods on the double-gyre model.

First of all, we can see that the curves of AMC in all plots of Fig.

Here, FFNN performs even more poorly than in the AMOC model when trained with too little data (score of

Once again, the curve of DGA (shown in green) stops at

RC also has a decreasing normalized logarithm score as

The difference score of each method (Fig.

DGA is only the best-performing method for

For

In this model, the fixed amplitude of the noise has a stronger effect on the trajectories than the noise amplitude used in the AMOC box model. As a consequence, trajectories will spend a longer time in the transition zone and explore larger areas of the phase space before reaching an on or off state. The committor along this trajectory will oscillate accordingly, as the trajectory moves closer to the on state or off state. This causes the average logarithm score of the Monte Carlo estimate of the committor to decrease in this model, as observed in Fig.

The computation times for the double-gyre model are shown in Fig.

As was already the case for the AMOC model, the training time of RC scales better with the size of the training set than that of FFNN (Fig.

During the training of AMC, due to the fast-growing size of the matrix

As a comparison, between

For

The testing times are shown in Fig.

As was already the case in the AMOC model, FFNN also has a very short testing time of about

The present work intends to evaluate and compare several existing methods to estimate the committor function from trajectory data. Having a good estimate of the committor function is crucial to ensure maximum efficiency and accuracy of a rare event algorithm such as TAMS. Using such algorithms is a very promising approach for computing the probabilities of rare transitions in complex dynamical systems, such as a potential collapse of the AMOC in high-dimensional ocean-climate models. We compared the analog method (AMC) with a simple feedforward neural network (FFNN), a reservoir computing (RC) method, and a dynamical Galerkin approximation (DGA) scheme. Two models, an AMOC box model and a double-gyre model, were used for their evaluation, where the phase-space dynamics of the double-gyre model is more complex than that of the AMOC model.

Although efficient for the AMOC model, AMC is very slow and less robust in more complex settings such as the double-gyre model. Sampling the phase space indeed becomes difficult when it displays complex structures. This result may be related to what

FFNN proves to be a very robust method that can adapt to complex phase spaces. Its main drawback is the time it takes to train and the amount of data needed to obtain an adequate estimate of the committor. However, once trained, it is a very fast method that also provides the best estimate of the committor. The RC method is the most naive of all, extracting nonlinear features from a trajectory and performing a linear regression on them. This method is strikingly efficient considering how simple it is. When well optimized, its results may compete with those of the FFNN, and it is much faster to train. However, it has a limited learning capacity and reaches a performance plateau, which sets it apart from FFNN when trained with a lot of data.

The DGA method shows a strange behavior that we could not explain: its performance decreases as the size of the training set increases. However, for the lowest value of

We compared these methods using two scores: the normalized logarithm score and the difference score. Although the latter is easier to interpret, it will in general never be computable because it requires knowing the true committor. For more complex models, we will thus have to rely on the normalized logarithm score. We found that they do not provide the exact same information: in particular, they rank the methods differently. However, in general, the improvement in the skill of most methods can be read accordingly in both scores. We only found one exception: RC in the case of the double-gyre model, where the normalized logarithm score wrongly indicates a loss in performance as

By applying rare event algorithms to more sophisticated, high-dimensional models, it is likely that long (and expensive) simulations contain few or even no transitions. This is a major problem because AMC, FFNN, and DGA all rely on reactive trajectories to be trained and then estimate the committor. If there are no transitions in the data, the neural network only sees one class of events, so it can never predict a transition. In addition, it can easily be shown that AMC and DGA fail as well. We may therefore need extra-long, costly simulations to be able to apply these methods, all the more so as AMC and FFNN require a sufficient number of transitions to be trained properly.

In this setting, DGA and RC are promising. Indeed, although DGA needs transitions to be trained, we showed that much less data are required than for any other method in order to obtain good results. Moreover, we do not necessarily need to see complete transitions in the trajectories. Only certain relevant areas in the phase space need to be explored and sampled (although determining which ones precisely is not obvious in general), which can be done by stacking shorter trajectories that do not necessarily transition from one state to another. This approach is the one developed in

Another important question that arises is how the computation time of these methods scales with an increasing number of dimensions. This problem is as important as the scaling of the computation time with the size of the dataset in the (realistic) case of a very high-dimensional model, because high dimensionality often implies that we cannot compute long trajectories.
AMC involves the model dimension during the computation of the analogs (in both the train and test phases). In high dimension, moreover, using a KD tree shows no significant improvement over a brute-force search. After this step, however, AMC only deals with the index assigned to each analog and thus only depends on the size of the dataset.
DGA only consists of multiplying the vectors that contain the modes, which take their values in

The next step of this work would be to combine these data-driven estimates of the committor function with TAMS to actually compute rare event probabilities. However, it requires some extension of the present study. We already mentioned the problem of regimes where long trajectories contain only few to no transitions. Moreover, we may want to compute transition probabilities for different parameters of the model, as is done by

A related approach consists in building a feedback loop between a rare event algorithm and a data-driven committor function estimation method. The estimate of the committor yielded by the latter is used by the former to generate more data in order to improve the committor estimate. This idea has already been implemented by

Another extension of this work would be to consider non-autonomous dynamics. Once again, this extension can be achieved from several viewpoints. Firstly, if the objective is to compute the probability that the AMOC collapses within a certain time frame,

The long-term objective of such a study and extensions would be to apply TAMS and committor function estimation to much more complex models, such as Earth system models of intermediate complexity (EMICs) or even general circulation models (GCMs). Their complexity is nowhere comparable to the models featured in this work, with a dimensionality of the order of at least

The equations of the AMOC model are

Reference constants and parameters of the AMOC model.

The stream function of the double-gyre is written as

Fixed parameters of the double-gyre model.

In this Appendix, we explain how

Secondly,

In the general case, the shape of

Parameters of the new-generation reservoir network for both models.

The basis functions are computed from a transition matrix, itself computed from a reactive trajectory as follows:

Finally,

Parameters of the dynamical Galerkin approximation for both models.

The Python implementation of both models, all methods, and the code producing the result plots can be found at the following address:

All data sets mentioned in the article were created and can be reproduced using the code provided in the “Code availability” section.

All authors conceived the study. VJD carried out the computations, generated all figures, and wrote the first draft of the paper. All authors contributed to the final paper.

The authors declare that they have no conflict of interest.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 956170. The work of Freddy Bouchet was supported by the ANR grant SAMPRACE, project ANR-20-CE01-0008-01.

This research has been supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant (grant no. 956170).

This paper was edited by Stéphane Vannitsem and reviewed by two anonymous referees.