In the geosciences, multi-model ensembles help explore the robustness of a range of results. To obtain a synthetic and improved representation of the studied dynamical system, the models are usually weighted. The simplest method, model democracy, gives equal weight to all models, while more advanced approaches base the weights on agreement with available observations. Here, we focus on determining weights for various versions of an idealized model of the Atlantic Meridional Overturning Circulation. This is done by assessing their performance against synthetic observations (generated from one of the model versions) within a data assimilation framework using the ensemble Kalman filter (EnKF). In contrast to traditional data assimilation, we implement data-driven forecasts using the analog method based on catalogs of short-term trajectories. This approach efficiently emulates the model's dynamics while keeping computational costs low. For each model version, we compute a local performance metric, the contextual model evidence, to compare observations and model forecasts. This metric, based on the innovation likelihood, is sensitive to differences in model dynamics and accounts for forecast and observation uncertainties. Finally, the weights are calculated from both model performance and model co-dependency and then evaluated on averages of long-term simulations. The proposed method performs well in identifying the numerical simulations that best replicate observed short-term variations, and it outperforms benchmark approaches such as model democracy or climatology-based strategies when reconstructing missing distributions. These findings encourage the application of the methodology to more complex datasets, such as climate simulations, in the future.

In the geosciences, several numerical models are usually available to represent the same complex system. For example, the Earth's global climate system is implemented in a range of different numerical models, some of which are gathered within the Coupled Model Intercomparison Project (CMIP;

Nevertheless, the question of how to combine the information produced by the range of models remains. In this context, it is usual to follow a model democracy approach, giving equal weight to all models to enhance representativeness. This is mostly what is done in the 6th Assessment Report (AR6) of the Intergovernmental Panel on Climate Change (

A trade-off between MME-based model democracy and selecting a single model is to weight-average the models. The weights can be derived from model performance, i.e., by evaluating the consistency between model simulations and available observations. This procedure should be adapted to the specificities of the model simulations being processed (

A first class of approaches is based on descriptive statistics. This is the case of “reliability ensemble averaging” (

A second class of approaches falls within a probabilistic Bayesian framework, such as “Bayesian model averaging” (BMA; e.g.,

In this study, we evaluate the performance of competing models based on their ability to reproduce short-term dynamics of the system described by observed sequences. To achieve this, a data-driven approach is adopted within a data assimilation framework (

This study aims to develop a weighting methodology. As a first step, we use numerical simulations of an idealized chaotic deterministic model representing the centennial-to-millennial dynamics of the Atlantic Meridional Overturning Circulation (AMOC). The proposed approach is compared to more classic approaches, such as model democracy, climatology, or a single best model.

The study is organized as follows. Section

This section first describes the framework to measure the ability of a single dynamical model to fit a set of noisy observations. After applying it to an ensemble of competing models, the second part is devoted to the strategies for computing model weights and their application.

Here, we evaluate model performance based on short-term dynamics. For this purpose, initialized model forecasts are crucial to estimate the accuracy of the dynamics with regard to observations. By synchronizing model forecasts with available observations, Bayesian data assimilation (DA) provides a suitable framework for addressing this. DA, e.g., through Kalman filtering strategies, is commonly used to improve the estimation of the latent unknown state of a system by sequentially including the available observations (

In the classic Kalman filter, an assimilation cycle sequentially updates a Gaussian forecast distribution (generated by the model equations) with the information provided by the observations (also Gaussian), when available. The posterior analysis distribution is a more accurate and reliable estimate of the latent state, satisfying robust statistical properties (i.e., it is the best linear unbiased estimator). The forecast for the next time step therefore benefits from this accurate initial condition. The ensemble Kalman filter (EnKF) is preferred when dealing with nonlinear systems (
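As an illustration, a minimal stochastic EnKF analysis step can be sketched as follows. The helper name `enkf_analysis` and the perturbed-observations variant are assumptions for this sketch, not necessarily the formulation used in the study:

```python
import numpy as np

def enkf_analysis(Xf, y, H, R, rng=None):
    """Stochastic EnKF analysis step (perturbed-observations variant).

    Xf : (n, N) forecast ensemble (n state dims, N members)
    y  : (p,)   observation vector
    H  : (p, n) linear observation operator
    R  : (p, p) observation-error covariance
    """
    rng = np.random.default_rng() if rng is None else rng
    _, N = Xf.shape
    p = len(y)
    # Ensemble-estimated forecast mean and covariance
    Xm = Xf.mean(axis=1, keepdims=True)
    A = Xf - Xm
    Pf = A @ A.T / (N - 1)
    # Kalman gain from the innovation covariance S = H Pf H^T + R
    S = H @ Pf @ H.T + R
    K = Pf @ H.T @ np.linalg.inv(S)
    # Update each member against a perturbed observation
    Y = y[:, None] + rng.multivariate_normal(np.zeros(p), R, size=N).T
    return Xf + K @ (Y - H @ Xf)
```

The analysis ensemble mean is pulled toward the observation in observation space, and its spread is reduced accordingly.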

DA enables model performance evaluation by measuring the consistency between the model forecasts and the available observations. Computed directly within a DA cycle, the contextual model evidence (CME) is defined as the log-likelihood of the observations at a given time for a model, taking the forecast state as prior information (
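Under the Gaussian assumptions above, the CME can be computed from ensemble statistics as the log-likelihood of the innovation; the sketch below is illustrative (helper name and ensemble layout are assumptions, not the authors' implementation):

```python
import numpy as np

def contextual_model_evidence(Xf, y, H, R):
    """Gaussian innovation log-likelihood of observation y given the
    forecast ensemble Xf (n state dims, N members), i.e., the CME.

    log N(y; H x_f, H Pf H^T + R)
    """
    N = Xf.shape[1]
    xm = Xf.mean(axis=1)
    A = Xf - xm[:, None]
    Pf = A @ A.T / (N - 1)
    S = H @ Pf @ H.T + R          # innovation covariance
    d = y - H @ xm                # innovation
    p = len(y)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(S, d))
```

A higher CME at a given cycle indicates that the observation is more plausible under that model's forecast distribution.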

In principle, the DA formalism requires access to the numerical model to propagate the state of the system and obtain the forecasts. When performing the EnKF, a large number of model forecasts are generated, which represents a significant computational cost. An efficient and flexible data-driven alternative combines DA with a statistical forecasting strategy based on analogs (

In practice, it consists of fitting a local linear regression (
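A minimal sketch of such a locally linear analog forecast, assuming a catalog that pairs each stored state with its short-term successor (all names are illustrative):

```python
import numpy as np

def analog_forecast(x, catalog_states, catalog_successors, k=20):
    """Locally linear analog forecast: find the k nearest analogs of x
    in the catalog and regress their successors on the analogs."""
    d = np.linalg.norm(catalog_states - x, axis=1)
    idx = np.argsort(d)[:k]
    A = catalog_states[idx]          # analogs     (k, n)
    B = catalog_successors[idx]      # successors  (k, n)
    # Least-squares local linear map with intercept
    X = np.hstack([np.ones((k, 1)), A])
    coef, *_ = np.linalg.lstsq(X, B, rcond=None)
    return np.concatenate([[1.0], x]) @ coef
```

When the underlying dynamics are locally close to linear, the regression over the nearest analogs recovers the short-term propagator without re-running the model.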

The methodology described above is applied to an ensemble of

The CME time series can be processed in various ways to obtain a single scalar (i.e., the score) used to derive the model weight. In this study, three CME scores are defined, together with three benchmark scores (not using CME). Hence, we define

The model democracy (

The climatological score is based on the comparison of model and observation distributions. Here, there is no assimilation process, which means that this score does not evaluate the local dynamics. It measures the cumulative minimum of the two histograms, i.e., the common area between the distributions. A score close to 0 denotes a model that poorly simulates the observed distribution, while a score of 1 is obtained by a model with a perfect climatological distribution (
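A sketch of this histogram-overlap score (illustrative helper; the bin count and range handling are assumptions):

```python
import numpy as np

def climatological_overlap(sim, obs, bins=50, range_=None):
    """Overlap (common area) between two normalized histograms:
    the sum over bins of the minimum of the two densities."""
    if range_ is None:
        range_ = (min(sim.min(), obs.min()), max(sim.max(), obs.max()))
    h1, edges = np.histogram(sim, bins=bins, range=range_, density=True)
    h2, _ = np.histogram(obs, bins=bins, range=range_, density=True)
    width = np.diff(edges)
    return float(np.sum(np.minimum(h1, h2) * width))
```

Identical distributions yield a score of 1, while non-overlapping distributions yield 0.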

The single best model score assigns 1 to the model with the best climatological score and 0 to the others. In contrast to the democracy score, it takes into account only the best-performing model over the whole set of observations (

The CME-ClimWIP score is based on the ClimWIP approach (
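The general ClimWIP weight form combines a performance distance with pairwise inter-model distances through two shape parameters; the sketch below follows that general form, with the performance distances here assumed to be derived from CME (function and parameter names are illustrative):

```python
import numpy as np

def climwip_weights(perf, dist, sigma_d, sigma_s):
    """ClimWIP-type weights combining performance and independence.

    perf : (M,)   performance distance of each model to the observations
    dist : (M, M) pairwise distance between models
    sigma_d, sigma_s : shape parameters for performance and similarity
    """
    num = np.exp(-(perf / sigma_d) ** 2)
    # Independence term: models close to many others are down-weighted
    S = np.exp(-(dist / sigma_s) ** 2)
    np.fill_diagonal(S, 0.0)
    den = 1.0 + S.sum(axis=1)
    w = num / den
    return w / w.sum()
```

Smaller `sigma_d` sharpens the weights toward the best-performing models, while `sigma_s` controls how strongly near-duplicate models are penalized.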

The CME best punctual model score exploits the local-time performance provided by CME. It assigns a local value of 1 to the best model at time
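Under this definition, the weight of each model is simply the fraction of assimilation cycles at which it achieves the highest CME; a minimal sketch (names are illustrative):

```python
import numpy as np

def best_punctual_weights(cme):
    """cme : (M, T) CME series for M models over T assimilation cycles.
    At each cycle, the model with the highest CME receives a local value
    of 1 and the others 0; weights are the time averages of these values."""
    M, T = cme.shape
    winners = cme.argmax(axis=0)            # best model index per cycle
    counts = np.bincount(winners, minlength=M)
    return counts / T
```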

The CME best persistent model score

Here, the methodology for one experiment consists of two successive stages (Fig.

Schematic of the two-stage methodology for one experiment using 11 models.

The study is based on an idealized autonomous low-order deterministic model of the AMOC able to reproduce its millennial variability through chaotic dynamics (

Equation (

Numerical integrations are done using a fourth-order Runge–Kutta method with a time step of 1 year. All reference parameters are set following
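The integration scheme can be sketched generically as follows; the right-hand side `f` stands in for the AMOC model equations, which are not reproduced here:

```python
import numpy as np

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, x0, n_steps, dt=1.0):
    """Integrate a trajectory of n_steps with a fixed time step
    (1 year in the study)."""
    traj = np.empty((n_steps + 1, len(x0)))
    traj[0] = x0
    for i in range(n_steps):
        traj[i + 1] = rk4_step(f, traj[i], dt)
    return traj
```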

Various versions of the AMOC model are obtained by perturbing some parameters in the equations (

Chaotic trajectories in the phase space of the 11 perturbed versions of the three-variable (normalized here) AMOC model (

A model-as-truth experiment framework is set up (e.g.,

Distributions of normalized AMOC for the 11 model versions.

The different methods are evaluated using two protocols. In the first one, we consider a “perfect experiment”, where the model used to generate the pseudo-observations is also included in the catalog. In the second one, the model is excluded from the catalog, and we talk about an “imperfect experiment”. For both perfect and imperfect cases, we first describe an illustrative experiment in which model 8 is used to generate the pseudo-observations and then summarize the results when each model is alternately used as the truth, following a leave-one-out experiment design.

The goal of the perfect model approach is to measure the ability of the three CME-based scores to retrieve the correct model (by giving it a predominant weight) from a pool of catalogs, including the true one. The skills of these scores are then compared with the skills of the three benchmarks. In all the perfect model experiments of this study, the CME-ClimWIP score is computed using the optimal values of

AnDA is performed on the 11 models with the same pseudo-observations from model 8, and CME is computed for each assimilation cycle. The resulting CME series for each individual model varies significantly over time and differs locally from model to model (Fig.

AnDA results on 400 cycles using model 8 to generate the pseudo-observations: examples of three assimilated models among the 11.

CME distributions from AnDA results using model 8 to generate the pseudo-observations. CME distributions are on the

The climatology-based score produces almost uniform weights close to model democracy (Table

Perfect model results for model 8 as the pseudo-observation. For each score (in row): weights (in %) associated with the 11 models. The sum of the weights per row is 100. Values in bold highlight the model with the highest weight for each score.

By varying the model used to generate the pseudo-observations in the 11 AnDA experiments, we can robustly assess the ability of the six scores to efficiently retrieve the correct model (Table

Summarized perfect model results of the leave-one-out model-as-truth experiments. For each column representing an experiment (i.e., the model index used to generate pseudo-observations), the index of the model with the highest weight is specified for each score in the row. The indices in bold show when the true model is recovered. The last column summarizes how frequently the correct model is identified among the candidate models across the 11 experiments for the five scores. Note that “N/A” is indicated for model democracy since the score prevents differentiation between models.

The imperfect model approach aims to reconstruct the statistical properties of the distribution of a missing model using distinct models. In all the imperfect experiments of this study, the CME-ClimWIP score is computed using the optimal values of

The model democracy approach yields the worst reconstruction (Fig.

Imperfect model results of the model 8 experiment (by excluding it).

When the 11 models are reconstructed independently in the imperfect framework, the three CME-based scores produce better reconstructions on average than the democracy and climatology scores (Fig.

The reconstruction performances in the 11 experiments show that CME-ClimWIP greatly outperforms the model democracy for 7 of 11 model reconstructions. As for model 8 reconstruction, CME-ClimWIP gives higher weights to a small subset of models, making it well suited to reconstructing distributions that share similarities with some others in the ensemble.

On the other hand, CME-ClimWIP is close to but less useful than democracy (and climatology) for reconstructing models 3, 5, 9, and 10. Here, although all six scores lose performance, uniform weights are appropriate for a good reconstruction. A typical example of such a case is the reconstruction of model 10, whose distribution is quite symmetric, whereas all the other models have asymmetric distributions. For the model 3, 5, 9, and 10 experiments, CME best punctual and CME best persistent give better results than CME-ClimWIP. It is worth noting that these two scores are also more reliable across all 11 experiments than CME-ClimWIP, since their reconstruction performance remains consistently superior to the model democracy score, with minimal variation between experiments.

Imperfect model results (i.e., excluding the true model) of the leave-one-out model-as-truth experiments.

As expressed in Eq. (

Sensitivity of the reconstruction performance (based on overlap of the reconstruction distribution with the truth) to various alternative/simplified versions of the ClimWIP score (first column) in the imperfect model approach. The second column shows the overlap associated with the model 8 experiment. The third column shows the results of all leave-one-out experiments combined.

The degree of dependence between models does not contribute significantly to the reconstruction performance (Table

This study aims at developing a data-driven methodology for weighting models based only on their ability to represent the dynamics of observations. For this purpose, a set of dynamical models is compared to noisy observations, with model equations replaced by available simulations. The method combines a machine learning approach (i.e., an analog forecasting method) to estimate the model forecasts in a cost-effective manner (i.e., without the need to re-run the model) and a sequential data assimilation algorithm (i.e., the ensemble Kalman filter). The time-varying performance of the models with respect to observations is evaluated using the contextual model evidence (also known as innovation likelihood), which benefits from DA properties.

To test this methodology, an ensemble of 11 models is obtained by perturbing parameters of an idealized chaotic model of the AMOC. For each model version, the equations are only used to generate long simulations of its three variables. Each version alternately plays the role of truth (i.e., leave-one-out experiments) used to construct pseudo-observations of the AMOC component. In the 11 experiments, model weights are extracted from the CME series using different strategies. They take into account the performance of the models with respect to pseudo-observations and the degree of similarity with other models. The method is then assessed by applying the weights to the distributions of long-term model simulations. In this way, we test the extent to which the short-term dynamics of individual models can provide relevant information for reconstructing the statistics of a targeted distribution. Reconstruction performance is measured by the percentage of overlap between the reconstructed and the true distributions. The reconstruction performance associated with the three CME-based scores is compared to three benchmark approaches that do not involve the DA framework (i.e., model democracy, climatological distributions comparison, and best single model).

The results of the perfect model approach highlight the better performance of CME-based scores in recovering the correct model compared to the benchmarks, suggesting the importance of using DA. The benchmark strategies generally suffer from a lack of discrimination between models, preventing correct identification of the right one. In the imperfect model approach, the CME-based scores reconstruct the targeted distribution more accurately than the benchmarks using the same partial noisy observations. This emphasizes the valuable information contained within the short-term dynamics, rather than the general information provided by climatological statistics, enabling efficient differentiation between models. The results underline that CME-based scores can be seen as a compromise between the “democracy score” and the so-called “dictatorial score”, which selects only a single model. Among the CME-based scores, the CME-ClimWIP score is relatively closer to “dictatorship”. It is suitable for reconstructing distributions sharing similarities with a few models in the ensemble, when democracy is not appropriate. On the other hand, the CME best punctual and CME best persistent scores are closer to “democracy”. By exploiting temporal performance, their weights are more adaptive, which helps them outperform the already strong democracy score in every experiment (including when CME-ClimWIP does not succeed).

An inherent assumption of our study is the Gaussianity of the EnKF forecast distributions. Testing its validity through a Jarque–Bera test (
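The Jarque–Bera statistic compares the sample skewness S and kurtosis K with their Gaussian values, JB = n/6 (S² + (K − 3)²/4), and is referred to a χ² distribution with 2 degrees of freedom. A minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def jarque_bera_stat(x):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4),
    where S and K are the sample skewness and kurtosis."""
    x = np.asarray(x, float)
    n = x.size
    m = x - x.mean()
    s2 = np.mean(m ** 2)
    S = np.mean(m ** 3) / s2 ** 1.5      # skewness
    K = np.mean(m ** 4) / s2 ** 2        # kurtosis
    return n / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)
```

Large values of the statistic (relative to the χ²(2) critical value, e.g., 5.99 at the 5 % level) lead to rejecting the Gaussianity assumption for a given forecast ensemble.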

In the current study, the weights are fixed and do not change over time, as long-term averages of stationary series are reconstructed. From a methodological standpoint, there is potential to better leverage the local-time properties of CME by computing time-dependent weights. This would be especially relevant in a non-autonomous framework (e.g., in the context of climate change).

This study initiates the implementation of the methodological framework, whose ultimate goal is to weight a larger ensemble of dynamical models representing more complex systems, such as those provided by CMIP6 (

Including CME as a metric in the CME-ClimWIP expression brings the score into the Bayesian model averaging (BMA) framework, which relies on marginal likelihoods. Specifically,

Python codes developed for the current study are available in the GitHub open repository (

All authors contributed equally to the experimental design. PLB wrote the article and performed the numerical experiments and the analyses. FS, PT, JR, and PA helped with the writing of the paper.

At least one of the (co-)authors is a member of the editorial board of

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

The authors acknowledge the financial support from the Région Bretagne and from the ISblue project (Interdisciplinary graduate School for the blue planet, ANR-17-EURE-0015). The authors also want to thank Maxime Beauchamp and Noémie Le Carrer for their helpful comments on the study.

This research has been supported by the AMIGOS project from the Région Bretagne (ARED fellowship for PLB) and the ISblue project (Interdisciplinary graduate School for the blue planet, ANR-17-EURE-0015), co-funded by a grant from the French government under the Investissements d'Avenir program.

This paper was edited by Wansuo Duan and reviewed by Jie Feng and one anonymous referee.