Small forecast ensemble sizes (

Geophysical forecast models are often computationally expensive to run. As a result, geophysical ensemble data assimilation (EnsDA) typically uses

Several types of ensemble expansion methods have been proposed in the literature, each with strengths and weaknesses. The first type is random draws from climatology

An alternative type of ensemble expansion method is to aggregate forecast ensemble members across time

A third type of ensemble expansion method is to search a historical catalog for forecast states similar to the current forecast or observations

Ensemble modulation

The shortcomings of existing ensemble expansion methods motivate the development of a new ensemble expansion method. This study proposes a new ensemble expansion method that explicitly utilizes the users' knowledge of prior marginals, the Probit-space Ensemble Size Expansion for Gaussian Copulas (PESE-GC). PESE-GC constructs virtual members using a generalization of the efficient and scalable Gaussian resampling algorithm of

An illustration of how PESE-GC can be integrated into a typical EnsDA cyclic workflow. This workflow is meant to be read starting from the green box labeled START. The arrows indicate the movement of various kinds of information (see legend). For example, the fat orange arrows indicate that the virtual members are created by PESE-GC (red polygon), passed to the observation operators (rounded purple box), passed to the EnsDA algorithm (rounded brown box), and then removed before applying the forecast model (black polygon). Obs stands for observation and ens stands for ensemble.

The remainder of this publication is divided into five sections. Section

This section begins by reviewing the CAC2020 algorithm. The CAC2020 algorithm is then generalized to handle arbitrary piecewise continuous marginal distributions (i.e., one-dimensional distributions) using probit probability integral (PPI) transforms. Finally, the computational complexity and scalability of PESE-GC are discussed.

The CAC2020 algorithm constructs Gaussian-distributed virtual members through linear combinations of the forecast ensemble perturbations. The resulting expanded ensemble has the same mean state and covariance matrix as the forecast ensemble. The CAC2020 algorithm was first formulated by
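For concreteness, the core linear-combination construction can be sketched in Python. This is an illustrative sketch only: with the coefficient scaling shown here, the virtual members' population mean and covariance equal the forecast ensemble's sample mean and covariance, while the full CAC2020 algorithm (whose three steps are described below) applies additional adjustments so that any finite batch of virtual members matches them exactly.

```python
import numpy as np

def gaussian_resample(ens, n_virtual, rng=None):
    """Draw virtual members as linear combinations of forecast perturbations.

    ens: (n_vars, n_ens) array of forecast members (one member per column).
    Returns an (n_vars, n_virtual) array of Gaussian-distributed virtual
    members whose population mean and covariance equal the forecast
    ensemble's sample mean and covariance."""
    rng = np.random.default_rng() if rng is None else rng
    n_vars, n_ens = ens.shape
    mean = ens.mean(axis=1, keepdims=True)
    perts = ens - mean                                   # perturbation matrix
    # Gaussian combination coefficients, one column per virtual member;
    # the 1/sqrt(n_ens - 1) scaling reproduces the sample covariance.
    coeffs = rng.standard_normal((n_ens, n_virtual)) / np.sqrt(n_ens - 1)
    return mean + perts @ coeffs
```

Because each virtual member is an independent linear combination, the construction is embarrassingly parallel across virtual members.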

To write down the CAC2020 algorithm, a notation similar to

The CAC2020 algorithm constructs

The CAC2020 algorithm's second step is to generate

In the third and final step, the CAC2020 algorithm generates

Note that this study's (and the CAC2020's)

To illustrate the issue with the CAC2020's

Plots demonstrating the impacts of various heuristic choices in the formulation of PESE-GC. Panels

The CAC2020 algorithm is efficient and scales well with parallel computing. This is because steps 2 and 3 of the CAC2020 algorithm (Sect.

The CAC2020 algorithm also produces expanded ensembles with the same ensemble means and ensemble covariances as the forecast ensembles. Consequently, the rank of the expanded ensemble's covariance matrix is the same as that of the forecast ensemble. Future work can explore ways to incorporate localization into the expanded ensemble's covariance matrix.

Furthermore, the CAC2020 algorithm always generates Gaussian-distributed virtual members; even if the actual forecast distribution is highly non-Gaussian, the virtual members' distribution will still be Gaussian. The CAC2020 algorithm thus degrades the ensemble statistics in situations where the forecast distribution is non-Gaussian. This degradation limits the usefulness of the CAC2020 algorithm for situations with non-Gaussian forecast distributions.

Note that, although the mean and covariance are preserved, the expanded ensemble's higher-order central moments (e.g., skewness) likely differ from the forecast ensemble's. More specifically, the expanded ensemble's central moments will be closer to those of Gaussian distributions (e.g., zero skewness) than the forecast ensemble's central moments. This is because the virtual members are effectively drawn from a Gaussian distribution. If the forecast distribution is indeed a Gaussian distribution, then the expanded ensemble likely has better moments than the forecast ensemble.

The CAC2020 algorithm is limited to generating Gaussian-distributed virtual members. PESE-GC overcomes this limitation by combining probit probability integral (PPI) transforms and their inverses with the CAC2020 algorithm. A PPI transform maps any univariate distribution with a continuous CDF into a standard normal distribution, and the inverse PPI transform reverses the process. The quantity resulting from applying the PPI transform on a random variable is called a “probit”, and the coordinate space occupied by probits is called “probit space”. Such transforms are often used in Gaussian anamorphosis
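As a concrete illustration, the PPI transform and its inverse each take one line of Python once the marginal CDF and quantile function are available (the gamma marginal in the example is an arbitrary choice for illustration, not a distribution prescribed by this study):

```python
import numpy as np
from scipy.stats import norm, gamma

def ppi_transform(x, marginal_cdf):
    """PPI transform: probit = Phi^{-1}(F(x)), with F the marginal CDF
    and Phi^{-1} the standard normal quantile function."""
    return norm.ppf(marginal_cdf(x))

def inverse_ppi_transform(probit, marginal_ppf):
    """Inverse PPI transform: x = F^{-1}(Phi(probit))."""
    return marginal_ppf(norm.cdf(probit))

# Example: a gamma-distributed sample is mapped to probit space and back.
marginal = gamma(a=2.0)
x = marginal.rvs(size=1000, random_state=0)
z = ppi_transform(x, marginal.cdf)              # approximately standard normal
x_back = inverse_ppi_transform(z, marginal.ppf)  # recovers x
```

When the supplied CDF matches the variable's true marginal, the resulting probits follow a standard normal distribution regardless of how non-Gaussian the original variable is.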

Illustrations of PESE-GC’s four-stage algorithm. Panels

To define the PPI transform, suppose

The PPI transform generalizes the CAC2020 algorithm to handle non-Gaussian forecast ensembles. The resulting PESE-GC procedure has four stages and is illustrated in Fig.

For each model state variable, fit a user-specified univariate distribution to that model variable in the forecast ensemble (i.e., marginal distribution fitting).

For each model state variable, apply the PPI transform (Eq.

For each model state variable, adjust the mean and variance of that variable's forecast ensemble probits to zero and unity, respectively (explained in Sect.

For each model state variable, apply the inverse PPI transform (Eq. (

Note that this four-stage procedure assumes that the multivariate forecast distribution is Gaussian in probit space. This assumption arises from the use of a Gaussian resampling algorithm (the CAC2020 algorithm) to generate virtual probits. This assumption is equivalent to assuming that the multivariate forecast distribution has a Gaussian copula. As such, this four-stage procedure is called Probit-space Ensemble Size Expansion for Gaussian Copulas.
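The four stages can be collected into a compact Python sketch. This is an illustration rather than the implementation used in this study: Gaussian marginals are assumed in stage 1 for simplicity (other marginal families would replace the `norm(...)` calls), and stage 3 uses a simplified CAC2020-style linear combination of probit perturbations.

```python
import numpy as np
from scipy.stats import norm

def pese_gc(ens, n_virtual, rng=None):
    """Minimal PESE-GC sketch with Gaussian marginals fitted per variable.

    ens: (n_vars, n_ens) forecast ensemble; returns (n_vars, n_virtual)
    virtual members."""
    rng = np.random.default_rng() if rng is None else rng
    n_vars, n_ens = ens.shape
    probits = np.empty_like(ens)
    marginals = []
    for i in range(n_vars):                      # embarrassingly parallel loop
        # Stage 1: fit a user-specified marginal to variable i.
        m = norm(loc=ens[i].mean(), scale=ens[i].std(ddof=1))
        marginals.append(m)
        # Stage 2: PPI transform into probit space.
        probits[i] = norm.ppf(m.cdf(ens[i]))
        # Stage 3a: adjust probits to zero mean and unit variance
        # (a near no-op for exact Gaussian fits, but needed in general).
        probits[i] = (probits[i] - probits[i].mean()) / probits[i].std(ddof=1)
    # Stage 3b: Gaussian resampling of the probits (CAC2020-style).
    coeffs = rng.standard_normal((n_ens, n_virtual)) / np.sqrt(n_ens - 1)
    virtual_probits = probits @ coeffs
    # Stage 4: inverse PPI transform back to model space.
    virtual = np.empty((n_vars, n_virtual))
    for i in range(n_vars):                      # embarrassingly parallel loop
        virtual[i] = marginals[i].ppf(norm.cdf(virtual_probits[i]))
    return virtual
```

Only stage 3b couples different model variables; stages 1, 2, 3a, and 4 operate on one variable at a time, which is the source of the embarrassing parallelism noted below.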

PESE-GC's four-stage procedure is attractive for geophysical EnsDA for several reasons. Aside from the fact that it can generate non-Gaussian virtual members, PESE-GC can be implemented in an embarrassingly parallel fashion (every loop over the model state variables is embarrassingly parallel). Furthermore, PESE-GC is likely affordable for geophysical EnsDA because the CAC2020 algorithm (stage 3) is efficient (see Sect.

Note that the quality of the virtual members depends on the distributions the user selects in step 1 of PESE-GC. This will be discussed later in Sect.

PESE-GC requires forecast ensemble probits with zero mean and unity variance. Otherwise, the resulting virtual members will disobey the marginal distributions fitted in PESE-GC's step 1. However, because the forecast ensemble size is finite, the forecast ensemble's probits may have non-zero mean and non-unity variance. To illustrate, suppose PESE-GC is applied to five univariate forecast ensemble members (red crosses in Fig.

This problematic disagreement is resolved by adjusting the forecast probits' mean and variance to zero and unity (respectively) before generating the virtual members. Suppose the probit of the
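The adjustment itself is a one-line standardization (the sample variance with the N-1 denominator is assumed here):

```python
import numpy as np

def standardize_probits(probits):
    """Shift and rescale a sample of probits so that it has exactly
    zero mean and unit sample variance (N-1 denominator)."""
    return (probits - probits.mean()) / probits.std(ddof=1)
```

After this adjustment, the Gaussian resampling step sees probits whose sample statistics agree with the standard normal distribution implied by the fitted marginals.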

To understand the influence of PESE-GC on EnsDA, consider a joint model–observation space formulation of Bayes' rule:

Plots of the true forecast PDF

This section explores the influence of PESE-GC on EnsDA through those two mechanisms. A bivariate example is used to illustrate those mechanisms. Suppose a scalar forecast model variable

For certain EnsDA algorithms that employ ensemble-based representations of the observation likelihood function, PESE-GC can improve those representations (impact mechanism 1). Two EnsDA algorithms will be considered: (1) the rank histogram filter (RHF;

Bivariate example demonstrating the impacts of drawing virtual members from an informative fitted marginal (normal distribution). Panel

Bivariate example demonstrating the impacts of drawing virtual members from a misinformed fitted marginal (gamma distribution). The panels here are similar to Fig.

Impact mechanism 1 also manifests for the serial stochastic EnKF. To see that, consider a situation with two observations and recall that the serial stochastic EnKF uses random draws from a univariate Gaussian distribution to represent the likelihood function (one draw per ensemble member). For small ensembles, only a few of those random draws are made. In other words, there are sampling errors in representing the likelihood function. The ensemble statistics resulting from assimilating the first observation are thus degraded by those sampling errors. This degradation then affects the assimilation of the second observation. The assimilation of more than two observations compounds such sampling issues. Since PESE-GC increases the ensemble size, more draws from the likelihood function are made, thus suppressing sampling errors. As such, in the absence of other factors, PESE-GC will improve the performance of the serial stochastic EnKF.
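The per-member likelihood draws described above can be made concrete with a scalar perturbed-observation update (a minimal sketch of one serial stochastic EnKF step; the sorted-increment refinement is omitted, and all variable names are illustrative):

```python
import numpy as np

def stochastic_enkf_scalar(prior_obs, prior_state, obs, obs_var, rng=None):
    """One serial stochastic EnKF update for a scalar observation.

    prior_obs: ensemble of forecast observations (one value per member).
    prior_state: ensemble of a forecast state variable.
    Each member assimilates its own perturbed copy of the observation,
    i.e., one random draw from the likelihood per member."""
    rng = np.random.default_rng() if rng is None else rng
    n = prior_obs.size
    # Few members means few likelihood draws, hence sampling error.
    perturbed_obs = obs + rng.normal(0.0, np.sqrt(obs_var), size=n)
    var_f = prior_obs.var(ddof=1)
    gain = np.cov(prior_state, prior_obs, ddof=1)[0, 1] / (var_f + obs_var)
    return prior_state + gain * (perturbed_obs - prior_obs)
```

Increasing the ensemble size (e.g., via PESE-GC) increases the number of `perturbed_obs` draws, which is precisely the suppression of likelihood-sampling error described above.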

Note that many EnsDA algorithms are immune to impact mechanism 1. The deterministic variants of the EnKF

If the user specifies appropriate marginals in PESE-GC's stage 1, then PESE-GC will improve the ensemble statistics used by EnsDA algorithms (mechanism 2). This will be illustrated using the bivariate five-member example discussed near the start of Sect.

An important caveat is that if the user selects misinformed marginal distributions, then PESE-GC may degrade the ensemble statistics used by EnsDA algorithms. To illustrate, suppose the user fits a shifted gamma distribution

In the absence of knowledge about the model variables' prior marginal distribution, users can use non-parametric marginal distributions with PESE-GC. Such distributions include the Gaussian-tailed rank histogram distribution
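As a rough illustration of a non-parametric choice, member ranks can serve as a stand-in marginal CDF. This simple plotting-position scheme ignores the careful tail treatment of the Gaussian-tailed rank histogram distribution and is shown only to convey the idea:

```python
import numpy as np
from scipy.stats import norm, rankdata

def rank_based_probits(x):
    """Map each ensemble member to a probit via its rank, using the
    plotting position (rank - 0.5)/N as an approximate CDF value.
    A simplified stand-in for a non-parametric marginal."""
    n = x.size
    u = (rankdata(x) - 0.5) / n        # approximate CDF values in (0, 1)
    return norm.ppf(u)
```

Because only ranks enter the transform, no parametric family needs to be specified, at the cost of a cruder representation of the marginal's tails.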

Note that when linear observations are assimilated, PESE-GC with Gaussian marginals will not change the performance of deterministic variants of the EnKF

This section explores the impacts of PESE-GC on the performance of EnsDA using perfect model Observing System Simulation Experiments (OSSEs) with the Lorenz 1996 model (L96 model;

The L96 model uses 40 variables (i.e., 40 grid points in a ring), a forcing parameter value of 8 (i.e.,
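The L96 dynamics are simple to state and integrate; a minimal sketch follows (the fourth-order Runge–Kutta scheme and the step size in the test are illustrative assumptions, not necessarily the configuration used in this study):

```python
import numpy as np

def l96_tendency(x, forcing=8.0):
    """Lorenz 1996 tendencies on a ring of grid points:
    dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, dt, forcing=8.0):
    """Advance the L96 state by one fourth-order Runge-Kutta step."""
    k1 = l96_tendency(x, forcing)
    k2 = l96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = l96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = l96_tendency(x + dt * k3, forcing)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

Note that the uniform state x_i = F is an equilibrium of these equations; chaotic behavior emerges once that equilibrium is perturbed.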

In all experiments, there are 40 observations. Their observation locations are fixed throughout this study. Supposing that the model grid points have locations

PESE-GC's impacts are examined using EnsDA experiments that are conducted with four

The PESE-GC-expanded ensemble sizes are specified in terms of factors: 5, 10, and 20 times the forecast ensemble size. For example, if

Supposing the

The four EnsDA algorithms tested with PESE-GC are

the ensemble adjustment Kalman filter (EAKF;

the serial stochastic EnKF with sorted observation increments (EnKF;

the rank histogram filter with linear regression (RHF;

the rank histogram filter with probit regression (PR;

For each EnsDA algorithm, only one set of marginals is used with PESE-GC. When PESE-GC is used with the EAKF, EnKF, or RHF algorithm, Gaussian marginals are selected for all 40 model variables. PESE-GC with Gaussian marginals is identical to the CAC2020 algorithm. In other words, for the EAKF, EnKF, and RHF experiments, the virtual ensemble members follow multivariate Gaussian distributions. For the PR algorithm, the Gaussian-tailed rank histogram is selected as the marginal for every one of the 40 model variables. This means the PR experiments' virtual ensemble members follow multivariate non-Gaussian distributions. Future work can investigate the impacts of using PESE-GC with Gaussian-tailed rank histograms (or kernel density estimates) with the EAKF, EnKF, and RHF.

Each of the 1440 configurations is trialed 36 times. These trials are enumerated (Trial 1, Trial 2, and so forth). All experiments with the same trial number and

The Gaspari–Cohn fifth-order rational function
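That localization function has a closed form (the piecewise fifth-order rational function of Gaspari and Cohn, 1999), compactly supported at twice the chosen half-width:

```python
import numpy as np

def gaspari_cohn(dist, half_width):
    """Gaspari-Cohn fifth-order piecewise rational localization function.

    dist: array of (non-negative) distances; half_width: localization
    half-width. Returns weights in [0, 1] that reach zero at
    dist = 2 * half_width."""
    r = np.abs(np.asarray(dist, dtype=float)) / half_width
    out = np.zeros_like(r)
    near = r < 1.0
    far = (r >= 1.0) & (r < 2.0)
    rn = r[near]
    out[near] = (-0.25 * rn**5 + 0.5 * rn**4 + 0.625 * rn**3
                 - (5.0 / 3.0) * rn**2 + 1.0)
    rf = r[far]
    out[far] = (rf**5 / 12.0 - 0.5 * rf**4 + 0.625 * rf**3
                + (5.0 / 3.0) * rf**2 - 5.0 * rf + 4.0 - (2.0 / 3.0) / rf)
    return out
```

The weight equals 1 at zero distance and decays smoothly to 0, mimicking a Gaussian while guaranteeing exactly zero correlation beyond twice the half-width.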

The inflation scheme used here is identical to the one used by

The impacts of PESE-GC on EnsDA are assessed using the relative difference between cycle-averaged RMSEs (Eq.

Only statistically significant trial-averaged

Statistically significant

Similar to Fig.

Similar to Fig.

Before proceeding, note that PESE-GC with Gaussian marginals only negligibly changes the performance of the EAKF with IDEN observations (Fig.

This study first examines the

The first common

The RMSE reductions seen in the 10/20-member EAKF, EnKF, and RHF experiments are also partly due to improved ensemble statistics (i.e., mechanism 2; Sect.

The second common pattern in the 20-fold PESE-GC experiments is that with increasing

For the EAKF, EnKF, and RHF experiments, impact mechanism 2 also contributes to the worsening of PESE-GC's RMSE impacts with increasing

A third common pattern is that with longer cycles, PESE-GC's RMSE impacts on the EAKF, EnKF, and RHF degrade (i.e.,

The fourth common pattern is that in the PR experiments, for

Note that the chain of events discussed in the previous paragraph likely occurs for the RHF experiments as well. Since the RHF experiments do not exhibit the fourth common pattern, it is likely that the inappropriateness of the Gaussian marginals used with PESE-GC overwhelms improvements introduced by refining the piecewise approximation.

This study now examines common patterns in how PESE-GC's impacts vary with PESE-GC expansion factors. The first common pattern is that PESE-GC's impacts on the PR experiments tend to weaken with smaller PESE-GC factors (panels a4, b4, and c4 in Figs.

The second common pattern is that, for

An interesting third common pattern is also visible in the EAKF, EnKF, and RHF experiments: there are instances where reducing PESE-GC factors (1) turns insignificant RMSE impacts into RMSE improvements (e.g., the lower-left corner of Fig.

Most importantly, even with a mere fivefold PESE-GC, PESE-GC improves the performance of EnsDA in three types of situations. First, all EnsDA experiments involving small forecast ensemble sizes (10 members) are improved by PESE-GC. Second, situations where using Gaussian marginals with PESE-GC improves ensemble statistics are also improved by PESE-GC. This second type of situation occurs for the EAKF, EnKF, and RHF experiments that have either (1) 20–40 ensemble members and/or (2) cycling intervals that are 0.30

The results presented in the previous section are encouraging. However, a caveat about PESE-GC needs discussion: PESE-GC assumes that the forecast distribution is a multivariate Gaussian distribution in probit space (henceforth, the Gaussian copula assumption). If that assumption is violated, the virtual members will possess statistical artifacts.

Two bivariate demonstrations of PESE-GC. In each demonstration, 100 initial members are drawn from a true bivariate PDF

Figure

Since the forecast PDF in Fig.

An example where the Gaussian copula assumption is violated is shown in Fig.

Applying PPI transforms on this bivariate bi-Gaussian forecast PDF reveals that the bi-Gaussian PDF violates the Gaussian copula assumption (Fig.

Note that though the virtual members' PDF deviates from the forecast PDF, a strong similarity exists between the two PDFs. The two dominant modes of the virtual members' PDF are very similar to the bi-Gaussian forecast PDF. More generally, milder violations of the Gaussian copula assumption will likely lead to milder spurious statistical features in the virtual members.

More importantly, PESE-GC's Gaussian copula assumption may not be problematic for geophysical EnsDA. Due to the high dimensionality of geophysical models and small forecast ensemble sizes, it is difficult to identify the family of the multivariate forecast distributions in probit space. In other words, the forecast ensemble's statistics in probit space are likely indistinguishable from a multivariate Gaussian. This indistinguishability permits assuming Gaussian copulas. Future work can investigate this possibility.

It is also important to discuss the impacts of PESE-GC on the EnsDA process (i.e., the forecast step and analysis step). Since the virtual members are deleted before running forecast models (Fig.

However, the increase in the computational cost associated with PESE-GC is likely far more affordable than running a larger forecast ensemble. This is because the computational cost of the forecast step often accounts for

In this study, an efficient and embarrassingly parallel algorithm to increase ensemble sizes, PESE-GC, is formulated. PESE-GC generalizes the efficient and embarrassingly parallel Gaussian resampling algorithm of

Three mechanisms are then identified for PESE-GC to influence the performance of EnsDA. First, for EnsDA methods like the serial stochastic EnKF and the rank histogram filter, PESE-GC improves the representation of the observation likelihood function. Second, by expanding the number of ensemble members, PESE-GC increases the sampling of the observation operator. This increased sampling improves the forecast observations' PDF. Finally, when users use PESE-GC with informative marginal distribution families, the forecast observations' statistics are improved.

The impacts of PESE-GC on the performance of EnsDA are explored using the L96 model, a variety of observation systems, and a variety of EnsDA algorithms. Results indicate that PESE-GC generally improves the performance of EnsDA when (1) the forecast ensemble size is small (10 members), (2) the marginal distribution families used with PESE-GC are informative, and/or (3) PESE-GC improves the representation of the observation likelihood function (the PR experiments in Sect.

There are two general areas for future work with PESE-GC. The first area is to move PESE-GC towards geophysical models (EnsDA or forecast postprocessing). To do so, PESE-GC needs to first be tested with ensemble members created by geophysical models (e.g., Weather Research and Forecasting model;

Another general area for future work is to develop the PESE-GC algorithm further. First, given the importance of localization in practical EnsDA, future work can and should explore inserting localization into PESE-GC. Second, the validity of PESE-GC's Gaussian copula assumption can be assessed in the context of geophysical modeling and forecasting. If the Gaussian copula assumption is inappropriate, then non-parametric methods to generate virtual probits can be explored. Third, methods to detect the usage of misinformed parametric marginal distribution families deserve exploration. One possible detection method is to employ hypothesis testing on the marginal distributions. For example, if Gaussian distributions are selected for PESE-GC, then the Shapiro–Wilk test can be applied on the forecast ensemble to determine if the selection is misinformed (e.g.,
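The hypothesis-testing idea above is straightforward to apply; for example, with SciPy's Shapiro–Wilk test (the synthetic ensembles below are purely illustrative):

```python
import numpy as np
from scipy.stats import shapiro

# Check whether a Gaussian marginal choice is contradicted by the
# forecast ensemble: a small p-value suggests the Gaussian family is
# misinformed for that model variable.
rng = np.random.default_rng(0)
gaussian_ens = rng.normal(size=80)           # Gaussian forecast marginal
skewed_ens = rng.gamma(shape=1.0, size=80)   # strongly non-Gaussian marginal

p_gauss = shapiro(gaussian_ens).pvalue
p_skew = shapiro(skewed_ens).pvalue          # expected to be very small
```

Variables whose p-values fall below a chosen significance level could then be switched to a non-parametric marginal before running PESE-GC.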

The computational cost of running geophysical models will continue to increase in the coming years (higher spatial resolution, shorter time steps, more complex parameterization schemes, etc.). Geophysical EnsDA groups will continue to grapple with the challenge of balancing the computational costs of increasing the number of forecast ensemble members and the computational costs of using more realistic geophysical models. If ensemble expansion methods can provide much of the benefit of a larger forecast ensemble size at a fraction of the cost, these methods will enable EnsDA groups to employ more realistic geophysical models.

The code used in this study is publicly available at

Due to the immense number of experiments performed in this study (881 280 experiments), only the performance metrics of each experiment are archived. These performance metrics are consolidated into text files that are available at

The supplement related to this article is available online at:

The author has declared that there are no competing interests.

Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

The author is eternally grateful to Jeffrey L. Anderson and the National Center for Atmospheric Research's Data Assimilation Research Section for useful discussions and guidance. The author thanks Mohamad El Gharamti and Craig Schwartz for helping to improve the explanation of the PESE-GC algorithm. Furthermore, the author thanks the participants of the International Symposium for Data Assimilation 2023 (ISDA), Alberto Ortolani in particular, for discussions that further clarified the author's thinking process and explanations. The author is also grateful to Yao Zhu and Christopher Hartman for checking the readability of this paper. Finally, the author would like to thank Olivier Talagrand (the editor), Ian Grooms (reviewer), and Lili Lei (reviewer) for their thorough review of this paper and for their constructive feedback.

This study is supported by the Advanced Study Program Postdoctoral Fellowship at the National Center for Atmospheric Research (NCAR) and The Ohio State University. NCAR is sponsored by the National Science Foundation. All computations in this study are done on two NCAR computing clusters: Casper and Cheyenne. These clusters are managed by NCAR's Computational and Information Systems Laboratory.

This research has been supported by the National Science Foundation (grant no. 1852977).

This paper was edited by Olivier Talagrand and reviewed by Ian Grooms and Lili Lei.