Simulation-based comparison of multivariate ensemble post-processing methods

. Many practical applications of statistical postprocessing methods for ensemble weather forecasts require accurate modeling of spatial, temporal, and inter-variable dependencies. Over the past years, a variety of approaches has been proposed to address this need. We provide a compre-hensive review and comparison of state-of-the-art methods for multivariate ensemble post-processing. We focus on generally applicable two-step approaches where ensemble predictions are ﬁrst post-processed separately in each margin and multivariate dependencies are restored via copula functions in a second step. The comparisons are based on simulation studies tailored to mimic challenges occurring in practical applications and allow ready interpretation of the effects of different types of misspeciﬁcations in the mean, variance, and covariance structure of the ensemble forecasts on the performance of the post-processing methods. Overall, we


Introduction
Despite continued improvements, ensemble weather forecasts often exhibit systematic errors that require correction via statistical post-processing methods. Such calibration approaches have been developed for a wealth of weather variables and specific applications. The employed statistical techniques include parametric distributional regression models Raftery et al., 2005) as well as nonparametric approaches (Taillardat et al., 2016) and semiparametric methods based on modern machine learning techniques (Rasp and Lerch, 2018). We refer to Vannitsem et al. (2018) and Vannitsem et al. (2020) for a general overview and review.
While many of the developments have been focused on univariate methods, many practical applications require one to accurately capture spatial, temporal, or inter-variable dependencies (Schefzik et al., 2013). Important examples include hydrological applications (Scheuerer et al., 2017), air traffic management (Chaloulos and Lygeros, 2007), and energy forecasting (Pinson and Messner, 2018). Such dependencies are present in the raw ensemble predictions but are lost if standard univariate post-processing methods are applied separately in each margin.
Over the past years, a variety of multivariate postprocessing methods has been proposed; see Schefzik and Möller (2018) for a recent overview. Those can roughly be categorized into two groups of approaches. The first strategy aims to directly model the joint distribution by fitting a specific multivariate probability distribution. This approach is mostly used in low-dimensional settings or if a specific structure can be chosen for the application at hand. Examples include multivariate models for temperatures across space (Feldmann et al., 2015), for wind vectors (Schuhen et al., 2012;Lang et al., 2019), and joint models for temperature and wind speed Möller, 2015, 2017).
The second group of approaches proceeds in a two-step strategy. In a first step, univariate post-processing methods are applied independently in all dimensions, and samples are generated from the obtained probability distributions. In a second step, the multivariate dependencies are restored by re-arranging the univariate sample values with respect to the rank order structure of a specific multivariate dependence template. Mathematically, this corresponds to the application of a (parametric or non-parametric) copula. Examples include ensemble copula coupling (Schefzik et al., 2013), the Schaake shuffle (Clark et al., 2004), and the Gaussian copula approach (Möller et al., 2013). 1 Here, we focus on this second strategy, which is more generally applicable in cases where no specific assumptions about the parametric structure can be made or where the dimensionality of the forecasting problem is too high to be handled by fully parametric methods. The overarching goal of this paper is to provide a systematic comparison of state-ofthe-art methods for multivariate ensemble post-processing. In particular, our comparative evaluation includes recently proposed extensions of the popular ensemble copula coupling approach (Hu et al., 2016;Ben Bouallègue et al., 2016). We propose three simulation settings which are tailored to mimic different situations and challenges that arise in applications of post-processing methods. In contrast to case studies based on real-world datasets, simulation studies allow one to specifically tailor the multivariate properties of the ensemble forecasts and observations and to readily interpret the effects of different types of misspecifications on the forecast performance of the various post-processing methods. Simulation studies have been frequently applied to analyze model properties and to compare modeling approaches and verification tools in the context of statistical post-processing; see, e.g., Williams et al. (2014), Thorarinsdottir et al. (2016), Wilks (2017), and Allen et al. (2019).
The remainder is organized as follows. Univariate and multivariate post-processing methods are introduced in Sect. 2. Section 3 provides descriptions of the three simulation settings, with results discussed in Sect. 4. The pa-1 An alternative post-processing approach that allows one to preserve multivariate dependencies is the member-by-member method proposed by Van Schaeybroeck and Vannitsem (2015). Schefzik (2017) demonstrates that member-by-member post-processing can be interpreted as a specific variant of ensemble copula coupling and can thus be seen as belonging to this group of methods. per closes with a discussion in Sect. 5. Technical details on specific probability distributions and multivariate evaluation methods are deferred to the Appendix. Additional results are available in the Supplement. R (R Core Team, 2019) code with replication material and implementations of all methods is available from https://github.com/slerch/multiv_pp (last access: 10 June 2020).

Post-processing of ensemble forecasts
We focus on multivariate ensemble post-processing approaches which are based on a combination of univariate post-processing models with copulas. The general two-step strategy of these methods is to first apply univariate postprocessing to the ensemble forecasts for each margin (i.e., weather variable, location, and prediction horizon) separately. Then, in a second step, a suitably chosen copula is applied to the univariately post-processed forecasts in order to obtain the desired multivariate post-processing, taking account of dependence patterns.
A copula is a multivariate cumulative distribution function (CDF) with standard uniform univariate marginal distributions (Nelsen, 2006). The underlying theoretical background of the above procedure is given by Sklar's theorem (Sklar, 1959), which states that a multivariate CDF H (this is what we desire) can be decomposed into a copula function C modeling the dependence structures (this is what needs to be specified) and its marginal univariate CDFs F 1 , . . ., F d (this is what is obtained by the univariate post-processing) as follows: for x 1 , . . ., x d ∈ R. In the approaches considered here, the copula C is chosen to be either the non-parametric empirical copula induced by a pre-specified dependence template (in the ensemble copula coupling method and variants thereof as well as in the Schaake shuffle) or the parametric Gaussian copula (in the Gaussian copula approach). A Gaussian copula is a particularly convenient parametric model, as apart from the marginal distributions it only requires estimation of the correlation matrix of the multivariate distribution. Under a Gaussian copula the multivariate CDF H takes the form with d ( · | ) denoting the CDF of a d-dimensional normal distribution with mean zero and correlation matrix and −1 denoting the quantile function of the univariate standard normal distribution.
To describe the considered methods in more detail in what follows, let X 1 , . . ., X m ∈ R d denote unprocessed ensemble forecasts from m members, where X i := (X (1) i , . . ., X (d) i ) for i = 1, . . ., m, and let y := (y (1) , . . ., y (d) ) ∈ R d be the corresponding verifying observation. We will use l = 1, . . ., d to denote a multi-index that may summarize a fixed weather variable, location, and prediction horizon in practical applications to real-world datasets.

2.1
Step 1: univariate post-processing In a first step, univariate post-processing methods are applied to each margin l = 1, . . ., d separately. Prominent state-of-the-art univariate post-processing approaches include Bayesian model averaging (Raftery et al., 2005) and ensemble model output statistics (EMOS; . In the EMOS approach, which is employed throughout this paper, a non-homogeneous distributional regression model θ is a suitably chosen parametric distribution with parameters θ (l) := g(X (l) 1 , . . ., X (l) m ) that depend on the unprocessed ensemble forecast through a link function g(·).
The choice of F (l) θ is in practice mainly determined by the weather variable being considered in the margin l. For instance, when F (l) θ can be assumed to be Gaussian with mean µ and variance σ 2 , such as for temperature or pressure, one may set if the ensemble members are exchangeable, with X and S 2 denoting the empirical mean and variance of the ensemble predictions X (l) 1 , . . ., X (l) m , respectively. The coefficients a 0 , a 1 , b 0 and b 1 are then derived via suitable estimation techniques using training data consisting of past ensemble forecasts and observations .

2.2
Step 2: incorporating dependence structures using copulas to obtain multivariate post-processing When applying univariate post-processing for each margin separately, multivariate (i.e., inter-variable, spatial, and/or temporal) dependencies across the margins are lost. These dependencies are restored in a second step. Here, we consider five different approaches to do so. An overview of selected key features is provided in Table 1. For further discussion of advantages and shortcomings, as well as comparisons of subsets of these methods, see, e.g., Schefzik et al. (2013);Wilks (2015). In the following we use z to denote univariate quantities in the individual dimensions. Z in bold print is used to represent vector-valued quantities, and Z in normal print is used for components thereof.

Assumption of independence (EMOS-Q)
Instead of modeling the desired dependencies in any way, omitting the second step corresponds to assuming indepen-dence across the margins. To that end, a univariate samplê x (l) 1 , . . .,x (l) m is generated in each margin by drawing from the post-processed forecast distribution F (l) θ , l = 1, . . ., d. The univariate samples are then simply combined into a corresponding vector. Following Schefzik et al. (2013), we use equidistant quantiles of F (l) θ at levels 1 m+1 , . . ., m m+1 to generate the sample and denote this approach by EMOS-Q.

A samplex
m to simplify notation, of the same size m as the unprocessed ensemble is drawn from each post-processed predictive marginal distribution F (l) θ , l = 1, . . ., d.

2.
The sampled values are rearranged in the rank order structure of the raw ensemble; i.e., the permutation σ l of the set {1, . . ., m} defined by σ l (i) = rank(X (l) i ), with possible ties resolved at random, is applied to the postprocessed sample from the first step in order to obtain the final ECC ensembleX Depending on the specific sampling procedure in Step 1, we here distinguish the following different ECC variants.
-ECC-Q. The sample is constructed using equidistant quantiles of F (l) θ at levels 1 m+1 , . . ., m m+1 : -ECC-S (Hu et al., 2016). First, random numbers Besides the above sampling schemes, Schefzik et al. (2013) propose an alternative transformation approach referred to as ECC-T. This variant is in particular appealing for theoretical considerations, as it provides a link between the ECC notion and member-by-member post-processing approaches (Schefzik, 2017). However, as it may involve additional modeling steps, ECC-T is not as generic as the other schemes and thus not explicitly considered here.

A transformation based on an estimate of the error au-
tocorrelationˆ e is applied to the bias-corrected postprocessed forecast in order to obtain correction terms c 1 , . . ., c m . Precisely, c i := (ˆ e ) 1 2 · (X i − X i ) for i = 1, . . ., m.
4. ECC-Q is applied again, but now performing the reordering with respect to the rank order structure of the adjusted ensemble from Step 3 used as a modified dependence template.

Schaake shuffle (SSh)
The Schaake shuffle (SSh) proceeds like ECC-Q, but reorders the sampled values in the rank order structure of m past observations (Clark et al., 2004) and not with respect to the unprocessed ensemble forecasts. For a better comparison with (d)ECC, the size of SSh ensemble is restricted to equal that of the unprocessed ensemble here. However, in principle, the SSh ensemble may have an arbitrary size, provided that sufficiently many past observations are available to build the dependence template. Extensions of SSh that select past observations based on similarity are available (Schefzik, 2016;Scheuerer et al., 2017) but not explicitly considered here as their implementation is not straightforward and may involve additional modeling choices specific to the situation at hand. The reordering-based methods considered thus far can be interpreted as non-parametric, empirical copula approaches.
In particular, in the setting of Sklar's theorem, C is taken to be the empirical copula induced by the corresponding dependence template, i.e., the unprocessed ensemble forecasts in the case of ECC, the adjusted ensemble in the case of dECC, and the past observations in the case of SSh.

Gaussian copula approach (GCA)
By contrast, in the Gaussian copula approach (GCA) proposed by Pinson and Girard (2012) and Möller et al. (2013), the copula C is taken to be the parametric Gaussian copula. GCA can be traced back to similar ideas from earlier work in spatial statistics (e.g., Berrocal et al., 2008) and proceeds as follows. 4. Final GCA post-processed ensemble forecast X * 1 , . . ., X * m , with X * i := (X * (d) i ) for i = 1, . . ., m, is obtained via for i = 1, . . ., m and l = 1, . . ., d, with denoting the CDF of the univariate standard normal distribution. While the size of the resulting ensemble may in principle be arbitrary, it is here set to the size m of the raw ensemble.

Simulation settings
We consider several simulation settings to highlight different aspects and provide a broad comparison of the effects of potential misspecifications of the ensemble predictions on the performance of the various multivariate post-processing methods. The general setup of all simulation settings is as follows.
An initial training set of pairs of simulated ensemble forecasts and observations of size n init is generated. Postprocessed forecasts are then computed and evaluated over a test set of size n test . Therefore, n := n init + n test iterations are performed in total for all simulation settings. In the following, we set m = 50, n init = 500, n test = 1000 throughout.
To describe the individual settings in more detail, we here begin by first identifying the general structure of the steps that are performed in all settings. For each iteration t in both training and test sets (i.e., t = 1, . . ., n), multivariate forecasts and observations are generated.
For all iterations t in the test set (i.e., t = n init + 1, . . ., n), the following steps are carried out.
(S4) Compute univariate and multivariate measures of forecast performance on the test set.
Unless indicated otherwise, all simulation draws are independent across iterations. To simplify notation, we will thus typically omit the simulation iteration index t in the following.
To quantify simulation uncertainty, the above procedure is repeated 100 times for each tuning parameter combination in each setting. In the interest of brevity, we omit ECC-R, which did show substantially worse results in initial tests (see also Schefzik et al., 2013). In the following, the individual simulation settings are described in detail, and specific implementation choices are discussed.

Setting 1: multivariate Gaussian distribution
As a starting point we first consider a simulation model where observations and ensemble forecasts are drawn from multivariate Gaussian distributions. 3 This setting may for example apply in the case of temperature forecasts at multiple locations considered simultaneously. The simplicity of this model allows one to readily interpret misspecifications in the mean, variance, and covariance structures.
(S1) For iterations t = 1, . . ., n, independent and identically distributed samples of observations and ensemble forecasts are generated as follows. -Observation: The parameters and σ introduce a bias and a misspecified variance in the marginal distributions of the ensemble forecasts. These systematic errors are kept constant across dimensions 1, . . ., d. The parameters ρ 0 and ρ control the autoregressive structure of the correlation matrix of the observations and ensemble forecasts. Setting ρ 0 = ρ introduces misspecifications of the correlation structure of the ensemble forecasts.
(S2) As described in Sect. 2.1, univariate post-processing is applied independently in each dimension 1, . . ., d. Here, we employ the standard Gaussian EMOS model (2) proposed by . The EMOS coefficients a 0 , a 1 , b 0 , b 1 are estimated by minimizing the mean continuous ranked probability score (CRPS; see Appendix B) over the training set consisting of the n init initial iterations and are then used to produce out-ofsample forecasts for the n test iterations in the test set.
(S3) Next, the multivariate post-processing methods described in Sect. 2.2 are applied. Implementation details for the individual methods are as follows.
-For dECC, the estimate of the error autocorrelation e is obtained from the n init initial training iterations to compute the required correction terms for the test set.
-To obtain the dependence template for SSh, m past observations are randomly selected from all iterations preceding the current iteration t.
S. Lerch et al.: Comparison of multivariate post-processing methods -The correlation matrix required for GCA is estimated by the empirical correlation matrix based on all iterations preceding the current iteration t.
-The verification results for all methods that require random sampling (ECC-S, SSh, GCA) are averaged over 10 independent repetitions for each iteration t = n init + 1, . . ., n in the test set.
The multivariate Gaussian setting is implemented for d = 5 and all combinations of ∈ {0, 1, 3}, σ 2 ∈ {0.5, 1, 2, 5}, and ρ, ρ 0 ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. As indicated above, the simulation experiment is repeated 100 times for each of the 300 parameter combinations. If the setting from above is interpreted as a multivariate model for temperatures at multiple locations, observations from the extant literature on post-processing suggest that, typically, values of σ < 1 and ρ > ρ 0 would be expected in real-world datasets.
A variant of Setting 1 based on a multivariate truncated Gaussian distribution has also been investigated. Apart from a slightly worse performance of GCA, the results are similar to those of Setting 1. We thus refer to Sect. S5 of the Supplement, where details on the simulation setting and results are provided.

Setting 2: multivariate censored extreme value distribution
To investigate alternative marginal distributions employed in post-processing applications, we further consider a simulation setting based on a censored version of the generalized extreme value (GEV) distribution. The GEV distribution was introduced by Jenkinson (1955) among others, combining three different types of extreme value distributions. It has been widely used for modeling extremal climatological events such as flood peaks (e.g., Morrison and Smith, 2002) or extreme precipitation (e.g., Feng et al., 2007). In the context of post-processing, GEV distributions have for example been applied for modeling wind speed in Lerch and Thorarinsdottir (2013). Here, we consider multivariate observations and forecasts with marginal distributions given by a left-censored version of the GEV distribution which was proposed by Scheuerer (2014) in the context of post-processing ensemble forecasts of precipitation amounts.
(S1) For iterations t = 1, . . ., n samples of observations and ensemble forecasts are generated as follows. For l = 1, . . ., d, the marginal distributions are GEV distributions left-censored at 0, where the distribution parameters µ (location), σ (scale), and ξ (shape) are identical across dimensions l = 1, . . ., d. Details on the left-censored GEV distribution are provided in Appendix A. Misspecifications of the marginal ensemble predictions are obtained by choosing different GEV parameters for observations (µ 0 , σ 0 , ξ 0 ) and forecasts (µ, σ, ξ ). Combined misspecifications of the three parameters result in more complex deviations of mean and variance (on the univariate level) compared to Setting 1. Typically there is a joint influence of the GEV parameters on mean and dispersion properties of the distribution. In order to exploit the complex behavior a variety of parameter combinations for observations and ensemble forecasts were considered.
To generate multivariate observations y = (y (1) , . . ., y (d) ) and ensemble predictions . ., m, the so-called NORTA (normal to anything) approach is chosen; see Cario and Nelson (1997) and Chen (2001). This method allows one to generate realizations of a random vector z = (z (1) , . . ., z (d) ), with specified marginal distribution functions F (l) θ , l = 1, . . ., d, and a given correlation matrix R = (Corr(z (k) , z (l) )) d k,l=1 . The NORTA procedure consists of three steps. In a first step a vector θ . The correlation matrix R * is chosen in a such a way that the z (l) have the desired target correlation matrix R. Naturally, the specification of R * is the most involved part of this procedure. Here, we use the retrospective approximation algorithm implemented in the R package NORTARA (Su, 2014). The NORTARA package infrequently produced error and warnings, which were not present for alternative starting values of the random number generator. Following the previous simulation settings the target correlation matrix R is chosen as R i,j = ρ |i−j | for −1 < ρ < 1 and i, j = 1, . . ., d.
(S2) To separately post-process the univariate ensemble forecasts, we employ the EMOS method for quantitative precipitation based on the left-censored GEV distribution proposed by Scheuerer (2014). To that end we assume −0.278 < ξ < 0.5, such that the mean ν and the variance of the non-censored GEV distribution exist, and where denotes the gamma function and γ is the Euler-Mascheroni constant. See Appendix A for comments on mean and variance of the left-censored GEV. Following Scheuerer (2014), the parameters (ν, σ, ξ ) are Here, X (l) and X The shape parameter ξ is not linked to the ensemble predictions, but is estimated along with the EMOS coefficients a 0 , a 1 , a 2 , b 0 , and b 1 . As in Scheuerer (2014), the link function refers to the parameter ν instead of µ, since it is argued that for fixed ν an increase in σ can be interpreted more naturally as an increase in uncertainty. An implementation in R is available in the ensembleMOS package (Yuen et al., 2018). For our simulation, this package was not directly invoked, but the respective functions were used as a template. As described in Sect. 2.1, univariate post-processing is applied independently in each dimension l = 1, . . ., d. The EMOS coefficients are estimated as described above over the training set consisting of the n init initial iterations and are then used to produce out-of-sample forecasts for the n test iterations in the test set.
(S3) Identical to (S3) of Setting 1, except for GCA, where we proceed differently to account for the point mass at zero. The latent standard Gaussian observationsỹ The multivariate censored extreme value setting is implemented for d = 4 and four different scenarios summarized in Table 2. The choice of dimension is motivated by the fact that preliminary analyses had revealed a heavy increase in computation time and numerical problems for values of d greater than 4. In each scenario the GEV 0 distribution parameters for the observations are chosen according to (µ 0 , ξ 0 , σ 0 ), while the parameters for the ensemble predictions are chosen according to (µ, ξ, σ ). In both cases, the correlation matrix R from above is invoked with different choices of ρ 0 and ρ from the set {0.25, 0.5, 0.75}, giving a total of 4 × 9 = 36 scenarios. Note that according to Scheuerer (2014) there is a positive probability for zero to occur when either ξ ≤ 0 or ξ > 0 and µ < σ/ξ . The scenarios from Table 2 are chosen in such a way that either one of these two conditions is met.
The scenarios from Table 2 were not chosen to mimic reallife situations in the first place, but to emulate pronounced differences in distributions and account for a variety of misspecification types. In future research a more detailed and data-based study of the properties of the GEV 0 in ensemble post-processing of precipitation is planned, which might give further insight into the correspondence (and interplay) of the GEV 0 parameters to typically occurring situations for precipitation.

Setting 3: multivariate Gaussian distribution with changes over time
In the preceding simulation settings, the misspecifications of the ensemble forecasts were kept constant over the iterations t = 1, . . ., n within the simulation experiments. However, forecast errors of real-world ensemble predictions often exhibit systematic changes over time, for example due to seasonal effects or differences in flow-dependent predictability due to variations of large-scale atmospheric conditions. Here, we modify the multivariate Gaussian simulation setting from Sect. 3.1 to introduce changes in the mean, variance, and covariance structure of the multivariate distributions of observations and ensemble forecasts. In analogy to practical applications of multivariate post-processing, the ensemble predictions and observations may be interpreted as multivariate in terms of location or prediction horizon, with changes in the misspecification properties over time.
(S1) For iterations t = 1, . . ., n, independent samples of observations and ensemble forecasts are generated as follows. -Observation: To obtain the corre- i, j = 1, . . ., d and S 0 = RR T . The covariance matrix S 0 is scaled into the corresponding correlation matrix 0 using the R function cov2cor().
In contrast to Setting 1, the misspecifications in the mean and correlation structure now include a periodic component. The above setup will be denoted by Setting 3A.
Following a suggestion from an anonymous reviewer, we further consider a variant which we refer to as Setting 3B. For iterations t = 1, . . ., n we generate independent samples of observations and ensemble forecasts as follows.
-Ensemble forecasts: with a from above. The correlations for the ensemble member forecasts thus oscillate between ρ and ρ · (1 − a).
Settings 3A and 3B differ in the variations of the mean and covariance structure over time. For both, we proceed as follows.
(S2) As in Setting 1, we employ the standard Gaussian EMOS model (2). However, to account for the changes over iterations we now utilize a rolling window consisting of pairs of ensemble forecasts and observations from the 100 iterations preceding t as a training set to obtain estimates of the EMOS coefficients. See Lang et al. (2020) for a detailed discussion of alternative approaches to incorporate time dependence in the estimation of post-processing models.
(S3) The application of the multivariate post-processing methods is identical to the approach taken in Setting 1. Note that we deliberately follow the naive standard implementations (see Sect. 2.2) here to highlight some potential issues of the Schaake shuffle in this context.

Results
In the following, we focus on comparisons of the relative predictive performance of the different multivariate postprocessing methods and apply proper scoring rules for forecast evaluation. In particular, we use the energy score (ES;  and variogram score of order 1 (VS; Scheuerer and Hamill, 2015) to evaluate multivariate forecast performance. Diebold-Mariano (DM;Diebold and Mariano, 1995) tests are applied to assess the statistical significance of the score differences between models. Details on forecast evaluation based on proper scoring rules and DM tests are provided in Appendix B. Note that proper scoring rules are often used in the form of skill scores to investigate relative improvements in predictive performance in the meteorological literature. Here, we instead follow suggestions of Ziel and Berk (2019), who argue that the use of DM tests is of crucial importance to appropriately discriminate between multivariate models.
While our focus here is on multivariate performance, we briefly demonstrate that the univariate post-processing models applied in the different simulation settings usually work as intended.

Univariate performance
The univariate predictive performance of the raw ensemble forecasts in terms of the CRPS is improved by the application of univariate post-processing methods across all parameter choices in all simulation settings. The magnitude of the relative improvements by post-processing depends on the chosen simulation parameters: exemplary results are shown in Fig. 1. The results for Setting 2 are omitted as they vary more and strongly depend on the simulation parameters.
ECC-Q does not change the marginal distributions; the univariate forecasts are thus identical to solely applying univariate post-processing methods in the margins separately, without accounting for dependencies. We will later refer to this as EMOS-Q. Note that for ECC-S and SSh differences in the univariate forecast distributions compared to those of ECC-Q may arise from randomly sampling the quantile levels in ECC-S and from random fluctuations due to the 10 random repetitions that were performed to account for the simulation uncertainty of those methods. However, we found the effects on the univariate results to be negligible and omit ECC-S, dECC, and SSh from Fig. 1. For the simulation parameter values summarized there, univariate post-processing works as intended with statistically significant improvements over the raw ensemble forecasts. Note that for GCA the univariate marginal distributions are modified due to the transformation step in Eq. (4). While the quantile forecasts of ECC-Q are close to optimal in terms of the CRPS (Bröcker, 2012), (randomly sampled) univariate GCA forecasts do not possess this property, resulting in worse univariate performance compared to all other methods.

Multivariate performance
We now compare the multivariate performance of the different post-processing approaches presented in Sect. 2.2. Multivariate forecasts obtained by only applying the univariate post-processing methods without accounting for dependencies (denoted by EMOS-Q) as well as the raw ensemble predictions (ENS) are usually significantly worse and will be omitted in most comparisons below unless indicated otherwise. Additional figures with results for all parameter combinations in all settings are provided in the Supplement.

Setting 1: multivariate Gaussian distribution
The tuning parameter governing the bias in the mean vector of the ensemble forecasts only has very limited effects on the relative performance of the multivariate post-processing methods. To retain focus we restrict our attention to = 1. Figure 2 shows results in terms of the ES for two different choices of σ , using multivariate forecasts of ECC-Q as the reference method. For visual clarity, we omit parameter combinations where either ρ ∈ {0.1, 0.9} or ρ 0 ∈ {0.1, 0.9}. Corresponding results are available in the Supplement. Note that the relative forecast performance of all approaches except for dECC generally does not depend on σ . We thus proceed to discuss the remaining approaches first and dECC last.
If the correlation structure of the unprocessed ensemble forecasts is correctly specified (i.e., ρ = ρ 0 ), no significant differences can be detected between ECC-Q, ECC-S, and SSh. In contrast, GCA (and dECC for larger values of σ ) performs substantially worse. The worse performance of GCA might be due to the larger forecast errors in the univariate margins; see Sect. 4.1.
In the cases with misspecifications in the correlation structure (i.e., ρ = ρ 0 ), larger differences can be detected among all the methods. Notably, SSh never performs substantially worse than ECC-Q and is always among the best performing approaches. This is not surprising as the only drawback of SSh in the present context and under the chosen implementation details is the underlying assumption of time invariance of the correlation structure, which will be revisited in Setting 3. The larger the absolute difference between ρ and ρ 0 , the greater the improvement of SSh relative to ECC-Q. This is due to the fact that it becomes more and more beneficial to learn the dependence template from past observations rather than the raw ensemble, the less information the ensemble provides about the true dependence structure. GCA also tends to outperform ECC-Q if the differences between ρ and ρ 0 are large; however, GCA always performs worse than SSh and shows significantly worse performance than ECC-Q if the misspecifications in the ensemble are not too large (i.e., if ρ and ρ 0 are equal or similar).
The relative performance of ECC-S depends on the ordering of ρ and ρ 0 . If ρ > ρ 0 , ECC-S significantly outperforms ECC-Q; however, if ρ < ρ 0 significant ES differences in favor of ECC-Q can be detected. For dECC, the performance further depends on the misspecification of the variance structure in the marginal distributions. If ρ > ρ 0 , the  DM test statistic values move from positive (improvement over ECC-Q) to negative (deterioration compared to ECC-Q) values for increasing σ . By contrast, if ρ < ρ 0 the values of the test statistic instead change from negative to positive for increasing σ . The differences are mostly statistically significant and indicate the largest relative improvements among all methods in cases of the largest possible differences between ρ and ρ 0 . However, note that for some of those parameter combinations with small ρ and large ρ 0 , even EMOS-Q can outperform ECC-Q and ECC-S. In these situations, the raw ensemble forecasts contain very little information about the dependence structure, and the ES can be improved by assuming independence instead of learning the dependence template from the ensemble.
Results in terms of the VS are shown in Fig. 3. Most of the conclusions from the results in terms of the ES extend directly to comparisons based on the VS. SSh consistently remains among the best performing methods and provides significant improvements over ECC-Q unless ρ = ρ 0 ; however, alternative approaches here outperform SSh more often. Notably, the relative performance of GCA is consistently better in terms of the VS than in terms of the ES. For example, the differences between GCA and SSh appear to generally be negligible, and GCA does not perform worse than ECC-Q for any of the simulation parameter combinations. These differences between the results for GCA in terms of ES and VS may be explained by the greater sensitivity of the VS to misspecifications in the correlation structure, whereas the ES shows a stronger dependence on the mean vector.
For ECC-S and dECC, the general dependence on values of ρ, ρ 0 , and σ (for dECC) is similar to the results for the ES, but the magnitude of both positive as well as negative differences to all other methods is increased. For example, it is now possible to find parameter combinations where either ECC-S or dECC (or both) substantially outperform both GCA and SSh.

The role of ensemble size m
To assess the effect of the ensemble size m on the results, additional simulations have been performed with the simulation parameters from Fig. 2 but with ensemble sizes between 5 and 100. Corresponding figures are provided in the Supplement. Overall, the relative ranking of the different methods is only very rarely affected by changes in the ensemble size. The relative differences in terms of the ES between ECC-Q and ECC-S and between ECC-Q and GCA become increasingly negligible with increasing ensemble size. Further, SSh shows improved predictive performance for larger numbers of ensemble members for ρ 0 < ρ in the case of the ES and for ρ 0 > ρ in the case of the VS. The relative performance of dECC is strongly affected by changes in m for large misspecifications in the correlation parameters. A positive effect of larger numbers of members relative to ECC-Q in terms of both scoring rules can be detected for ρ 0 > ρ when σ < 1 and for ρ 0 < ρ when σ > 1. In both cases, the corresponding effects are negative if the misspecification in σ is reversed.

The role of dimension d
Additional simulations were further performed with dimensions d between 2 and 50 and the simulation parameters from above. In the interest of brevity, we refer to the Supplement for corresponding figures. In terms of the ES, the results for ECC-S are largely not affected by changes in dimension, whereas the relative performance of ECC-S improves with increasing d, and minor improvements over ECC-Q can be detected even for correctly specified correlation parameters for high dimensions. For GCA, a marked deterioration of relative skill can be observed in terms of the ES, which can likely be attributed to the sampling effects discussed above. In terms of the VS, GCA partly shows the best relative performance among all methods for dimensions between 10 and 20 and performs worse in higher dimensions. The relative differences in predictive performance in favor of SSh are more pronounced in larger dimensions, in particular in cases with large misspecification of the correlation parameters. Changes in the relative performance of dECC in terms of both scoring rules for increasing numbers of dimensions are similar to those observed for increasing numbers of ensemble members.

Setting 2: multivariate censored GEV distributions
The four considered scenarios in Table 2 constitute different types of deviation of the ensemble from the observation properties. Results for Scenario B are given below in Fig. 4, while the corresponding figures for Scenarios A, C, and D can be found in Sect. S2.1. As the GEV 0 distribution yields extreme outliers much more frequently than the Gaussian distribution in Setting 1, all figures (here and in the Supplement) show only those values that are within the 1.5× interquartile range, so that the overall comparison of the boxplots does not suffer from single extreme outliers.
-In Scenario B the location is correctly specified, but scale and shape are misspecified such that ensemble forecasts have both larger scale and shape, resulting in a heavier right tail and slightly higher point masses at zero. This scenario is taken as a reference among the four considered ones and shown in Fig. 4. Additional figures with results for the remaining scenarios are provided in the Supplement. Multivariate post-processing improves considerably upon the raw ensemble. ECC-Q is outperformed by SSh and GCA only when the absolute difference between ρ 0 and ρ becomes larger. As before, this is likely caused by the use of past observations to determine the dependence template by GCA and SSh, which proves beneficial in comparison to ECC-Q in cases of a highly incorrect correlation structure in the ensemble. For correctly specified correlations (panels on the main diagonal in Fig. 4), the relative performance of the methods does not depend on the actual value of correlation.
-In Scenario A the observation location parameter is shifted from 0 to a positive value for the ensemble, the observation scale is larger, and the shape is smaller than in the ensemble. Therefore, the ensemble forecasts come from a distribution with smaller spread than the observations, which is also centered away from 0 and has lower point mass at 0. In comparison to Scenario B there are more outliers, especially for ECC-S. In the case of correctly specified correlations, the performance of the methods also does not depend on the actual value of correlation as in Scenario B. Notably, EMOS-Q here performs mostly similar to the ensemble, while in the other three scenarios it typically performs worse than the ensemble if ρ > ρ 0 .
-In Scenario C the observation location is larger, the scale smaller, and the shape larger than in the ensemble distribution. This results in an observation distribution with a much heavier right tail and a much larger point mass at 0 compared to the ensemble distribution.
Here, post-processing models frequently offer no or only slight improvements over the raw ensemble. While ECC-Q does not always outperform the raw ensemble forecasts, SSh still shows improved forecast performance. As in the other scenarios, in the case of correctly specified correlations, the performance of the methods does not depend on the actual value of correlation.
-In Scenario D all univariate distribution parameters are correctly specified. Therefore, the main differences in performance are imposed by the different misspecifications of the correlation structure. The main difference compared to the other scenarios is given by the markedly worse effects of not accounting for multivariate dependencies during post-processing (EMOS-Q).
In general, the methods perform differently across the four scenarios, but for most situations multivariate postprocessing improves upon univariate post-processing without accounting for dependencies. Furthermore, SSh reveals a good performance in all four scenarios when ρ 0 differs considerably from ρ. The performance of SSh has a tendency to improve further when the observation correlation is larger than the ensemble correlation. Within each of the four scenarios, the performance of the methods is nearly identical in cases where the correlation is correctly specified. In other words, as long as the ensemble forecasts correctly represent the correlation of the observations, the actual value of the correlation does not have an impact on the performance of a multivariate post-processing method. Above-described observations can be found in terms of both the ES and the VS.
In addition to the scenarios from Table 2, further scenario variations were considered for ρ 0 = 0.75 and ρ = 0.25, that is, for the case where ensemble correlation is too low compared to the observations. Figure 5 shows the situation where the observation location parameter is larger, the scale smaller, but the shape also smaller than in the ensemble forecasts. This contrasts with the situation in Scenario C. While in C the observations were heavier tailed with higher point mass at 0, here it is the other way round (the ensemble distribution is heavier tailed with higher point mass at 0). In accordance with Scenarios A, B, and C (where there are parameter misspecifications in the ensemble compared to the observations), EMOS-Q performs better than the raw ensemble and also better than dECC (as in B and C), while SSh and GCA perform best. However, in contrast to results in terms of the ES, GCA exhibits an even better performance compared to the other models in terms of the VS. This indicates that the VS is better able to account for the correctly specified (or by post-processing improved) correlation structure than the ES.

The role of ensemble size m
To assess the effect of the ensemble size m, additional simulations have been performed for each of the four scenarios in Table 2 with ensemble sizes between 5 and 100. Corresponding comparative figures comparing ensemble sizes m = 5, 20, 50, and 100 for Scenarios A, B, C, and D are provided in Sect. S2.2. Overall, the size of the ensemble only has a minor effect on the relative performance of the multivariate methods apart from GCA, which strongly benefits from an increasing number of members across all four scenarios, specifically with regard to ES. This improvement is likely due to the sampling issues discussed above and is less pronounced in terms of the VS. As in Setting 1 the relative differences between ECC-Q and ECC-S in terms of the ES become increasingly negligible with increasing ensemble size in all considered scenarios (especially for ρ 0 = ρ). This phenomenon is also less pronounced for the VS. In contrast to the methods using the dependence information, the performance of EMOS-Q (not accounting for dependence) compared to ECC-Q becomes increasingly worse for an increasing number of members when measured by ES. For VS, the influence of the number of members on EMOS-Q is only small. Interestingly, the difference in performance of the raw ensemble for an increasing number of members is negligible in case the misspecification is only minor and ES is considered. In case there is no misspecification (Scenario D), the raw ensemble can slightly benefit from an increasing number of members. Similarly to the effect for ECC-Q, when measuring performance with VS, the effect on the raw ensemble is negligible. Further, it can be observed that the difference of the results between varying numbers of members is smallest for ρ 0 = ρ.

Setting 3: multivariate Gaussian distribution with
changes over iterations Figure 6 shows results in terms of ES and VS for Setting 3A. We again only show results for ρ, ρ 0 ∈ {0.25, 0.5, 0.75} and refer to the Supplement for further results. The most notable differences compared to Setting 1 are that the different ECC variants here significantly outperform GCA and SSh not only for ensemble forecasts with correctly specified correlation structure, but also for small deviations of ρ from ρ 0 . Significant ES differences in favor of SSh are only obtained for large absolute differences of ρ and ρ 0 . Similar observations hold for GCA which, however, generally exhibits worse performance compared to SSh. The ES differences among the ECC variants are only minor and usually not statistically significant. Similar conclusions apply for the VS; however, GCA generally performs better than SSh, and ECC-S provides significantly worse forecasts compared to the other ECC variants for ρ < ρ 0 .
Results for Setting 3B are shown in Fig. 7. Note that the columns here show different values of ρ and the row refers to a specific value of a. Similar to Setting 3A, we observe that in terms of the ES, dECC and ECC-S do not show significant differences in performance compared to ECC-Q, whereas GCA and SSh here perform worse for all parameter combinations. In terms of the VS, GCA now also performs worse than ECC-Q for all correlation parameters, whereas significantly negative and positive differences of ECC-S and dECC compared to ECC-Q can be detected for ρ < ρ 0 and ρ > ρ 0 , respectively. Additional results for varying values of σ and a and sets of low or medium correlations are provided in the Supplement. The results generally do not depend on the choice of σ . Results for low and medium correlation parameter values are characterized by less substantial differences between the methods. In particular, it is only rarely possible to detect significant differences when comparing ECC-Q and SSh, and GCA only performs significantly worse in terms of the ES. Further, there exist more parameter combinations with improvements by ECC-S and dECC. However, note that due to the setup of Setting 3B, the variations over time in both observations and ensemble predictions will be much smaller than for high correlation parameter values. Within a fixed set of correlation parameters, the relative differences between the methods become more pronounced with increasing values of a.
Note that the main focus in both variants of Setting 3 was to demonstrate that in (potentially more realistic) settings with changes over time, naive implementations of the Schaake shuffle can perform worse than ECC variants. However, similarity-based implementations of the Schaake shuffle (Schefzik, 2016;Scheuerer et al., 2017) are available and may be able to alleviate this issue.

Discussion and conclusion
State-of-the-art methods for multivariate ensemble postprocessing were compared in simulation settings which aimed to mimic different situations and challenges occurring in practical applications. Across all settings, the Schaake shuffle constitutes a powerful benchmark method that proves difficult to outperform, except for naive implementations in the presence of structural change (for example, time-varying correlation structures considered in Setting 3). By contrast to SSh, the Gaussian copula approach typically only provides improvements over variants of ensemble copula coupling if the parametric assumption of a Gaussian copula is satisfied or if forecast performance is evaluated with the variogram score. Results in terms of the CRPS further highlight an additional potential disadvantage in that the univariate forecast errors are larger compared to the competitors.  Not surprisingly, variants of ensemble copula coupling typically perform the better the more informative the ensemble forecasts are about the true multivariate dependence structure. A particular advantage compared to standard implementations of SSh and GCA illustrated in Setting 3 may be given by the ability to account for flow-dependent differences in the multivariate dependence structure if those are (at least approximately) present in the ensemble predictions, but not in a randomly selected subset of past observations.
There is no consistently best method across all simulation settings and potential misspecifications among the different ECC variants investigated here (ECC-Q, ECC-S, and dECC). ECC-Q provides a reasonable benchmark model and will rarely yield the worst forecasts among all ECC variants. Significant improvements over ECC-Q may be obtained by ECC-S and dECC in specific situations, including specific combinations of ensemble size and dimension. For example, dECC sometimes works well for underdispersive ensembles where the correlation is too low, whereas ECC-S may work better if the ensemble is underdispersive and the correlation is too strong. However, the results will strongly depend on the exact misspecification of the variance-covariance structure of the ensemble as well as the performance measure chosen for multivariate evaluation.
In light of the presented results it seems to be generally advisable to first test the Schaake shuffle along with ECC-Q. If structural assumptions about specific misspecifications of the ensemble predictions seem appropriate, extensions by other variants of ECC or GCA might provide improvements. However, it should be noted that the results for real-world ensemble prediction systems may be influenced by many additional factors and may differ when considering station-based or grid-based post-processing methods. The computational costs of all presented methods are not only negligible in comparison to the generation of the raw ensemble forecasts, but also compared to the univariate post-processing as no numerical optimization is required. It may thus be generally advisable to compare multiple multivariate post-processing methods for the specific dataset and application at hand.

S. Lerch et al.: Comparison of multivariate post-processing methods
The simulation settings considered here provide several avenues for further generalization and analysis. For example, a comparison of forecast quality in terms of multivariate calibration (Thorarinsdottir et al., 2016;Wilks, 2017) is left for future work. Further, the autoregressive structure of the correlations across dimensions may be extended towards more complex correlation functions; see, e.g., Thorarinsdottir et al. (2016, Sect. 4.2). While we only considered multivariate methods based on a two-step procedure combining univariate post-processing and dependence modeling via copulas, an extension of the comparison to parametric approaches along the lines of Feldmann et al. (2015) and Baran and Möller (2015) presents another starting point for future work. Note that within the specific choices for Setting 1, the spatial EMOS approach of Feldmann et al. (2015) can be seen as a special case of GCA.
We have limited our investigation to simulation studies only as those settings allow one to readily assess the effects of different types of misspecifications of the various multivariate properties of ensemble forecasts and observations and may thus help to guide implementations of multivariate postprocessing. Further, they are able to provide a more complete picture of the effects of different types of misspecifications on the performance of the different methods than those that may be observed in practical applications. Nonetheless, an important aspect for future work is to complement the comparison of multivariate post-processing methods by studies based on real-world datasets of ensemble forecasts and observations, extending existing comparisons of subsets of the methods considered here (e.g., Schefzik et al., 2013;Wilks, 2015). However, the variety of application scenarios, methods, and implementation choices likely requires large-scale efforts, ideally based on standardized benchmark datasets. A possible intermediate step might be given by the use of simulated datasets obtained via stochastic weather generators (see, e.g., Wilks and Wilby, 1999) which may provide arbitrarily large datasets with possibly more realistic properties than the simple settings considered here.
A different perspective on the results presented here concerns the evaluation of multivariate probabilistic forecasts. In recent work Ziel and Berk (2019) argue that the use of Diebold-Mariano tests is of crucial importance for appropriately assessing the discrimination ability of multivariate proper scoring rules and find that the ES might not have as bad a discrimination ability as indicated by earlier research. The simulation settings and comparisons of multivariate post-processing methods considered here may be seen as additional simulation studies for assessing the discrimination ability of multivariate proper scoring rules. In particular, the results in Sect. 4 are in line with the findings of Ziel and Berk (2019) in that the ES does not exhibit inferior discrimination ability compared to the VS. Nonetheless, the ranking of the different multivariate post-processing methods strongly depends on the proper scoring rule used for evaluation, and further research on multivariate verification is required to address open questions, improve mathematical understanding, and guide model comparisons in applied work.

Appendix A: Details on the left-censored generalized extreme value (GEV) distribution
When the GEV distribution is left-censored at zero, its cumulative distribution function can be written as This describes a three-parameter distribution family, where µ ∈ R, σ > 0, and ξ ∈ R are location, scale, and shape of the non-censored GEV distribution, respectively.

A1 Expectation and variance
Let Y be a random variable distributed according to GEV and censored at zero to the left. From the law of total expectation, where the second term in the sum is given by Here, I denotes the indicator function, g is any function of Y such that g(Y ) is a random variable, and f Y is the probability density function (PDF) of the non-censored GEV. By noting that E(Y |Y = 0) = E(Y 2 |Y = 0) = 0, expectation and variance of the left-censored GEV can be computed from the two integrals ∞ 0 yf Y (y) dy and ∞ 0 y 2 f Y (y) dy, the former existing when ξ < 1 and the latter existing when ξ < 0.5. Both integrals are not derived analytically here, but evaluated by numerical integration. In contrast to the non-censored GEV distribution, the variance of the left-censored version also depends on the parameter µ, since different choices of µ lead to different left-censored CDFs which are not merely distinguished by location. Therefore µ is a location parameter for the non-censored GEV, but not for the left-censored version.

Appendix B: Evaluating probabilistic forecasts B1 Proper scoring rules
The comparative evaluation of probabilistic forecasts is usually based on proper scoring rules. A proper scoring rule is a function S : F × → F, which assigns a numerical score S(F, y) to a pair of a forecast distribution F ∈ F and a realizing observation y ∈ . Here, F denotes a class of probability distributions supported on . The forecast distribution F may come in the form of a predictive CDF, a PDF, or a discrete sample as in the case of ensemble predictions. A scoring rule is called proper if for all F, G ∈ F and strictly proper if equality holds only if F = G. See Gneiting and Raftery (2007) for a review of proper scoring rules from a statistical perspective.
The most popular example of a univariate (i.e., ⊂ F) proper scoring rule in the environmental sciences is given by the continuous ranked probability score (CRPS), CRPS(F, y) = (F (z) − I{z ≥ y}) 2 dz.
Over the past years a growing interest in multivariate proper scoring rules has accompanied the proliferation of multivariate probabilistic forecasting methods in applications across disciplines. The definition of proper scoring rules from above straightforwardly extends towards multivariate settings (i.e., ⊂ F d ). A variety of multivariate proper scoring rules has been proposed over the past years, usually focused on cases where multivariate probabilistic forecasts are given as samples from the forecast distributions.
To introduce multivariate scoring rules, let y = (y (1) , . . ., y (d) ) ∈ ⊂ F d and let F denote a forecast distribution on F d given by m discrete samples X 1 , . . ., X m from F with X i = (X where · is the Euclidean norm on R d , and the variogram score of order p (VS p ; Scheuerer and Hamill, 2015), Here, w i,j is a non-negative weight that allows one to emphasize or down-weight pairs of component combinations, and p is the order of the variogram score. Following suggestions of Scheuerer and Hamill (2015), we considered p = 0.5 and p = 1. As none of the simulation settings indicated any substantial differences, we set p = 1 throughout and denote VS 1 (F, y) by VS(F, y). Since the generic multivariate structure of the simulation settings does not impose any meaningful structure in pairs of components, we focus on the unweighted versions of the variogram score. Several weighting schemes have been tested but did not lead to any substantially different conclusions.
We utilize implementations provided in the R package scoringRules (Jordan et al., 2019) to compute univariate and multivariate scoring rules for forecast evaluation and post-processing model estimation.

B2 Diebold-Mariano tests
Statistical tests of equal predictive performance are frequently used to assess the statistical significance of observed score differences between models. We focus on Diebold-Mariano (DM;Diebold and Mariano, 1995) tests which are widely used in the econometric literature due to their ability to account for temporal dependencies. For applications in the context of post-processing, see, e.g., Baran and Lerch (2016).
For a (univariate or multivariate) proper scoring rule S and sets of two competing probabilistic forecasts F i and G i , i = 1, . . ., n test over a test set, the test statistic of the DM test is given by where S(F, y) = 1 n test n test i=1 S(F i , y i ) and S(G, y) = 1 n test n test i=1 S(G i , y i ) denote the mean score values of F and G over the test set of size n test , respectively. In Eq. (B1), σ denotes an estimator of the asymptotic standard deviation of the sequence of score differences of F and G. Positive values of T DM n test indicate a superior performance of G, whereas negative values indicate a superior performance of F .
Under standard regularity assumptions and the null hypothesis of equal predictive performance, T DM n test asymptotically follows a standard normal distribution which allows one to assess the statistical significance of differences in predictive performance. We utilize implementations of DM tests provided in the R package forecast (Hyndman and Khandakar, 2008).
Code availability. R code with implementations of all simulation settings as well as code to reproduce the results presented here and in the Supplement are available from https://doi.org/10.5281/zenodo.3826277 .
Author contributions. All the authors jointly discussed and devised the design and setup of the simulation studies. A variant of Setting 1 was first investigated in an MSc thesis written by MG (Graeter, 2016), co-supervised by SL. SL wrote evaluation and plotting routines, implemented simulation settings 1 and 4, partially based on code and suggestions from MG and SH, and provided a simulation framework in which the variant of Setting 1 based on a multivariate truncated normal distribution (SB) and Setting 2 (AM and JG) were implemented. All the authors jointly analyzed the results and edited the manuscript, coordinated by SL.
Competing interests. Sebastian Lerch and Stephan Hemri are editors of the special issue on "Advances in post-processing and blending of deterministic and ensemble forecasts". The remaining authors declare that they have no conflict of interest.
Special issue statement. This article is part of the special issue "Advances in post-processing and blending of deterministic and ensemble forecasts". It is not associated with a conference.