Order of operation for multi-stage post-processing of ensemble wind forecast trajectories

Abstract. With numerical weather prediction ensembles unable to produce sufficiently calibrated forecasts, statistical post-processing is needed to correct deterministic and probabilistic biases. Over the past decades, a number of methods addressing this issue have been proposed, with ensemble model output statistics (EMOS) and Bayesian model averaging (BMA) among the most popular. They are able to produce skillful deterministic and probabilistic forecasts for a wide range of applications. These methods are usually applied to the newest model run as soon as it has finished, before the entire forecast trajectory is issued. RAFT (rapid adjustment of forecast trajectories), a recently proposed novel approach, aims to improve these forecasts even further, utilizing the error correlation patterns between lead times. As soon as the first forecasts are verified, we start updating the remainder of the trajectory based on the newly gathered error information. As RAFT works particularly well in conjunction with other post-processing methods like EMOS and techniques designed to reconstruct the multivariate dependency structure like ensemble copula coupling (ECC), we look to identify the optimal combination of these methods. In our study, we apply multi-stage post-processing to wind speed forecasts from the UK Met Office's convective-scale MOGREPS-UK ensemble and analyze results for short-range forecasts at a number of sites in the UK and the Republic of Ireland.

In Section 5, we illustrate the methods by means of an example forecast and present results for the selected combinations of post-processing methods. We conclude with a discussion in Section 6.

2 Data
The 10 m instantaneous wind speed forecast data used in this study was produced by the UK Met Office's limited-area ensemble MOGREPS-UK (Hagelin et al., 2017). MOGREPS-UK is based on the convection-permitting NWP model UKV, but with a lower resolution of 2.2 km. Until March 2016, the global ensemble MOGREPS-G produced both initial and boundary conditions for MOGREPS-UK; subsequently, perturbations from the global ensemble were combined with UKV analysis increments to generate the initial conditions. We use data from all model versions between January 2014 and June 2016, during which the ensemble was initialized four times a day and consisted of 12 members, one control and 11 perturbed forecasts. Here, we only look at the model run started at 15 UTC, as it was observed in Schuhen et al. (2020) that all four runs behave somewhat similarly in terms of predictability.
Forecasts are produced for every hour up to 36 hours, covering the short range.
For both estimation and evaluation, SYNOP observations from 152 sites in the British Isles are used (see Fig. 1). To match the observation locations, the forecasts were interpolated from the model grid and subjected to Met Office post-processing in order to correct for local effects and differences between the model and the location's orography. We separate our data set into two parts: the first 12 months are used for estimating the RAFT coefficients and the remaining 18 months for evaluating the post-processing techniques.
3 Post-processing methods

In this paper, several post-processing methods are used in various combinations. They all fulfill different purposes: EMOS (ensemble model output statistics) functions as a baseline for producing calibrated and sharp probabilistic forecasts, ECC (ensemble copula coupling) transfers the physical dependency structure of the ensemble to the EMOS forecasts, and RAFT (rapid adjustment of forecast trajectories) continually improves the EMOS deterministic forecasts after they have been issued, based on information that was not previously available.

3.1 Ensemble model output statistics
In a first step, all forecasts are post-processed with EMOS, sometimes also called non-homogeneous regression, in order to correct deterministic and probabilistic biases the raw ensemble might suffer from. These deficiencies are a result of the limits of ensemble forecasting in general, as, e.g., the ensemble members can only represent a small subset of the multitude of all possible or probable states of the atmosphere at any given point in time. Thorarinsdottir and Gneiting (2010) propose an application of the EMOS method for wind speed forecasts based on truncated Gaussian distributions, although they study maximum instead of instantaneous wind speed. As we will see in Sect. 5, this approach (here called gEMOS) produces nearly calibrated forecasts, but they are still slightly underdispersive. For this reason, we investigate a second variant of EMOS introduced by Scheuerer and Möller (2015), logEMOS, where the predictive distributions are truncated logistic. Due to its heavier tails, the logistic distribution can provide a better fit to the instantaneous wind speed data at hand. Further case studies including various versions of EMOS have shown that sharp and calibrated forecasts can be produced for a number of different NWP ensembles (e.g., Feldmann et al., 2015; Scheuerer and Büermann, 2014; Kann et al., 2009).
Let $X_1, \ldots, X_{12}$ denote an ensemble forecast valid at a specific time and location and $Y$ be the corresponding observed wind speed. Then we model the gEMOS forecast as a truncated Gaussian distribution with cut-off at zero, in order to account for the non-negativity of the wind speed values:

$Y \mid X_1, \ldots, X_{12} \sim \mathcal{N}^{0}\left(\mu, \sigma^2\right).$    (1)

Due to the truncation, the negative part of the distribution is cut off and a corresponding probability mass added to the positive part. This means that the parameter $\mu$ here is not the mean of the distribution, but the location parameter, and $\sigma^2$ is the scale parameter. Using the ensemble mean $\bar{X} = \frac{1}{12}\sum_{i=1}^{12} X_i$ and variance $S^2 = \frac{1}{12}\sum_{i=1}^{12}\left(X_i - \bar{X}\right)^2$ as predictors for the EMOS parameters $\mu$ and $\sigma^2$, we define the following equations:

$\mu = a + b^2 \bar{X}, \qquad \sigma^2 = c^2 + d^2 S^2.$    (2)

The coefficients $b$, $c$ and $d$ are squared in order to simplify interpretability and to make sure that the scale parameter is positive. Minimum score estimation is a versatile way to obtain parameter estimates in such a setting (Dawid et al., 2016).

The proper score we want to optimize is the continuous ranked probability score (CRPS; Matheson and Winkler, 1976), which addresses both important forecast properties, sharpness and calibration (for details see Sect. 4).
We process all locations and lead times separately, equivalent to the local EMOS approach in Thorarinsdottir and Gneiting (2010), and the training data consists of a rolling period of 40 days. In practice, this means that the training period contains forecast-observation pairs from the last 40 days preceding the start of the model run, valid at the same lead time and location.
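To make the estimation step concrete, the following minimal sketch fits the gEMOS coefficients of Eq. (2) by minimum CRPS over a training sample, using the closed-form CRPS for the truncated Gaussian quoted in Sect. 4. It is an illustration only: the helper names (crps_truncnorm, fit_gemos) and the choice of optimizer are our own assumptions, not part of the paper.

```python
# Minimal gEMOS estimation sketch (assumed helper names, not from the paper).
# Inputs: ensemble means/variances and observations from a rolling 40-day
# training window for one site and lead time, wind speed in knots.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def crps_truncnorm(mu, sigma, y):
    """Closed-form CRPS for a Gaussian truncated at zero
    (Thorarinsdottir and Gneiting, 2010), valid for y >= 0."""
    z = (y - mu) / sigma
    q = norm.cdf(mu / sigma)
    return (sigma / q**2) * (
        z * q * (2.0 * norm.cdf(z) + q - 2.0)
        + 2.0 * norm.pdf(z) * q
        - norm.cdf(np.sqrt(2.0) * mu / sigma) / np.sqrt(np.pi)
    )

def fit_gemos(ens_mean, ens_var, obs):
    """Estimate (a, b, c, d) of Eq. (2) by minimizing the mean CRPS."""
    def mean_crps(theta):
        a, b, c, d = theta
        mu = a + b**2 * ens_mean                 # location, Eq. (2)
        sigma = np.sqrt(c**2 + d**2 * ens_var)   # scale, Eq. (2)
        return np.mean(crps_truncnorm(mu, sigma, obs))
    return minimize(mean_crps, x0=np.array([0.0, 1.0, 1.0, 1.0]),
                    method="Nelder-Mead").x
```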

In the case of logEMOS, we substitute the truncated Gaussian distribution in Eq. (1) with a truncated logistic distribution, where $\mu$ is again the location parameter and $s = \sqrt{3\sigma^2}\,\pi^{-1}$ the scale. The location parameter $\mu$ and variance $\sigma^2$ are linked to the ensemble statistics in the same way as in Eq. (2). Scheuerer and Möller (2015) provide a closed form of the CRPS for a truncated logistic distribution, meaning that gEMOS and logEMOS are comparable in terms of computational cost and complexity. We found parameter estimation to be more stable when applying EMOS to wind speed in knots as compared to meters per second. The ensemble members are treated as exchangeable, in that we use the ensemble mean as predictor for the EMOS location parameter. This results in more robust parameters and faster computation.
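For later steps that require quantiles of the logEMOS distribution (e.g., drawing ECC quantiles or ensemble members), the quantile function of the logistic distribution truncated at zero follows directly from the definitions above. A small sketch, with our own naming:

```python
# Quantile function of a logistic distribution truncated at zero (sketch).
import numpy as np

def trunc_logistic_ppf(p, mu, sigma2):
    """Map probabilities p in (0, 1) to truncated-logistic quantiles."""
    s = np.sqrt(3.0 * sigma2) / np.pi        # scale, as defined in Sect. 3.1
    p0 = 1.0 / (1.0 + np.exp(mu / s))        # logistic CDF evaluated at zero
    q = p0 + p * (1.0 - p0)                  # rescale into the untruncated CDF
    return mu + s * np.log(q / (1.0 - q))    # plain logistic quantile (logit)
```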

3.2 Ensemble copula coupling
While EMOS is particularly adept at calibrating ensemble forecasts, the ensemble's rank structure is lost in the process. To restore the physical dependencies between forecasts at different lead times, we employ ensemble copula coupling (ECC; e.g., Schefzik et al., 2013). For each lead time $l$, we obtain a permutation $\tau_l$ of the numbers $1, \ldots, 12$ such that the raw ensemble members satisfy

$X_l^{(\tau_l(1))} \le X_l^{(\tau_l(2))} \le \ldots \le X_l^{(\tau_l(12))}.$

Any ties are resolved at random. Then we apply $\tau_l$ to the EMOS quantiles $\tilde{X}_l^{(1)} \le \ldots \le \tilde{X}_l^{(12)}$, so that the reordered quantiles inherit the rank structure of the raw ensemble. The new ensemble has the same univariate properties as the original EMOS quantiles, as only the order of the ensemble members has changed. However, when we evaluate it using multivariate scores and verification tools, we can see the benefit of ECC. It is a computationally efficient and straightforward method to preserve spatial and temporal features of the NWP model.
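A minimal sketch of this quantile-reordering (ECC-Q) step for one forecast trajectory, shown here for the gEMOS case; the function name and array layout are our own assumptions, and ties are broken by member order rather than at random for brevity:

```python
# ECC-Q sketch: reorder equidistant EMOS quantiles by the raw ensemble ranks.
import numpy as np
from scipy.stats import truncnorm

def ecc_q(raw_ens, mu, sigma):
    """raw_ens: (n_lead, 12) raw members; mu, sigma: (n_lead,) gEMOS parameters."""
    n_lead, m = raw_ens.shape
    probs = (np.arange(1, m + 1) - 0.5) / m          # equidistant quantile levels
    ecc_ens = np.empty_like(raw_ens, dtype=float)
    for l in range(n_lead):
        a = -mu[l] / sigma[l]                        # truncation bound at zero
        quantiles = truncnorm.ppf(probs, a, np.inf, loc=mu[l], scale=sigma[l])
        ranks = raw_ens[l].argsort().argsort()       # rank of each raw member
        ecc_ens[l] = quantiles[ranks]                # member i inherits its raw rank
    return ecc_ens
```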

ECC has been used in a variety of atmospheric and hydrological forecasting scenarios, e.g., Schuhen et al. (2012) and Hemri et al. (2015).

3.3 Rapid adjustment of forecast trajectories

RAFT (Schuhen et al., 2020) continually updates the forecast after it has been issued, using information from the part of the forecast trajectory that has already realized. Essentially, RAFT applies to any weather variable; therefore, we do not have to make many alterations to the method for temperature described in Schuhen et al. (2020). We treat all locations separately, as the local error characteristics vary greatly.
In this paper, there are two different RAFT concepts used: we call the standard method that adjusts the EMOS mean RAFT m , while RAFT ens applies to individual ensemble members drawn from the EMOS distribution. RAFT m therefore can only improve the deterministic forecast skill, whereas RAFT ens provides an adjusted empirical distribution spanned by the ensemble.
Both RAFT variants are based on the correlation between observed forecast errors at different lead times. We define the error $e_{t,l}$ at a particular lead time $l$, generated from a model run started at time $t$, as the difference between the forecast and the observation $y_{t+l}$:

$e_{t,l} = m_{t,l} - y_{t+l},$    (7)
$e_{t,l}^{(i)} = x_{t,l}^{(i)} - y_{t+l}, \quad i = 1, \ldots, 12.$    (8)

Equation (7) refers to the RAFT m approach, where $m_{t,l}$ is the mean of the EMOS distribution. For RAFT ens , we need to calculate the error for every ensemble member $x_{t,l}^{(i)}$ (Eq. 8). To obtain the mean of the truncated Gaussian distribution from the location and scale parameters $\mu$ and $\sigma^2$, we use the following relationship:

$m = \mu + \sigma \, \varphi(\mu/\sigma) / \Phi(\mu/\sigma).$    (9)

The functions $\varphi$ and $\Phi$ here denote the density and cumulative distribution function of the standard Gaussian distribution, respectively. Similarly, the mean of the truncated logistic distribution is

$m = s \, \log\!\left(1 + e^{\mu/s}\right) / (1 - p_0), \quad p_0 = \Lambda(-\mu/s).$    (10)

From the forecast errors $e_{t,l}$ and $e_{t,l}^{(i)}$, we generate the Pearson correlation coefficient matrix to establish the relationship between the 36 lead times. In the RAFT ens case this means looking at the correlation matrices of each ensemble member separately. The left column in Fig. 2 shows the gEMOS error correlation matrix for the weather station on The Cairnwell mountain in the Scottish Highlands. The top plot refers to RAFT m , while the bottom pictures the correlation for one member of the RAFT ens ensemble. All correlations shown are statistically significant at the 90% level. There is a good correlation between a sizeable number of lead times, which makes it possible to define an adjustment period for each lead time, telling us at what point in time to begin with the RAFT adjustments. While the adjustment period applies, we know that a previously observed error $e_{t,l^*}$ at lead time $l^* < l$ gives us reliable information about the future error $e_{t,l}$.
The RAFT model used to obtain the estimated future error $\hat{e}_{t,l}$ at $l = 1, \ldots, 36$ is based on linear regression with the observed error $e_{t,l^*}$ as predictor:

$e_{t,l} = \alpha + \beta \, e_{t,l^*} + \varepsilon,$    (11)

where the random error term $\varepsilon$ is normally distributed with mean zero. The regression coefficients $\hat{\alpha}$ and $\hat{\beta}$ are determined using least squares. Once we have estimated the RAFT regression coefficients for every possible combination of lead times $l$ and $l^*$, we can establish the length $p$ of the individual adjustment periods by looking at those combinations where the estimate of the coefficient $\hat{\beta}$ is significantly greater than zero, meaning that $e_{t,l^*}$ is likely to provide useful information for the prediction of $e_{t,l}$. The algorithm for determining the adjustment period corresponds to the one described in Schuhen et al. (2020). Finding the optimal adjustment periods concludes the estimation part of RAFT. The actual adjustment of the predicted forecast error happens in real time once the current model run has finished and the forecasts have been issued. For lead time $l$, the adjustment starts at $l - p + 1$, using the observation recorded at $l - p$, and then continues hourly until $l - 1$. The smaller the gap between $l$ and the time the observation was recorded, the greater the value of the error information and therefore the larger the gain in forecast skill.
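A sketch of this estimation step, assuming a matrix of training errors err[t, l] (model run t, lead time l); the helper name and the 5% level of the one-sided test are our own assumptions:

```python
# RAFT estimation sketch: pairwise error regressions between lead times.
import numpy as np
from scipy import stats

def raft_coefficients(err, level=0.05):
    """Fit e_{t,l} = alpha + beta * e_{t,l*} for all l* < l and flag the
    combinations where beta is significantly greater than zero."""
    n_runs, n_lead = err.shape
    alpha = np.full((n_lead, n_lead), np.nan)
    beta = np.full((n_lead, n_lead), np.nan)
    beta_sig = np.zeros((n_lead, n_lead), dtype=bool)
    for l in range(n_lead):
        for l_star in range(l):
            res = stats.linregress(err[:, l_star], err[:, l])
            alpha[l, l_star], beta[l, l_star] = res.intercept, res.slope
            t_stat = res.slope / res.stderr          # one-sided test on beta > 0
            beta_sig[l, l_star] = stats.t.sf(t_stat, df=n_runs - 2) < level
    return alpha, beta, beta_sig
```

The significance mask would then feed the adjustment-period algorithm of Schuhen et al. (2020), which is not reproduced here.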
In practice, we calculate the observed error according to Eq. (7), plug it into Eq. (11) with the appropriate coefficients $\hat{\alpha}$ and $\hat{\beta}$ and obtain the predicted error $\hat{e}_{t,l}$. This error forecast is then combined with the EMOS mean $m_{t,l}$ for RAFT m , or with the ensemble members $x_{t,l}^{(i)}$, $i = 1, \ldots, 12$, drawn from the EMOS distribution for RAFT ens , yielding the adjusted forecasts $\tilde{m}_{t,l}$ and $\tilde{x}_{t,l}^{(i)}$. Any values of $\tilde{m}_{t,l}$ and $\tilde{x}_{t,l}^{(i)}$ that become negative during this process are set to zero in order to account for the non-negativity of wind speed. While we can use the RAFT-adjusted mean $\tilde{m}_{t,l}$ as a deterministic forecast, the corresponding location parameter is needed to evaluate the full distribution. For this purpose, we solve Equations (9) and (10) numerically for $\mu$. This approach can be quite unstable and has to be done carefully so that the resulting distribution is valid. We then combine the new location parameter with the unchanged EMOS variance and thus obtain a predictive distribution. In the case of RAFT ens , the ensemble members span a discrete distribution. Therefore, we here not only adjust the deterministic forecast, but also simultaneously the spread of the distribution in an adaptive and flow-dependent way.
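The following sketch performs one RAFT m update and the numerical inversion of Eq. (9). It assumes the sign convention of Eq. (7), error = forecast − observation, so the predicted error is subtracted; the guard on the target mean is our own safeguard, reflecting the stability caveat above:

```python
# One RAFT_m adjustment plus recovery of the location parameter (sketch).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def truncnorm_mean(mu, sigma):
    """Mean of a Gaussian truncated at zero, Eq. (9)."""
    return mu + sigma * norm.pdf(mu / sigma) / norm.cdf(mu / sigma)

def raft_m_step(m, e_obs, alpha_hat, beta_hat, sigma):
    """Adjust the EMOS mean m and solve Eq. (9) for the new location."""
    e_hat = alpha_hat + beta_hat * e_obs       # predicted error, Eq. (11)
    m_new = max(m - e_hat, 0.0)                # truncate negative wind speeds
    # Invert Eq. (9): keep the target mean strictly positive so that a
    # root exists inside the bracket (cf. the stability remark above).
    target = max(m_new, 0.15 * sigma)
    mu_new = brentq(lambda mu: truncnorm_mean(mu, sigma) - target,
                    -8.0 * sigma, target + 8.0 * sigma)
    return m_new, mu_new
```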

4 Evaluation methods
There is a multitude of evaluation methods available to assess both deterministic and probabilistic forecast skill (see e.g., Thorarinsdottir and Schuhen, 2018). In addition to looking at univariate verification results, we also want to determine the benefit of various combinations of post-processing methods in a multivariate sense.
Proper scoring rules are useful tools that assign a numerical value to the quality of a forecast and always judge the optimal forecast to have the best score. Usually, they are averaged over a number of forecast cases $n$. In the deterministic case, the root-mean-square error (RMSE) gives an indication about the forecast accuracy of the mean forecast, be it the mean of a distribution or an ensemble mean. It is defined as

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left(\mathrm{mean}(F_k) - y_k\right)^2},$

where $y_k$ is the verifying observation corresponding to the predictive distribution $F_k$.
For evaluating probabilistic forecasts, the CRPS (Matheson and Winkler, 1976) is an obvious choice. Given the score's robustness, it is often used for parameter estimation, as in the two EMOS variants gEMOS and logEMOS described in Sect. 3.1.

A closed form of the CRPS for the truncated Gaussian was derived by Thorarinsdottir and Gneiting (2010) as

$\mathrm{CRPS}\left(\mathcal{N}^0(\mu, \sigma^2), y\right) = \sigma \, \Phi\!\left(\tfrac{\mu}{\sigma}\right)^{-2} \left[ \tfrac{y-\mu}{\sigma} \, \Phi\!\left(\tfrac{\mu}{\sigma}\right) \left\{ 2\,\Phi\!\left(\tfrac{y-\mu}{\sigma}\right) + \Phi\!\left(\tfrac{\mu}{\sigma}\right) - 2 \right\} + 2\,\varphi\!\left(\tfrac{y-\mu}{\sigma}\right) \Phi\!\left(\tfrac{\mu}{\sigma}\right) - \tfrac{1}{\sqrt{\pi}} \, \Phi\!\left(\sqrt{2}\,\tfrac{\mu}{\sigma}\right) \right],$

where $\Phi$ is the CDF and $\varphi$ the PDF of a standard normal distribution. For the truncated logistic distribution, a closed form is also available (Scheuerer and Möller, 2015); it involves the logit function and the values $p_0 = \Lambda(-\mu s^{-1})$ and $p_y = \Lambda((y - \mu) s^{-1})$ of the logistic CDF $\Lambda$. To be able to compare all types of forecasts in a fair way, we draw random samples $X_1, \ldots, X_{12}$ from every continuous predictive distribution and evaluate them using the ensemble version of the CRPS:

$\mathrm{CRPS}(F_{\mathrm{ens}}, y) = \frac{1}{12} \sum_{i=1}^{12} |X_i - y| - \frac{1}{2 \cdot 12^2} \sum_{i=1}^{12} \sum_{j=1}^{12} |X_i - X_j|.$    (19)

Furthermore, we want to assess the level of calibration in a forecast separately, as it is important to prefer the sharpest forecast subject to calibration. To this end, we check the verification rank histogram (Anderson, 1996; Hamill and Colucci, 1997; Talagrand et al., 1997), where we find the ranks of the observation within the forecast ensemble for each forecast case and plot them as a histogram. A ∩-shaped histogram points towards overdispersed forecast distributions, while a ∪ shape means that the forecasts do not exhibit enough spread. For perfect calibration, a flat histogram is a necessary condition, although not sufficient (Hamill, 2001).
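Eq. (19) translates directly into a few lines of Python; the helper name is ours:

```python
# Ensemble (sample-based) CRPS, Eq. (19).
import numpy as np

def crps_ensemble(members, y):
    """members: (m,) samples from the predictive distribution; y: observation."""
    term1 = np.mean(np.abs(members - y))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2
```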
The direct equivalent of the CRPS for multivariate forecasts, the energy score, is defined as

$\mathrm{ES}(F, \mathbf{y}) = \mathbb{E}\,\|\mathbf{X} - \mathbf{y}\| - \frac{1}{2}\,\mathbb{E}\,\|\mathbf{X} - \mathbf{X}'\|,$

where $\mathbf{X}$ and $\mathbf{X}'$ are independent random vectors drawn from the multivariate distribution $F$, $\mathbf{y}$ is the observation vector and $\|\cdot\|$ is the Euclidean norm. If we replace the absolute value in Eq. (19) with the Euclidean norm, we obtain an analogous version of the energy score for ensemble member vectors. It is also possible to evaluate deterministic forecasts in multiple dimensions using the Euclidean error, which we derive from the energy score by replacing the distribution $F$ with a point measure:

$\mathrm{EE}(F, \mathbf{y}) = \|\mathrm{med}(F) - \mathbf{y}\|.$

The multivariate point forecast $\mathrm{med}(F)$ is the spatial median, computed numerically using the R package ICSNP (Nordhausen et al., 2015). It minimizes the sum of the Euclidean distances to the ensemble members.
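The ensemble version of the energy score, obtained from Eq. (19) by replacing absolute values with Euclidean norms, can be sketched as follows; the array layout is our own assumption, and the spatial median for the Euclidean error is left to a dedicated routine such as ICSNP in R:

```python
# Ensemble energy score for 36-dimensional forecast trajectories (sketch).
import numpy as np

def energy_score(ens, y):
    """ens: (m, d) member trajectories; y: (d,) observed trajectory."""
    term1 = np.mean(np.linalg.norm(ens - y, axis=1))
    term2 = 0.5 * np.mean(np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2))
    return term1 - term2
```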
While the energy score is generally more sensitive to errors in the predictive mean (Pinson and Tastu, 2013), the variogram score proposed by Scheuerer and Hamill (2015) is better at identifying whether the correlation between the components is correct. In addition to following the authors' recommendation and setting the score's order $p$ to 0.5, we assign equal weights to all lead times. The variogram score then becomes

$\mathrm{VS}(F, \mathbf{y}) = \sum_{i=1}^{36} \sum_{j=1}^{36} \left( |y_i - y_j|^{0.5} - \mathbb{E}\,|X_i - X_j|^{0.5} \right)^2,$

where $y_i$ and $y_j$ are the $i$th and $j$th components of the observation vector and $X_i$ and $X_j$ components of a random vector distributed according to $F$.
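An ensemble-based sketch of this score, with the expectation replaced by the mean over members and our own function name:

```python
# Variogram score of order p = 0.5 with equal weights (sketch).
import numpy as np

def variogram_score(ens, y, p=0.5):
    """ens: (m, d) member trajectories; y: (d,) observation vector."""
    obs_vario = np.abs(y[:, None] - y[None, :]) ** p
    ens_vario = np.mean(np.abs(ens[:, :, None] - ens[:, None, :]) ** p, axis=0)
    return np.sum((obs_vario - ens_vario) ** 2)
```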
Finally, there are several possibilities to check multivariate calibration, like the multivariate rank histogram (Gneiting et al., 2008), the band depth histogram and the average rank histogram (both Thorarinsdottir et al., 2016). We choose to use the latter in this case, as it is less prone to give misleading results than the multivariate rank histogram and more easily interpretable than the band depth histogram. First, a so-called prerank is calculated for the observation vector and each ensemble member, corresponding to the average univariate rank of the vector's components within the pooled set of observation and members. The rank of the observation's prerank among all preranks then determines its entry in the histogram.
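A sketch of the prerank computation, following the description above; the implementation details (random tie resolution, function name) are our own:

```python
# Average rank of the observation within the ensemble (sketch).
import numpy as np
from scipy.stats import rankdata

def average_rank(ens, y, rng=None):
    """ens: (m, d) member trajectories; y: (d,) observation.
    Returns the observation's rank in 1..m+1 for the histogram."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.vstack([y, ens])                       # row 0 is the observation
    comp_ranks = np.apply_along_axis(rankdata, 0, pooled)
    preranks = comp_ranks.mean(axis=1)                 # average univariate rank
    below = np.sum(preranks < preranks[0])             # rank among the preranks,
    ties = np.sum(preranks == preranks[0]) - 1         # ties resolved at random
    return 1 + below + rng.integers(0, ties + 1)
```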

5 Results
It is the purpose of this paper to investigate whether there is a preferred order in applying three different kinds of post-processing methods. In particular, it will be important to see whether ECC should be run once, like EMOS, subsequent to the end of the NWP model run, or whether it should be continuously applied every time the RAFT adjustment occurs. Therefore, there are two combinations of methods to be tested: EMOS + RAFT m + ECC, where RAFT is applied to the EMOS mean only, and EMOS + ECC + RAFT ens , where we adjust the EMOS/ECC ensemble members and thus at the same time the prediction of uncertainty.

As we are interested in a comprehensive assessment of the individual combinations' performance, all scores, whether univariate or multivariate, are of equal importance.

5.1 Example forecast
First, we take a look at an example forecast to illustrate how the different RAFT variants work. Fig. 3 shows an example forecast for both post-processing combinations; the forecasts in the bottom two panels have been adjusted with the observed error measured two hours earlier and are the most short-term and therefore optimal RAFT forecasts.
In this weather situation, both mean forecasts initially underpredict the wind speed for roughly 12 hours starting from lead time 10, corresponding to the time between 1:00 and 13:00 UTC. A further period of underprediction occurs towards the end of the trajectory, from lead time 28. RAFT is able to recognize these problems quickly and corrects the underprediction quite well, as can be observed in the bottom two panels. However, as the observations are quite jumpy, the sign of the forecast error changes frequently during the adjustment process, and the RAFT mean trajectory thus can also exhibit more jumpiness than the initial EMOS mean. This could be addressed by, e.g., adding additional predictors to the RAFT linear regression model in Eq. (11).

There are only minor differences in the mean forecasts between the two post-processing method combinations, while their main difference lies in the derivation of the predictive variance. We can see that the size of the prediction interval for gEMOS + ECC + RAFT ens changes considerably between the first and the last RAFT adjustment. This is of course because the ensemble, and therefore the prediction interval spanned by its members, is continuously updated and adjusted in a flow-dependent manner.
For example, at the end of the trajectory the ensemble spread in Fig. 3c is much smaller than in Fig. 3d. In the case of gEMOS + RAFT m , the variance is not changed by RAFT and remains at the value originally estimated by EMOS.

5.2 Choice of EMOS model
As we tested two versions of EMOS using two different distributions to model the future wind speed observations, we are interested in which of these, if any, performs better. Initially, we compare the deterministic forecast skill of the EMOS mean and how it is improved by RAFT. In Fig. 4, the RMSE of the gEMOS and logEMOS mean, averaged over all sites and model runs, is shown. Both methods perform very similarly, but gEMOS seems to have a slight advantage overall, apart from the first three hours and the last hour. There is a small increase in the RMSE for logEMOS at lead time 23, which is almost certainly due to issues finding the minimum CRPS during the EMOS parameter estimation, where all lead times are handled separately.
The ranking of the two EMOS variants is preserved when applying RAFT m to the EMOS mean forecast. Figure 4a shows the RAFT RMSE if we stopped adjusting the forecasts at t + 15. This means that all forecasts left of the vertical line have been updated using the observation made two hours earlier, and all forecasts to the right of the line are adjusted using the most recent information available at t + 15, i.e., the observed error at t + 14. However, this only applies to those forecasts where lead time 14 lies in the respective adjustment periods. For all other forecasts, the scores for EMOS and RAFT coincide. It is noticeable that the forecast skill improves significantly as soon as we have information about the error in the current model run at t + 3.

Figure 4. (a) RAFT is only carried out until the adjustment at t + 15. (b) RAFT is carried out until its last iteration at t + 35.
The score remains at about the same level until t + 16, when it starts to deteriorate, but RAFT still has the advantage over the EMOS forecasts for another ten hours. In reality, however, we would run RAFT until the end of the forecast cycle, which is shown in Fig. 4b. Here, we can see a consistent improvement, especially at large lead times. We also see that the forecasts at lead times 25-26 have more skill than the ones at lead times 1-2, which leads to the conclusion that forecasts from a 24-hour-old model run are for a couple of hours more skillful than forecasts from the newest run.
The first column in Table 1 confirms these results. Here, the scores have been aggregated over all lead times, model runs and sites. In this table, only scores for RAFT forecasts that have been adjusted one hour previously are shown, i.e., the optimal forecasts. We test the significance of score differences by applying a permutation test based on resampling, as described in Heinrich et al. (2019) and Möller et al. (2013). Both EMOS methods increase the deterministic skill considerably when compared to the raw ensemble and are then further improved by applying RAFT m . While the RMSE for gEMOS is significantly better than for logEMOS, which we also see in the CRPS, there is almost no difference in the gEMOS + RAFT m and logEMOS + RAFT m scores. In terms of the CRPS, logEMOS + RAFT m has a slight advantage, with the difference being significant at the 95% level.
To confirm that the EMOS forecasts are indeed calibrated, we look at the verification rank histograms in Fig. 5a. While the raw ensemble is very underdispersive, as expected, both gEMOS and logEMOS forecasts are nearly calibrated. Both gEMOS and logEMOS histograms are again very similar, so we compute the coverage of the 84.6% prediction interval created by 12 ensemble members. From the results we see that logEMOS, with a value of 85.18%, is much closer to the nominal value than gEMOS with 80.56% and is therefore better calibrated. Figure 5b shows the histograms after we apply RAFT m . Whereas the EMOS variance was on average slightly too small before, it is now a little too big, indicated by the small hump in the middle.
This is due to the fact that we do not adjust the variance in this process, but the deterministic skill improves greatly. There is almost no difference between the two histograms, which is also evident in the coverage of the prediction interval, with values of 83.80% and 83.44% for gEMOS and logEMOS, respectively.
In conclusion, there is little difference in the overall performance of the two EMOS variants. While logEMOS has the advantage of being slightly better calibrated, gEMOS shows better scores. After applying RAFT, the two methods are essentially equal. In the following we will therefore only present results from one of the two EMOS versions.

5.3 Order of operation
The main focus of this study is to find out in which order EMOS, RAFT and ECC should be combined. For RAFT, we employ two different approaches: RAFT m , where the adjustments are only applied to the EMOS mean and are then combined with the EMOS variance to obtain a full predictive distribution, and RAFT ens , where we adjust individual ensemble members and consequently both mean and spread. In the latter case, ECC is only run once when EMOS has finished; in the former, it has to be applied at every RAFT step for the remaining lead times in the forecast run. Therefore, the required computing resources depend on the ratio of ensemble members to lead times. In this study, the EMOS + RAFT m + ECC combination takes about 33% more time to compute than EMOS + ECC + RAFT ens ; however, at only a few seconds per model run and site, both are computationally very cheap.
When we compare Figures 5b and 5c, it is obvious that the EMOS + ECC + RAFT ens combination produces forecasts which are slightly less calibrated than EMOS + RAFT m forecasts. In fact, the level of calibration deteriorates from the baseline EMOS methods. This can also be deduced from the CRPS values in Table 1, where EMOS + RAFT m clearly performs better.
The RMSE for both methods is quite similar, so that we can ascribe the discrepancy in the CRPS to the different levels of calibration. Both methods improve the EMOS baseline forecast considerably. In the case of EMOS + RAFT m , we know this improvement in forecast skill is only due to the adjusted mean forecast, which simultaneously results in better calibrated predictive distributions.

As we are not only interested in the univariate performance, but also in the multidimensional dependencies between the lead times of a forecast trajectory, we look at several multivariate scores (Table 1). The Euclidean error agrees with the univariate RMSE that the best deterministic result can be achieved by applying RAFT last. ECC seems not to have any effect on these scores, which can be expected, as we are only rearranging ensemble members and do not necessarily change the multivariate median.

Table 1. Univariate and multivariate mean scores for different post-processing method combinations, using the final RAFT adjustments one hour before valid time. Bold numbers denote the best score in each column. All score differences are significant at the 95% level, apart from the ones marked with an asterisk, where the pairwise differences between the versions using gEMOS and logEMOS are not significant.

The energy score is a measure for the overall skill, but is also more sensitive to errors in the mean forecast. This is the reason why RAFT m manages to improve the energy score as compared to EMOS + ECC, while the variogram score deteriorates. Note that both scores are reduced when we reintroduce the temporal correlation structure by applying ECC to the EMOS + RAFT m forecasts. Although the energy and variogram scores for EMOS + ECC + RAFT ens and EMOS + RAFT m + ECC are very close, the two scores prefer different post-processing method combinations. While the energy score judges the method to be the best where we apply ECC first, which also has the best RMSE and Euclidean error, the variogram score assigns the lowest value to the better calibrated EMOS + RAFT m + ECC. The almost identical variogram scores suggest that RAFT ens manages to preserve the multivariate correlation structure throughout its multiple iterations.

The average rank histograms in Fig. 6 confirm that without applying ECC, the EMOS and EMOS + RAFT m forecasts are very uncalibrated. Both the EMOS + ECC and EMOS + ECC + RAFT ens combinations show a ∪-like shape, which can be interpreted as either a too strong correlation or underdispersion. From the band depth histogram (not shown; see Thorarinsdottir et al., 2016) we can conclude that the latter is the case here, as was also seen in the univariate histograms. On the other hand, the EMOS + RAFT m + ECC forecast ranks form a hump-like histogram. This is due to the correlation between the components being too weak, again confirmed by the band depth histogram.
In order to investigate further the optimal order of operation when applying multiple post-processing methods, we look at how the scores develop with every step in the RAFT process. While the scores in Table 1 are computed using the final RAFT installment at t + 35, where all forecasts have been adjusted using the observation made two hours earlier, Fig. 7 shows the CRPS, energy score and variogram score computed at each RAFT iteration for the gEMOS + ECC + RAFT ens and gEMOS + RAFT m + ECC forecasts. From Fig. 7a, it is clear that EMOS + RAFT m + ECC performs best in terms of the CRPS, with the gap between the two combinations widening with increasing number of RAFT adjustments. As we have also seen that the RAFT m version is better calibrated than the RAFT ens one, this means that the CRPS here puts more weight on the calibration being correct than on the slightly better deterministic forecast (see the RMSE in Table 1) in the RAFT ens case. This is surprising, given that the CRPS and its multivariate equivalent, the energy score, are usually more sensitive to the error in the forecast mean (see Fig. 4 in Friederichs and Thorarinsdottir (2012) and Pinson and Tastu (2013)).
While the CRPS results show a clear pattern, it is not as straightforward for the energy score. The mean score decreases with every RAFT adjustment, as expected, but there is no discernible difference in the performance of the two post-processing method combinations. The most complex picture emerges in the case of the variogram score, where the ranking of the two combinations actually switches around RAFT iteration 24. The variogram score is better at detecting incorrect correlation structures than the energy score, so one possible explanation would be that EMOS + ECC + RAFT ens is initially good at retaining the appropriate correlations, but that ability weakens over time. Conversely, ECC is applied after every iteration of RAFT m , which might explain the better variogram scores towards the end of the process. However, we have observed in Fig. 6 that the correlation structure at the last iteration is still too weak. It should also be noted that the variogram score for EMOS + RAFT m + ECC deteriorates slightly at the beginning.
Finally, we want to investigate the homogeneity of the scores across the different locations and highlight some interesting results for particular sites. In Fig. 8a, we see that RAFT m improves the CRPS for all sites as compared to the EMOS baseline.
That means that the method where the variance is not adjusted increases the deterministic and probabilistic forecast skill at all sites. As we have seen from the univariate histograms in Fig. 5, even the calibration is improved. Figure 8b shows how a reduction in the mean error can have a large effect on the energy score. At 37 sites, the energy score for gEMOS + RAFT m is actually lower than the one for gEMOS + ECC. The former forecasts are lacking any form of temporal coherency among lead times, so here the deterministic improvement exceeds any benefit from reintroducing the ensemble's correlation information.
Judging from Fig. 8c, a case can be made for a site-specific RAFT approach. The mean variogram score for the gEMOS + RAFT m + ECC forecasts at The Cairnwell, Scotland is higher than the score for gEMOS + ECC, meaning that we are not able to make any improvements by applying RAFT m and that there are local effects not resolved by the RAFT model.

6 Conclusions
Our goal was to find out in which order post-processing methods pertaining to different stages in the forecasting process should be applied. We look at three techniques, each with a different objective. EMOS is a versatile method aiming to calibrate ensemble output as soon as the model run is finished, based on the ensemble's performance over the last 40 days. There are two candidates for wind speed calibration: gEMOS uses a model based on truncated Gaussian distributions and logEMOS a model based on truncated logistic distributions. It turns out that both models produce very similar results, with gEMOS having slightly better scores and logEMOS being a little closer to perfect calibration. Therefore, it is advised to test both methods for the data set at hand and to check which distribution gives a better fit.

The second technique, ECC, restores the multivariate dependency structure present in the ensemble forecasts to the EMOS predictions. While conceptually and computationally easy to implement, the success of ECC relies on the NWP model getting the physical, spatial and temporal correlations between the components right. Making use of the part of a forecast trajectory that has already verified, RAFT is based on the concept that an observed error will provide information about the expected error at not-yet-realized lead times. It can be applied either to the forecast mean only (RAFT m ) or to a set of ensemble members (RAFT ens ) in order to adjust both predictive mean and variance.
In essence, there are two feasible options when combining these three methods: EMOS + ECC + RAFT ens and EMOS + RAFT m + ECC. Overall, their performance is very similar, but there are subtle differences which can lead to preferring one method over the other. The EMOS + RAFT m + ECC variant produces a lower CRPS and has better univariate calibration, although this is most likely a feature of this forecasting system only, where the EMOS forecasts are underdispersive. Naturally, the RAFT ens adjusted predictive variance becomes smaller with every RAFT step, as predictability usually increases with a shrinking forecast horizon. This, however, leads to the respective distributions still being underdispersed and unable to counterbalance the deficit of the EMOS forecasts.
If multivariate coherency is of particular importance, e.g., to create plausible forecast scenarios, the EMOS + ECC + RAFT ens combination turns out to be the better choice, as it beats its alternative in terms of the energy score, the Euclidean error and also the RMSE, while there is only very little difference in the variogram score. It is also more versatile and should be preferred for NWP ensembles exhibiting calibration characteristics very different from those of MOGREPS-UK.
Therefore, it is necessary to study every forecasting scenario closely, monitor how calibration methods like EMOS affect the forecast skill and identify potentially remaining deficiencies. As a rule of thumb, it can be said that the post-processing method designed to address one's particular area of interest, whether univariate or multivariate, should be applied first. So far, we do not adapt RAFT to optimize forecasts at individual sites. A model tailored to specific local characteristics could involve changing the algorithm for finding the adjustment period or adding suitable predictors to the linear model. Also, particular