Statistical post-processing of ensemble forecasts, from simple linear regressions to more sophisticated techniques, is now a well-known procedure for correcting biased and poorly dispersed ensemble weather predictions. However, practical applications in national weather services are still in their infancy compared to deterministic post-processing. This paper presents two different applications of ensemble post-processing using machine learning at an industrial scale. The first is a station-based post-processing of surface temperature and subsequent interpolation to a grid in a medium-resolution ensemble system. The second is a gridded post-processing of hourly rainfall amounts in a high-resolution ensemble prediction system. The techniques used rely on quantile regression forests (QRFs) and ensemble copula coupling (ECC), chosen for their robustness and simplicity of training regardless of the variable subject to calibration.

Moreover, variants of the classical techniques, QRF and ECC, were developed to meet operational constraints. A forecast-anomaly-based QRF is used for temperature, giving better predictions of cold and heat waves. A variant of ECC was built for hourly rainfall, producing more realistic rainfall accumulations over longer periods. We show that both forecast quality and forecast value are improved compared to the raw ensembles. Finally, we comment on model size and computation time.

Ensemble prediction systems (EPS) are now well-established tools that enable the uncertainty of numerical weather prediction (NWP) models to be estimated. They can provide a useful complement to deterministic forecasts. As recalled by numerous authors

Numerous statistical ensemble post-processing techniques are proposed in the literature and show their benefits in terms of predictive
performance. A recent review is available in

NWS data-science teams have investigated ensemble post-processing with different and complementary techniques, according to their computational resources, the NWP models to be corrected, their data policies, and their forecast users and targets; see e.g.

Regarding statistical post-processing for temperatures, a recent non-parametric technique such as quantile regression forests

For trickier variables where the choice of a conditional distribution is less obvious, such as rainfall,

In this paper, we present two examples of deployment of ensemble post-processing in the French NWS operational forecasting
chain in order to provide gridded post-processed fields. The two examples are complementary.

A station-based calibration, using local QRF, of the ARPEGE global EPS surface temperature over western Europe

A grid-based calibration, using QRF EGP TAIL, of the high-resolution AROME EPS hourly rainfall over France

Flowchart of the temperature post-processing chain.

Flowchart of the hourly rainfall post-processing chain.

This paper is organized as follows: Sects. 2 and 3 are devoted, respectively, to the complete post-processing chains for surface temperature and
hourly rainfall, shown in two flowcharts (Figs.

We present here the French ARPEGE global NWP model, for temperature calibration.

The ARPEGE NWP model

Location of stations on the target grid.

Predictors involved in station-based PEARP post-processing. The target variable is surface temperature.

We can assume that this data set is less abundant than in

Based on the work of

Binary decision trees are prone to unstable predictions, in that small variations in the training data can result in
a completely different tree. In random forests,
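The leaf-pooling idea behind quantile regression forests can be sketched with a standard random forest: instead of averaging leaf means, the training observations that share a leaf with the query point are pooled and empirical quantiles are read off. This is a simplification of the proper leaf-weighted scheme; the data and hyperparameters below are purely illustrative.

```python
# Toy quantile regression forest: grow an ordinary random forest, then,
# for a new point, pool the training targets that fall in the same leaves
# and read off empirical quantiles (simplified leaf-pooling variant).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data with heteroscedastic noise, so quantiles are informative.
X = rng.uniform(0.0, 10.0, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1 + 0.1 * X[:, 0])

forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X, y)

def qrf_predict(forest, X_train, y_train, x_new, quantiles):
    """Pool the training targets sharing a leaf with x_new in each tree."""
    train_leaves = forest.apply(X_train)                 # (n_train, n_trees)
    new_leaves = forest.apply(x_new.reshape(1, -1))[0]   # (n_trees,)
    pooled = np.concatenate([
        y_train[train_leaves[:, t] == new_leaves[t]]
        for t in range(train_leaves.shape[1])
    ])
    return np.quantile(pooled, quantiles)

q10, q50, q90 = qrf_predict(forest, X, y, np.array([8.0]), [0.1, 0.5, 0.9])
print(q10, q50, q90)  # a full predictive distribution, not just a mean
```

Because whole conditional samples are kept, the returned distribution can be multi-modal or skewed, unlike a fixed parametric fit.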

Two-dimensional example of a

When a new set of predictors

See, for example,

A direct application of the QRF algorithm for forecasting temperature distributions is suboptimal. Indeed, although QRF is able to return weather-related
features such as multi-modalities, alternative scenarios, and skewed distributions, the method cannot go beyond the range of the training data.
In the operational chain, the QRF algorithm is therefore not trained on observations but on the errors between the observations and the
ensemble forecast mean. The result of Eq. (
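A minimal sketch of why training on errors relative to the ensemble mean lets forecasts escape the climatological range of the archive; here `np.quantile` of archived errors stands in for the QRF, and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training archive: ensemble-mean forecasts and matching observations.
ens_mean_train = rng.normal(15.0, 8.0, 500)
obs_train = ens_mean_train + rng.normal(0.5, 1.5, 500)  # biased, noisy truth

# Anomaly-based idea: learn the distribution of the error obs - ens_mean,
# not of the observations themselves (np.quantile stands in for the QRF).
errors = obs_train - ens_mean_train
err_q = np.quantile(errors, [0.1, 0.5, 0.9])

# At forecast time the quantiles are re-centred on today's ensemble mean.
# During an unprecedented heat wave the raw mean exceeds everything in the
# archive, and the calibrated quantiles follow it beyond the training range.
ens_mean_today = obs_train.max() + 5.0
calibrated = ens_mean_today + err_q
print((calibrated > obs_train.max()).all())  # True
```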

The ensemble copula coupling method
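The core reordering step of ECC can be sketched in a few lines (toy numbers; in practice this is applied independently at each station or grid point and lead time):

```python
import numpy as np

def ecc(raw_members, calibrated_sample):
    """ECC: sort the calibrated values, then reorder them so that they
    inherit the rank order (dependence structure) of the raw ensemble."""
    ranks = np.argsort(np.argsort(raw_members))   # rank of each raw member
    return np.sort(calibrated_sample)[ranks]

raw = np.array([3.0, 1.0, 2.0, 4.0])        # raw ensemble at one point
cal = np.array([10.0, 30.0, 20.0, 40.0])    # calibrated sample, any order
print(ecc(raw, cal))  # [30. 10. 20. 40.] -- same ordering pattern as raw
```

Applied everywhere independently, this restores spatially and temporally coherent member fields from the univariate calibrated distributions.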

The problem at hand is challenging.

The domain covers a large part of western Europe, from coastal regions to Alpine mountainous regions, subject to various climate conditions (oceanic, Mediterranean, continental, Alpine).

Data density is very inhomogeneous (from the high density of stations over France to the somewhat dense network over the UK, Germany, and Switzerland and the sparse density over Spain and Italy).

Interpolation has to be extremely fast, since more than 1824 high-resolution spatial fields have to be produced in a very short time.

Therefore, a new technique has been developed, very similar to “regression kriging”, based on the following principle: at each station location, perform a regression between post-processed temperatures and raw NWP temperatures, using additional
gridded predictors as well. The resulting equation is then applied to the whole grid to produce a spatial trend estimation.
Regression residuals at station locations are then interpolated. Spatial trend and interpolated residuals are summed to produce
the resulting field. Interpolation of residual fields is performed using an automated multi-level B-spline analysis
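The trend-plus-residual principle can be sketched as follows, with ordinary least squares for the trend and inverse-distance weighting standing in for the operational residual interpolation; the altitude predictor and station sample are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target grid and a gridded predictor (synthetic altitude, in metres).
ny, nx = 50, 50
yy, xx = np.mgrid[0:ny, 0:nx]
altitude = 500.0 * np.exp(-((xx - 25) ** 2 + (yy - 25) ** 2) / 200.0)

# "True" temperature: a lapse-rate trend plus a smooth local anomaly.
truth = 20.0 - 0.0065 * altitude + 2.0 * np.sin(xx / 10.0)

# Station sample drawn from the grid.
si, sj = rng.integers(0, ny, 200), rng.integers(0, nx, 200)
t_obs = truth[si, sj]

# 1) Regression of station values on the gridded predictor -> spatial trend.
A = np.column_stack([np.ones(si.size), altitude[si, sj]])
coef, *_ = np.linalg.lstsq(A, t_obs, rcond=None)
trend = coef[0] + coef[1] * altitude

# 2) Interpolate station residuals over the grid (IDW here; the operational
#    chain uses multilevel B-spline approximation instead).
resid = t_obs - trend[si, sj]
d2 = (yy[..., None] - si) ** 2 + (xx[..., None] - sj) ** 2 + 1e-6
w = 1.0 / d2
resid_field = (w * resid).sum(-1) / w.sum(-1)

# 3) Sum trend and interpolated residuals to get the final field.
field = trend + resid_field
print(np.abs(field - truth).mean())  # small mean reconstruction error
```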

Several studies have investigated the complex relationships between topography and meteorological parameters; see e.g.

Topographical parameters include altitude, distance to coast and additional parameters
computed following the AURELHY method

Altitude

For the interpolation of climate data, usually only topographic data are available, and these play the role of ancillary data in estimating the spatial trend. In our case, another important source of information is provided
by the NWP temperature field at the corresponding lead time for each member. However, PEARP data cannot be used directly, since their resolution is coarser than the target resolution (7.5

Domains used for spatialization of post-processed temperatures.

Since interpolation is to be performed on a very large domain, with greatly varying data density,
several regressions are computed on smaller sub-domains denoted by

For a given base time

We denote

The model estimation of parameters

We aim to use an exact, automatic, and fast method for residual interpolation. Although thin-plate splines (TPS) and kriging can be computed in an automated way, those methods do not meet our computation-time criteria.

While not strictly an exact interpolation method, the multilevel B-spline approximation (MBA) algorithm was chosen because
it is extremely fast. Furthermore, its degree of smoothness and exactness
can be precisely controlled, as recalled by

A precise description of this method is beyond the scope of this article.
We just briefly recall that the MBA algorithm relies on a uniform bicubic B-spline surface passing through the set of scattered data to be interpolated. This surface is
defined by a control lattice containing weights related to B-spline basis functions, whose sum yields the surface approximation. Since there is a tradeoff between the smoothness and the accuracy
of B-spline approximation, MBA takes advantage of a multiresolution scheme: it uses a hierarchy of control lattices, from coarser to finer, to estimate a sequence of B-spline approximations whose sum achieves the expected degree of smoothness and accuracy.
Refer to
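The multiresolution principle can be conveyed with a toy sketch in which simple Gaussian-weighted averaging stands in for the B-spline control lattices of the real MBA; only the coarse-to-fine residual-correction structure is faithful, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Scattered data to interpolate (toy residual field).
xs, ys = rng.uniform(0, 1, 150), rng.uniform(0, 1, 150)
vals = np.sin(4 * xs) * np.cos(3 * ys)

# Target grid.
gx, gy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))

def level_approx(px, py, r, qx, qy, h):
    """One approximation level: Gaussian-weighted average at bandwidth h
    (a stand-in for one B-spline control-lattice level of MBA)."""
    d2 = (qx[..., None] - px) ** 2 + (qy[..., None] - py) ** 2
    w = np.exp(-d2 / (2.0 * h * h))
    return (w * r).sum(-1) / (w.sum(-1) + 1e-12)

# Coarse-to-fine loop: each level approximates what the coarser levels
# left unexplained, with the bandwidth (lattice spacing) halved each time.
surface = np.zeros_like(gx)
fit_at_pts = np.zeros_like(vals)
for level in range(6):
    h = 0.5 / 2 ** level
    resid = vals - fit_at_pts
    surface += level_approx(xs, ys, resid, gx, gy, h)
    fit_at_pts += level_approx(xs, ys, resid, xs, ys, h)

print(np.abs(vals - fit_at_pts).max())  # shrinks as levels are added
```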

During testing, we found that 13 approximations were sufficient to ensure
a quasi-exact interpolation (magnitude of error, around 0.0001

An important point at the practical level is that the interpolation of
residuals is performed only once on the whole grid.
We found that undesirable boundary effects could appear at the edges of domains

Results of PEARP post-processing of temperature in the 2056 EURW1S100 stations with average CRPS

We present here the results of the post-processing of PEARP temperature in EURW1S100 stations. The hyperparameters for QRF are
derived from

The gain in CRPS is obvious after calibration, whatever the base and lead times. Moreover, the hierarchy among base times is maintained. In both panels b and c, post-processed ensembles are unbiased and well dispersed, in contrast to raw ensembles, which exhibit (cold, with diurnal cycle) bias and underdispersion. Nevertheless, we notice that post-processed distributions show a slight underdispersion at the longest lead times. This is due to the absence of
predictors coming from the deterministic ARPEGE model. These predictors do not relate directly to temperature, and thus the addition of weather-related predictors
is crucial here for accounting for uncertainty. We believe that radiation predictors are the most important here, since their presence or absence
is linked to the “roller coaster” behaviour of post-processed PIT dispersion around a 3

Prior to any use in the spatialization of post-processed PEARP fields, the performance of the interpolation method was evaluated on deterministic forecasts.

This paragraph is devoted to the evaluation of an earlier version of the current spatialization algorithm over France, which differs only in that NWP temperature fields are not available in the predictor set for spatial trend estimation. Benchmarking data consist of 100 forecasts. For each date, 20 cross-validation samples are randomly generated, removing 40 points from the full set of points. Original forecast values and interpolated forecast values are then compared, and standard scores (bias, root mean square error, mean absolute error, 0.95 quantile of absolute error) are computed. The scores are then compared to those of COSYPROD, the previous operational interpolation method. COSYPROD is a quick interpolation scheme, predating both the first and current versions of our algorithm, adapted to interpolation at a set of production points and derived from the inverse-distance weighting (IDW) method.
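The cross-validation protocol above can be sketched as follows; a simple IDW interpolator stands in for the scheme under test, and the station coordinates and "forecast" values are invented.

```python
import numpy as np

rng = np.random.default_rng(4)

def interpolate(xy_known, v_known, xy_target):
    """Placeholder spatial interpolator (inverse-distance weighting);
    the benchmarked scheme would be plugged in here."""
    d2 = ((xy_target[:, None, :] - xy_known[None, :, :]) ** 2).sum(-1) + 1e-9
    w = 1.0 / d2
    return (w * v_known).sum(-1) / w.sum(-1)

# Toy station network: coordinates and a smooth "forecast" value.
xy = rng.uniform(0.0, 100.0, (300, 2))
v = np.sin(xy[:, 0] / 20.0) + np.cos(xy[:, 1] / 30.0)

# For each of 20 cross-validation samples, remove 40 stations,
# re-interpolate them from the rest, and score the reconstruction.
scores = []
for _ in range(20):
    held = rng.choice(300, size=40, replace=False)
    kept = np.setdiff1d(np.arange(300), held)
    err = interpolate(xy[kept], v[kept], xy[held]) - v[held]
    scores.append((err.mean(),                       # bias
                   np.sqrt((err ** 2).mean()),       # RMSE
                   np.abs(err).mean(),               # MAE
                   np.quantile(np.abs(err), 0.95)))  # 0.95 quantile of |err|
bias, rmse, mae, q95 = np.mean(scores, axis=0)
print(f"bias={bias:.3f}  rmse={rmse:.3f}  mae={mae:.3f}  q95={q95:.3f}")
```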

Results show that, regardless of the method, bias remains low, but the new spatialization method outperforms COSYPROD in terms of root mean square error, mean absolute error, and 0.95 quantile of absolute error (Fig.

Boxplots of bias

Additionally, the described spatialization procedure has been used operationally
for the interpolation of deterministic temperature forecasts since May 2018. In this application,
its performance was evaluated routinely over a large set of climatological stations,
which only measure extreme temperatures and do not provide real-time data. Hence, this data set is excluded from any post-processing, but it may serve as an independent validation data set. When comparing forecast performances on this data set, the increase in root mean square error is around 0.3

Step-by-step procedure illustrated over the south-east of France: raw member temperatures on a 7.5

Resulting field over the south-east of France.

Raw PEARP member 6 temperature field

The whole procedure is illustrated on PEARP temperatures for base time
3 October 2019, 18:00 UTC, and lead time 42

Note that field values are modified during the calibration process, but ECC and interpolation maintain the main features of the original field, here the passage of a front, which is not situated at the same location for the two members.

We present the high-resolution limited-area NWP model AROME, used for the post-processing of hourly rainfall.

The AROME non-hydrostatic NWP model

We reduce spatial penalty issues due to the high resolution of the raw EPS

We improve ensemble sampling and, we hope, the quality of predictors.

We reduce computational costs by a factor of 25.

Predictors involved in HCA-based PEAROME post-processing. The target variable is hourly rainfall.

The predictor set is smaller here than in

Note in Eq. (

The anomaly-based QRF approach is not employed for hourly rainfall. We believe that choosing a centering variable is as difficult as choosing a good parametric distribution for the predictive distributions. In the case of hourly rainfall, the adjustments concern not the method itself, but rather the construction of the training data.

For each HCA, we consider predictors calculated with the 400-member pseudo-ensemble. For each HCA of size

As already observed by

In our case, 400 values have to be attributed to the 16 members at the 25 grid points of the HCA. The procedure, called bootstrapped-constrained ECC (bc-ECC),
is as follows.

If

If not, we perform ECC many times (here 250 times per HCA) and average values.

Then, a raw zero becomes a non-zero only if there is a raw non-zero within a 3-grid-point neighbourhood on the raw grid.

Example of bc-ECC (

As a result, post-processing can introduce rain at a grid box that is dry in a raw member only if there is a grid point with rain nearby in that raw member. This approach ensures, for example, coherent scenarios between post-processed rainfall fields and raw cloud cover fields.
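Our reading of the bc-ECC steps can be sketched on one toy HCA. Gamma draws stand in for both the raw ensemble and the QRF output, the tie-breaking jitter is one possible way to randomize ECC over the many raw zeros, and the 5x5 layout with a radius-1 neighbourhood is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)

n_members, n_points = 16, 25                 # one HCA: 16 members x 25 points
raw = rng.gamma(0.4, 2.0, (n_members, n_points))
raw[raw < 1.0] = 0.0                         # toy raw hourly rain, many zeros

# 400 exchangeable calibrated values for the HCA (a gamma sample stands in
# for the QRF predictive distribution).
calibrated = np.sort(rng.gamma(0.5, 2.5, n_members * n_points))

def ecc_once(raw_flat, calibrated_sorted, rng):
    """One ECC draw: break ties among raw zeros at random, then hand the
    sorted calibrated values the rank order of the raw values."""
    jitter = rng.uniform(0.0, 1e-6, raw_flat.size)
    ranks = np.argsort(np.argsort(raw_flat + jitter))
    return calibrated_sorted[ranks]

# Bootstrapped ECC: average many draws so the arbitrary assignment of low
# calibrated values to tied raw zeros washes out.
draws = np.stack([ecc_once(raw.ravel(), calibrated, rng) for _ in range(250)])
post = draws.mean(axis=0).reshape(n_members, n_points)

# Zero constraint: a raw-zero point may receive rain only if some raw
# non-zero lies in its neighbourhood (5x5 layout, radius 1, per member).
post5 = post.reshape(n_members, 5, 5)        # view: edits also change `post`
for m in range(n_members):
    pad = np.pad(raw.reshape(n_members, 5, 5)[m] > 0, 1)
    near_wet = np.zeros((5, 5), dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            near_wet |= pad[1 + di:6 + di, 1 + dj:6 + dj]
    post5[m][~near_wet] = 0.0

print(post.shape)  # (16, 25)
```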

Due to the high amount of data to process for evaluation (around 200

The averaged CRPS between the raw and post-processed ensembles is improved by approximately 30 % (from 0.118 to 0.079).
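The score behind this comparison, the ensemble CRPS, can be computed directly from its kernel form E|X - y| - 0.5 E|X - X'| (the standard empirical version, not the "fair" variant; the numbers are invented):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble for one observation:
    mean |x_i - y|  -  0.5 * mean |x_i - x_j| over all member pairs."""
    members = np.asarray(members, dtype=float)
    term1 = np.abs(members - obs).mean()
    term2 = np.abs(members[:, None] - members[None, :]).mean()
    return term1 - 0.5 * term2

# A better-centred, sharper ensemble scores lower (CRPS is negatively oriented).
obs = 2.0
raw = np.array([0.0, 0.0, 0.5, 1.0, 4.0, 6.0])    # biased, overspread
post = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])   # calibrated-looking
print(crps_ensemble(raw, obs) > crps_ensemble(post, obs))  # True
```

Averaging this score over grid points and lead times gives the kind of aggregate improvement quoted above.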

Receiver operating characteristic (ROC) curve and reliability diagram

Figure

Regarding intense precipitation, the focus is placed on forecast value. Figure

Maximum of the Peirce skill score among thresholds; the improvement is mainly due to the improvement of the hit rate. The validation is made by a 2-fold cross-validation on the 2 years of data (one sample per year).

CRPS of daily distributions during October 2019.

Daily rainfall from the raw and post-processed ensembles was compared in the pre-operational chain during October 2019.
In Fig.

Illustration of a heavy precipitation event. On the left, rainfall accumulated on 22 October 2019, with peaks over 300

We then seek to determine whether calibrated hourly intensities lead to unrealistic or worse daily rainfall intensities than the raw ensemble. In other words,
does the bc-ECC generate coherent scenarios? First, in Fig.

The two applications described in this article (PEARP temperature and PEAROME rainfall post-processing)
are extremely computationally demanding and therefore could not be run on standard workstations within an acceptable timeframe.
Although the codes are implemented on Météo-France's supercomputer, a crucial optimization
phase was still required, as two problems had to be solved during the implementation phase.

The first problem is the very large number of high-resolution fields required: for each lead time, not only statistical fields (quantiles, mean, and standard deviation fields) but also calibrated member fields are computed.
This was achieved using inexpensive but efficient methods, such as ECC and MBA, and a massive parallelization of operations, thanks to R's high-performance-computing capabilities. The operational code relies on the parallel, foreach, doSNOW, and doMC packages, which enable multicore and multinode execution. The number of cores used in each node is driven by the memory occupation of each process. For example, PEARP temperature uses 4 HPC nodes in 25

The second problem is the huge size of the objects produced by quantile regression forests.
For a given base time, the PEARP temperature application requires around 300

In this article, we show that machine learning techniques allow a very large improvement of probabilistic temperature forecasts – a well-known result that can also be achieved with simpler methods such as EMOS. But while EMOS outputs follow simple and fixed parametric distributions, QRF produces distributions that may preserve the richness of the initial ensemble. Also, a simple method such as ECC coupled with our spatialization algorithm is able to restore realistic high-resolution temperature fields for each member.

Moreover, HCA-based QRF calibration can efficiently calibrate a much trickier parameter such as hourly rainfall accumulation, for which the signal for extremes is of special importance, and it provides realistic rainfall patterns that match the initial members.

In the context of forecast automation, it is important to identify the end users and their expectations in order to choose a method that balances complexity and
efficiency. In the same vein, minimizing an expected score may be less important than reducing big (and costly) mistakes. For example, the European Centre for Medium-Range Weather Forecasts (ECMWF) recently added the frequency of large errors in ensemble forecasts of surface temperature as a new headline score

Of course, the applicability of these methods is not restricted to temperature and rainfall. For any parameter
that can be interpolated rather easily (humidity, for example), our "temperature scheme" (calibration at station locations, ECC, and spatialization) may be applied. This approach is much less demanding in terms of computation time and disk storage. In addition,

The only limitations of post-processing are the availability of good gridded observations or a sufficiently dense station network, and the existence of relevant predictors produced by NWP. Those conditions may not yet always be fully achieved for parameters that remain challenging, such as visibility, for example.

Finally, as recalled in the discussion, production of high-resolution post-processed fields with such techniques has proven to be extremely demanding
in terms of CPU and disk storage. Moving the post-processing chain
to supercomputers is a challenging but fruitful investment:
the learning phase that could take weeks is now achieved in a few hours.
This provides extra possibilities for tuning parameters of powerful or promising statistical methods; as mentioned in

The research data come from the operational archive of Météo-France, which is free of charge for teaching and research purposes. Due to its size, we cannot deposit the data in a public data repository. You can find the open data services at

The supplement related to this article is available online at:

MT developed the station-wise post-processing of PEARP and the post-processing of PEAROME with bc-ECC. OM developed algorithms of interpolation of scattered data and ECC for temperatures. OM configured the operational chain for temperature. OM and MT currently configure the operational chain for rainfall. OM made figures for temperature. MT created the figures for rainfall and scores. OM and MT wrote the publication, each rereading the other's part.

Maxime Taillardat is one of the editors of the special issue.

This article is part of the special issue “Advances in post-processing and blending of deterministic and ensemble forecasts”. It is not associated with a conference.

The authors would like to thank the COMPAS/DOP team of Météo-France, and more particularly Harold Petithomme and Michaël Zamo, for their work on the R codes. The authors would also like to thank Denis Ferriol for his help during the set-up of the R codes on the supercomputer.

This paper was edited by Sebastian Lerch and reviewed by Jonas Bhend and one anonymous referee.