Calibrated ensemble forecasts of the height of new snow using quantile regression forests and Ensemble Model Output Statistics

Height of new snow (HN) forecasts help to prevent critical failures of infrastructure in mountain areas, e.g. transport networks and ski resorts. The French national meteorological service, Météo-France, operates a probabilistic forecasting system based on ensemble meteorological forecasts and a detailed snowpack model to provide ensembles of HN forecasts. These forecasts are however biased and underdispersed. As for many weather variables, post-processing methods can be used to alleviate these drawbacks and obtain meaningful 1-day to 4-day HN forecasts. In this paper, we compare the skill of two post-processing methods. The first approach is an ensemble model output statistics (EMOS) method, which can be described as a non-homogeneous regression with a censored shifted gamma distribution. The second approach is based on quantile regression forests, using different meteorological and snow predictors. Both approaches are evaluated using a 22-year reforecast. Thanks to a larger number of predictors, the quantile regression forest is shown to be a powerful alternative to EMOS for the post-processing of HN ensemble forecasts. The gain of performance is large in all situations but is particularly marked when raw forecasts completely miss the snow event. This type of situation happens when the rain-snow transition elevation is overestimated by the raw forecasts (rain instead of snow in the raw forecasts) or when there is no precipitation in the forecast. In that case, quantile regression forests improve the predictions using the other weather predictors (wind, temperature, specific humidity).

The forecasts are obtained by a chain of ensemble numerical simulations. 10-member reforecasts of the PEARP ensemble NWP (Descamps et al., 2015; Boisserie et al., 2016) are downscaled by the SAFRAN system (Durand et al., 1999) to obtain a meteorological forcing adjusted in elevation. The Crocus multilayer snowpack model, part of the SAFRAN-SURFEX/ISBA-Crocus-MEPRA (S2M) modelling chain, is forced by these forecasts to provide ensemble simulations of HN accounting for all the main physical processes explaining the variability of HN for a given precipitation amount: the dependence of falling snow density on meteorological conditions, the mechanical compaction over time depending on snow weight, the microstructure and wetness of the snow, possible surface melting, and so on. The forecasts used in this paper are the same as those used by Nousu et al. (2019), who provide more details on the models' configurations. Table 1 presents the selected predictors based on the available reforecasts. This selection is derived from the studies of Scheuerer and Hamill (2015) and Taillardat et al. (2019) for rainfall. It includes summary statistics and probabilities of the variable to be predicted (rainfall in the previous references, transposed into HN in our case). We also consider statistics of other weather variables of the ensemble suspected to add predictability, because they might affect the statistical relationship between observed and simulated HN. In this paper, for each station, we thus consider i = 1, ..., n = 3218 days with an observed HN Y_i (the response) and a vector of corresponding predictors X_i.

Ensemble model output statistics
Among the ensemble model output statistics (EMOS) methods available, non-homogeneous regression approaches are the most common and were originally based on Gaussian regressions whose mean and variance are linear functions of ensemble statistics (Gneiting et al., 2005; Wilks and Hamill, 2007). Non-homogeneous regression methods can also incorporate climatological properties and additional predictors. For meteorological predictands such as rainfall and snow, however, the high number of zero values motivates the use of a discrete-continuous distribution with a mass of probability at zero. In this study, we use a regression based on the censored shifted gamma distribution (CSGD; see Scheuerer and Hamill, 2015, 2018; Nousu et al., 2019).
The non-homogeneous regression method applied in this study is similar to the approach presented in Nousu et al. (2019) and is referred to as EMOS hereafter. Further details of this EMOS method are presented in Appendix A. More precisely, we detail how the CSGD is used to represent the predictive distribution of daily HN forecasts, the parameter estimation method, and the related predictive distribution. Please note that expression (A3) is slightly different from expression (4) in Nousu et al. (2019) and strictly corresponds to Scheuerer and Hamill (2018) (see their expression of σ in Section 3a, p. 1653). While this difference is not critical for the performances, expression (A3) avoids scaling issues in parameter β_2.

Quantile regression forest
Compared to the EMOS method, quantile regression forests (QRF) are expected to incorporate any predictor without degrading the quality of the predictions. Subsets of the space covered by the predictors are created in order to obtain homogeneous groups of observations inside these subsets. If the predictors include many meteorological forecasts, these subsets are expected to describe different meteorological situations. Compared to EMOS, this so-called non-parametric regression does not assume a particular distribution for the predictors or the response, and empirical distributions represent the uncertainty about the prediction.

The QRF method presented in this paper is based on the construction of binary decision trees, as proposed by Meinshausen (2006). These decision trees (Classification and Regression Trees, or CART for short; Breiman et al., 1984) are built by iteratively splitting the predictors' space (D_0) into two groups (D_1 and D_2) according to some threshold. The predictor and the threshold are chosen in order to maximize the homogeneity of the corresponding values of the response (here the observed HN) in each of the resulting groups, i.e. we want to minimize the sum of variances of the response variable within each group:

$$\mathrm{Var}(Y \mid D_j) = \frac{1}{|D_j|} \sum_{i:\, X_i \in D_j} \left( Y_i - \bar{Y}_{D_j} \right)^2, \quad j = 1, 2,$$

where $Y$ and $\bar{Y}_{D_j}$ correspond to the response sample and its mean in $D_j$, respectively. The optimal splitting predictor $j^*$ and threshold $s^*$ maximize the resulting decrease in heterogeneity:

$$(j^*, s^*) = \underset{j \in T^*,\, s}{\arg\max} \left[ \mathrm{Var}(Y \mid D_0) - \mathrm{Var}(Y \mid D_1) - \mathrm{Var}(Y \mid D_2) \right],$$

where $T^*$ is a random subset of the predictors in the predictors' space $T$. The trees are obtained by bootstrapping the training data, and each split of each tree is built on a random subset of the predictors, which justifies the name "random forest" (Breiman, 2001). The final "leaf" corresponds to the group of predictors at the end of each tree (see Fig. 1 in Taillardat et al., 2019, for an illustration).

Implementation
The QRFs are obtained using the function quantregForest of the package quantregForest in R (R, 2017). The splitting procedure described above can be constrained by different choices, e.g. a minimum number of observations in the leaves.
In this paper, we grow 1000 trees, which represents a sufficiently large number of trees to span a great variety of meteorological situations without demanding an unbearable computational effort. Different values have been tried for the parameters mtry, which specifies the number of predictors randomly sampled as candidates at each split (usually small, i.e. less than 10), and nodesize, which defines the minimum number of cases (days) in terminal nodes; the best performances were found for mtry=2 and nodesize=10 (see Section 6), as illustrated in the sketch below.
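As an illustration, the following minimal R sketch reproduces this configuration for one station. The objects X_train (a data frame of the predictors of Table 1) and Y_train (the corresponding observed HN values) are assumptions introduced here for the example, not objects from the paper's code.

```r
library(quantregForest)

# Hypothetical training data for one station: X_train is a data frame of
# predictors (one row per day), Y_train the vector of observed HN.
set.seed(1)  # the bootstrap samples make the forest stochastic
qrf <- quantregForest(x = X_train, y = Y_train,
                      ntree    = 1000,  # number of trees grown in the forest
                      mtry     = 2,     # predictors sampled as candidates at each split
                      nodesize = 10)    # minimum number of days in terminal nodes
```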

Predictive distribution
For QRF, the predictive distribution given a new set of predictors $x$ is the conditional cumulative distribution function (CDF) introduced by Meinshausen (2006):

$$\hat{F}(y \mid x) = \sum_{i=1}^{n} w_i(x)\, \mathbb{I}_{\{Y_i \le y\}},$$

where the weights $w_i(x)$ are deduced from the presence of $Y_i$ in a final leaf of each tree when one follows the path determined by $x$. In practice, the resulting forecast is a set of quantiles from $\hat{F}(y \mid x)$ obtained with the function predict.quantregForest from the R package quantregForest. Different quantiles are thus computed for synthetic graphical representations or for score calculations.
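For instance, a set of percentiles can be extracted as follows for a new day of predictors. X_new is an assumed one-row data frame with the same columns as the training predictors.

```r
# Quantiles of the predictive CDF for a new day of predictors (hypothetical X_new)
q <- predict(qrf, newdata = X_new, what = c(0.05, 0.25, 0.50, 0.75, 0.95))
# Each column of q corresponds to one requested quantile of F(y|x), e.g. the
# median HN forecast and the bounds of the 50% and 90% predictive intervals.
```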

5 Evaluation
This section details the process applied to assess the performance of the different approaches. Classical evaluation metrics include the continuous ranked probability score (CRPS), which sums up the forecast performance attributes in terms of both reliability and sharpness simultaneously (Murphy and Winkler, 1987; Hersbach, 2000; Candille and Talagrand, 2005). Rank histograms are also a common tool to assess systematic biases and over/underdispersion.

Cross-validation
For all the experiments of this study, we use a leave-one-season-out cross-validation scheme: for each of the 22 seasons, one season is used as a validation dataset while the other 21 seasons are used for training. This first ensures that a robust calibration of the post-processing methods is obtained. It also avoids evaluating the performances on a unique validation period that could be atypical (e.g. a very snowy/dry winter season). A minimal sketch of this scheme is given below.
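The following R sketch illustrates the scheme, assuming a vector season (values 1 to 22) indexing the days of the dataset and the X/Y objects introduced above; these names are illustrative.

```r
# Leave-one-season-out cross-validation (illustrative sketch)
pred <- vector("list", 22)
for (s in 1:22) {
  train <- season != s                       # 21 seasons for training
  qrf_s <- quantregForest(x = X[train, ], y = Y[train],
                          ntree = 1000, mtry = 2, nodesize = 10)
  pred[[s]] <- predict(qrf_s, newdata = X[!train, ],   # 1 season for validation
                       what = seq(0.01, 0.99, by = 0.01))
}
```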

CRPS
The CRPS is one of the most common probabilistic tools to evaluate the ensemble skill in terms of reliability (unbiased probabilities) and sharpness (ability to separate the probability classes). For a given forecast, the CRPS corresponds to the integrated quadratic distance between the cumulative distribution function (CDF) of the ensemble forecast and the CDF of the observation. Commonly, the CRPS is averaged over n days:

$$\mathrm{CRPS} = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{+\infty} \left[ F_i(x) - H(x - Y_i) \right]^2 \mathrm{d}x, \qquad (2)$$

where $F_i(x)$ is the CDF obtained from the ensemble forecasts for day $i$, $Y_i$ is the corresponding observation, and $H(z)$ is the Heaviside step function ($H(z) = 0$ if $z < 0$, and 1 otherwise). The CRPS value has the same unit as the evaluated variable and equals 0 for a perfect system.
For the EMOS-CSGD model described above, an analytic formulation of the CRPS is available (Scheuerer and Hamill, 2015) and a correct CRPS estimation is directly obtained.

In other cases, a correct evaluation of the CRPS defined in Eq. (2) can be difficult. For example, the raw ensemble does not provide a forecast CDF, but only a very limited ensemble of values. In this case, the CRPS is estimated with some error. In this study, we apply the recommendations given by Zamo and Naveau (2018). More specifically, when the forecast CDF is known only through an $M$-ensemble $x_1, \ldots, x_M$, we apply the following definition to estimate the instantaneous CRPS (i.e. for one ensemble):

$$\mathrm{CRPS}_{\mathrm{INT}}\left(\hat{F}_M, y\right) = \int_{-\infty}^{+\infty} \left[ \hat{F}_M(x) - H(x - y) \right]^2 \mathrm{d}x, \qquad (3)$$

where $\hat{F}_M$ is the empirical CDF of the ensemble and $y$ is the observation corresponding to the forecast ensemble. This expression is evaluated with the function crpsDecomposition of the R package verification. In the case of the instantaneous CRPS of the raw ensemble forecasts, Eq. (3) is applied directly, while some refinements can be made to improve the estimated CRPS in the case of QRF, which provides a much larger number of different quantiles (so-called orders) than what is available in the raw ensemble. Unfortunately, the set of possible quantiles and their corresponding orders cannot be known a priori, which represents an additional difficulty. To evaluate instantaneous CRPS values for QRF, we thus follow the recommendations of Zamo and Naveau (2018), i.e. we use the average CRPS_INT given in Eq. (3) with linearly interpolated regular quantiles between unique quantiles.
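As an illustration, the mean ensemble CRPS can be estimated as in the following sketch. The objects obs (vector of observed HN) and ens (matrix of ensemble forecasts, days in rows and members in columns) are assumed for the example.

```r
library(verification)

# Mean CRPS of the raw ensemble over all days (Eq. 3 applied day by day);
# obs and ens are hypothetical observation/forecast objects, and the CRPS
# component of the returned decomposition is extracted.
crps_raw <- crpsDecomposition(obs = obs, eps = ens)$CRPS
```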

Sharpness
While the CRPS is often used to verify the overall quality of the predictive distributions, it can also be interesting to assess the sharpness of the predictions. Gneiting et al. (2007) propose to look at the width of the predictive intervals for different nominal coverages (e.g. 50%, 90%).

Rank histograms
The reliability of ensemble forecast systems can be assessed using rank histograms (Hamill, 2001). If the predictive distributions obtained with the different post-processing methods are adequate, then the CDF values of the predictive distributions for the observations should be uniformly distributed (the so-called probability integral transform, PIT). The flatness of the histogram of these CDF values is a necessary but not sufficient condition of the system reliability. Systematic biases are detected by strongly asymmetric rank histograms. It is also an indicator of the spread skill, as underdispersion will result in a U-shaped rank histogram and overdispersion in a bell-shaped rank histogram. Rank histograms can be computed for the whole forecast dataset, or stratified according to different classes of average ensemble forecasts (stratifying according to the observations leading to erroneous conclusions, see Bellier et al., 2017). In this study, as proposed in the recent study of Bröcker and Bouallègue, the rank histograms are stratified according to classes of average ensemble forecasts.
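For illustration, PIT values can be computed from a large set of predicted quantiles, as in this simplified helper. The objects pred_q (matrix of predicted quantiles, days in rows) and obs are assumptions, and the helper is a simplification of the procedure used for the figures.

```r
# PIT value of each day: fraction of predicted quantiles below the observation
pit <- mapply(function(y, q) mean(q <= y), obs, asplit(pred_q, 1))
hist(pit, breaks = 10)  # a flat histogram is necessary for reliability
```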

ROC curves
Finally, the "relative operating characteristic" (ROC) curves (Kharin and Zwiers, 2003) can be used to assess the quality of probability forecasts by relating the hit rate (probability of detecting an event which actually occurs) to the corresponding false-alarm rate (probability of detecting an event which does not occur).
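A single point of such a curve can be computed as in the following sketch, where p is an assumed vector of forecast probabilities of the event (e.g. the fraction of members or quantiles above 10 cm) and thr a probability threshold; both names are illustrative, and the full curve is obtained by varying thr between 0 and 1.

```r
# Hit rate and false-alarm rate of the probability forecast p for the event
# "observed HN > 10 cm" at a given probability threshold thr (illustrative)
event    <- obs > 10
forecast <- p >= thr
hit_rate         <- sum(forecast & event)  / sum(event)    # detected occurrences
false_alarm_rate <- sum(forecast & !event) / sum(!event)   # detections of non-events
```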

6 Results
We first discuss the application of the QRF method with regard to the parameters mtry (number of predictors randomly sampled as candidates at each split) and nodesize (minimum number of days in terminal nodes). We then assess the importance of the different predictors: random permutations of a predictor variable X_j are applied in order to verify how well the response Y can still be predicted with this deterioration. When an "important" permuted variable X_j and the unpermuted remaining predictor variables are used to predict the response, the prediction accuracy, quantified here with the sum of squares of the differences between predicted and observed response variables, decreases substantially. The variable importance is the difference in prediction accuracy before and after permuting X_j, and is implemented by the function importance of the package randomForest in R (see, e.g., Louppe et al., 2013, for further details), as shown in the sketch below.
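The following R sketch shows this computation, again with the assumed X_train/Y_train objects; importance = TRUE makes randomForest compute the permutation accuracy decreases during training.

```r
library(randomForest)

# Fit a forest with the settings retained above, keeping importance measures
rf <- randomForest(x = X_train, y = Y_train, ntree = 1000, mtry = 2,
                   nodesize = 10, importance = TRUE)
importance(rf, type = 1)  # type = 1: permutation importance of each predictor X_j
```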
For a 1-day lead time, the Q90 of the forecast snow rate, followed by the Q90 of raw forecasts of HN, are the most important predictors. The most important predictors are directly related to snow quantities, and the role of other meteorological forcings is minor. As the lead time increases, the importance of the snow predictors decreases while the importance of the forecast rain rate gets larger. In particular, the Q90 of the snow and rain rates are the two most important predictors at a 4-day lead time.

- Predictive intervals obtained with the post-processing methods are large and look very similar. Observations generally lie within these intervals (with one major exception at the end of the period).
- When the raw reforecasts are all equal to zero, the EMOS method mechanically predicts zero HN, which is often verified (see, e.g., the 5th, 6th and 11th of March). However, EMOS predicts these zero values with a 100% probability, while QRF predicts small intervals in this example, which avoids failures (i.e. the prediction of a zero value with absolute certainty while a positive HN value is observed). In this example, it happens for two days, on the 7th and 9th of March 2012.

A large observed HN of 40 cm occurred on the 10th of April. The EMOS method completely misses this event, because no snow was present in the raw forecasts. In this case, QRF predicts a large interval, with a 90th percentile around 20 cm, exploiting the raw forecasts of the other meteorological forcings for this day.

(Caption of Figs. 3-4: for the prediction systems, the lower and upper curves represent the 10th and 90th percentiles, respectively; the solid black line represents the time series of the HN observations.)

Figure 5 shows the 2024 CRPS values averaged over the different winter seasons (92 stations × 22 winter seasons) obtained with the raw reforecasts, and with the EMOS and QRF post-processing methods, for a 1-day lead time (left panel). While EMOS gives a considerable gain of performance, it is still outperformed by the QRF method. The right panel quantifies this improvement as a percentage, in terms of relative CRPS. For most of the stations, EMOS shows a degradation of the performances between 20% and 30%, up to 40%, compared to QRF. Results (not shown) are very similar for the other lead times.

Table 2 reports the mean width and the corresponding standard deviation of the predictive intervals (50% and 90% nominal coverages) over all locations and dates, for a 1-day lead time, with the different methods. As indicated above and illustrated in Figs 3-4, the predictive intervals obtained with the raw ensembles are a lot thinner than with EMOS and QRF, but they are underdispersed. The sharpness of the post-processed ensembles is very similar, the mean width being around 2.5 cm for the 50% nominal coverage and around 9 cm for the 90% nominal coverage.

Table 2. Mean and standard deviation (SD) of the width of the predictive intervals (PI, 50% and 90% nominal coverages), with the different methods and for all locations and dates, for a 1-day lead time. The width associated with the 50% coverage corresponds to the difference between the 25th and 75th percentiles, and the width for the 90% coverage corresponds to the difference between the 5th and 95th percentiles.

Figure 6 shows the rank histograms of HN with the raw forecasts and with the EMOS and QRF post-processing methods. As indicated in previous studies (see, e.g., Nousu et al., 2019), raw forecasts are clearly underdispersed, leading to U-shaped rank histograms, and also usually underestimate large HN values (over-representation of the last class). These deficiencies are particularly visible for classes of raw ensemble averages above 10 cm (rows 2-3). The rank histogram obtained with the EMOS method is almost perfectly flat for the small ensemble/observation averages ([0, 10) cm). For larger classes of events, the EMOS predictive distribution seems slightly underdispersed. The QRF method shows better performances than EMOS in that regard, the only limitation being an underestimation of the largest snowfalls (see the last class for HN > 30 cm in the bottom-right plot).
Figure 7 shows the ROC curves for three categories of observed HN values: all snow events (HN greater than 1 cm, 19% of the observed cases), "moderate" snow events (HN greater than 10 cm, 5% of the observed cases) and "rare" snow events (HN greater than 30 cm, 1.4% of the observed cases). Figure 7a shows that the raw forecast ensemble performs almost as well as the post-processed ensembles when all snow events are considered. For this category, the purple curve corresponding to the QRF approach deviates farther away from the no-skill diagonal than the green curve corresponding to the EMOS method, indicating a better skill of the QRF approach. For moderate snow events (Fig. 7b), while QRF and EMOS show similar performances, the ROC curve corresponding to the raw ensembles is close to the diagonal and indicates almost no skill. For rare and intense snow events exceeding 30 cm of fresh snow in one day, Fig. 7c shows a slight gain of performance with the QRF approach compared to EMOS.

Cases corresponding to precipitation phase errors (at least one member with rain in the forecasts but no snow, while a positive HN has been measured, Fig. 8b) represent 1.2% of all cases. Obviously, cases with snow in the forecasts and a positive HN are more frequent (8.9% and 14.7% for cases (c) and (d), respectively). Overall, while QRF outperforms EMOS in all cases (as outlined in Fig. 2), the gain of performance is particularly marked for cases (a) and (b), i.e. when there is no snow in the raw forecasts. In these situations, the other meteorological predictors (rain, temperature, etc.) can be exploited to overcome the limitations of the snow forecasts for the prediction of observed HN (see further discussion in Section 7 below).

7 Discussion

In this paper, we compare the scores of post-processed forecasts of the 24-h height of new snow between two commonly used statistical methods: EMOS and QRF. With this dataset, the added value of QRF is unambiguous, with a general improvement of the CRPS, an improvement of the rank diagrams for severe snowfall events, and a slight improvement of the ROC curves for more common events. The predictors selected by the QRF training clearly suggest that the simulated HN from the Crocus snow cover model is useful but not sufficient to optimize the post-processed forecasts, as the meteorological variables forcing the snow cover model are also selected by the algorithm. The added value coming from these meteorological predictors is the most likely explanation of the improvement obtained between QRF and EMOS. This improvement occurs in various situations, and the physical reason for which the simulated HN does not translate all the predictive power of the meteorological forcings is probably not unique, but can be partly explained by the presence of precipitation phase errors.

Raw forecasts
It must be noted that the EMOS-CSGD model applied in this study only uses forecasts of the variable of interest as predictors. Different EMOS extensions can include more predictors, in particular the boosting extension (Messner et al., 2017).
Schulz and Lerch (2021) compare a gradient-boosting extension of EMOS (EMOS-GB) to many machine learning methods for postprocessing ensemble forecasts of wind gusts, using a truncated logistic distribution. The performances of EMOS-GB and other machine-learning-based postprocessing methods are promising. In particular, the distributional regression network (Rasp and Lerch, 2018) and the Bernstein quantile network (Bremnes, 2020) often outperform all the other methods, including QRF. However, these recent models need to be adapted to HN forecasts, i.e. using a zero-censored distribution with possibly long tails such as the CSGD.

Role of precipitation phase errors in the added value of QRF
The examples selected for illustration suggest that phase errors (or, in other words, errors in the rain-snow transition elevation) are one of the possible explanations for the insufficient predictive power of the simulated HN. Indeed, a number of observed snowfall events are simulated with a zero HN value, sometimes for all members, but with a large precipitation amount. EMOS is not able to assign these days a large probability of positive HN because they are identical to dry days when considering only this predictor, whereas the other predictors considered by QRF (total precipitation, air temperature) can help to discriminate the days with a phase error but with forecast precipitation and relatively cold conditions from dry days or warm days. This assumption is difficult to generalize statistically due to the large variety of situations, i.e. errors in precipitation phase often concern only a part of the total duration of a snowfall event and/or a part of the simulation members. Nevertheless, our classification of the CRPS depending on rainfall and snowfall occurrence shows a systematic improvement of the CRPS by QRF for the cases where an error in the rain-snow transition elevation is the most obvious (e.g. observed snowfall with simulated rainfall but no simulated snowfall during the whole day for all members, Fig. 8b).
The sensitivity of snow cover models to errors in precipitation phase was already illustrated by Jennings and Molotch (2019), and this issue is all the more critical when the meteorological forcing comes from NWP forecasts. The reduction of phase errors in atmospheric modelling is beyond the scope of this paper.
However, a way to improve post-processed forecasts might be opened by considering predictors more directly related to this phase issue. In particular, interviews of operational weather forecasters show that expert HN forecasts strongly rely on the 1 °C isothermal level in terms of pseudo-adiabatic wet-bulb potential temperature (θ_w). Unfortunately, this diagnostic was not available in the PEARP reforecast, but this feedback encourages future reforecast productions to include this additional diagnostic, as the post-processing might be able to more directly account for phase errors with such a predictor. More simply, considering the surface wet-bulb temperature is also increasingly done in land surface modelling for phase discrimination (Wang et al., 2019), and it may also be an easier alternative predictor for statistical post-processing, although the information content of the simulated atmospheric column is probably better summarized by the pseudo-adiabatic wet-bulb potential temperature iso-θ_w (WMO, 1973). Nevertheless, forecasters also mention that a common limitation of NWP models is their inability to simulate the unusually thick 0 °C isothermal layers encountered in some intense storms, up to 1000 m. The complex interactions between the processes involved in this phenomenon are only partly understood (latent cooling from melting precipitation and evaporation/sublimation, melting distance of snowflakes, adiabatic cooling of rising air, specific topographies, blocked cold air pockets, etc.; Minder et al., 2011; Minder and Kingsmill, 2013). In these specific cases, even the level θ_w = 1 °C is considered to be a poor predictor of the rain-snow transition elevation. These situations are often the most critical in terms of impacts (wet snow at low elevations affecting the roads and the electrical network), but their very low frequency will remain a severe challenge even with statistical post-processing.

Limitations for operational perspectives
In order to investigate the potential of the statistical methods themselves regardless of the constraints on the available datasets, the post-processing methods were calibrated and evaluated here on a long reforecast that is fully consistent between training and validation (which is generally not the case for real-time operational forecasts). Therefore, despite the large added value of QRF compared to EMOS with consistent and homogeneous datasets for calibration and evaluation, its practical implementation in real-time operational forecasting products is still a challenge, because reforecasts strictly identical to operational configurations are often not available. Time-adaptive training based on operational systems is an alternative to favour the homogeneity of the dataset. Although new theories are emerging to face the challenge of model evolutions (Demaeyer and Vannitsem, 2020), several consistent recent studies show that the length of the calibration period is more critical than the strict homogeneity of datasets to forecast rare events (Lang et al., 2020; Hess, 2020). In the case of HN forecasts from EMOS (Nousu et al., 2019), even a 4-year calibration period was detrimental to the reliability of severe snowfall events compared to a longer heterogeneous reforecast. However, Taillardat and Mestre (2020) managed to successfully implement QRF in real-time forecasting products of hourly precipitation using a calibration limited to 2 years of operational forecasts, because they adapt the distribution tail with a parametric method. Producing reforecasts more homogeneous with operational forecasts is still one of the most promising solutions to improve the forecast probabilities of severe events, but the skill of NWP systems evolves with the available data to assimilate, so that this heterogeneity will never be completely removed. Therefore, the robustness of post-processing algorithms when transferred to operational datasets, or their efficiency when calibrated with shorter datasets, will always remain more important criteria than their theoretical added value with perfect and long datasets. This is therefore a major point to consider to transpose the advances of this paper towards operational automatic HN forecasts.
Code and data availability. The R code used for the application of the EMOS approach is based on different scripts originally developed

Appendix A: Ensemble model output statistics for post-processing of ensemble forecasts of the daily HN

A1 Censored shifted gamma regression

Here, the censored shifted gamma distribution (CSGD) is used to represent the predictive distribution of daily HN forecasts. Its CDF is defined as:

$$F_{k,\theta,\delta}(y) = \begin{cases} G_k\!\left(\dfrac{y - \delta}{\theta}\right), & y \ge 0, \\ 0, & y < 0, \end{cases} \qquad (A1)$$
where $k$, $\theta$ and $\delta$ are shape, scale and shift parameters, respectively, and $G_k$ is the CDF of a standard gamma distribution with unit scale and shape parameter $k$. The shape parameter $k$ and scale parameter $\theta$ are directly related to the mean $\mu$ and the standard deviation $\sigma$ of the gamma distribution through the relations $\mu = k\theta$ and $\sigma^2 = k\theta^2$. Scheuerer and Hamill (2018) propose a non-homogeneous regression based on the CSGD which makes use of a CSGD representing the climatology of past observations. For a given day, the parameters $\mu$, $\sigma$ and $\delta$ of the predictive CSGD are related to the climatology and to the raw forecast ensemble with the following expressions (Scheuerer and Hamill, 2018, Section 3.a.):

$$\mu = \frac{\mu_{\mathrm{cl}}}{\alpha_1}\,\mathrm{log1p}\!\left[\mathrm{expm1}(\alpha_1)\left(\alpha_2 + \alpha_3\,\mathrm{POP} + \alpha_4\,\bar{x}\right)\right], \qquad (A2)$$

$$\sigma = \beta_1\,\sigma_{\mathrm{cl}}\,\sqrt{\mu/\mu_{\mathrm{cl}}} + \beta_2\,\mathrm{MD}, \qquad (A3)$$

where $\mathrm{log1p}(u) = \log(1 + u)$ and $\mathrm{expm1}(u) = \exp(u) - 1$. The shift parameter $\delta$ is fixed at its climatological value $\delta_{\mathrm{cl}}$. This regression model only employs the statistical properties of the HN ensemble forecasts, summarized by its ensemble mean $\bar{x}$, the probability of having a positive value POP, and the ensemble mean difference MD (a metric of ensemble spread), as defined by the following equations:

$$\bar{x} = \frac{1}{M}\sum_{m=1}^{M} x_m, \qquad \mathrm{POP} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{x_m > 0}, \qquad \mathrm{MD} = \frac{1}{M^2}\sum_{m=1}^{M}\sum_{m'=1}^{M} \left| x_m - x_{m'} \right|, \qquad (A4)$$

with $x_m$ the raw HN forecast of each member $m$ among the $M$ members, and $\mathbb{I}_{x_m>0} = 1$ if $x_m > 0$, and 0 otherwise.

A2 Parameter estimation
For each station, the 6 parameters {α_1, α_2, α_3, α_4, β_1, β_2} in Eqs. (A2)-(A4) are estimated by optimizing the CRPS prediction skill on the training dataset. As the CRPS can be directly expressed in the case of a CSGD (when F_i = G_{k,θ,δ}), this score can be easily minimized for this EMOS model. Complete expressions of the CRPS and details about model fitting are given in Scheuerer and Hamill (2015).
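The following R sketch illustrates the principle of this estimation under simplifying assumptions: the CRPS is integrated numerically on a grid instead of using the closed-form expression of Scheuerer and Hamill (2015), and all object names (xbar, pop, md, mu_cl, sigma_cl, delta_cl) are hypothetical.

```r
# CDF of the censored shifted gamma distribution (Eq. A1)
pcsgd <- function(y, k, theta, delta)
  ifelse(y < 0, 0, pgamma((y - delta) / theta, shape = k))

# Numerical CRPS of one CSGD forecast against one observation y: a simplified
# stand-in for the closed-form CRPS of Scheuerer and Hamill (2015)
crps_csgd <- function(y, k, theta, delta, grid = seq(0, 200, by = 0.1)) {
  sum((pcsgd(grid, k, theta, delta) - (grid >= y))^2) * 0.1
}

# Mean CRPS over the training days as a function of the 6 parameters;
# xbar, pop, md, y and the climatological values are assumed vectors/scalars
obj <- function(par, xbar, pop, md, y, mu_cl, sigma_cl, delta_cl) {
  a <- par[1:4]; b <- par[5:6]
  mu    <- mu_cl / a[1] * log1p(expm1(a[1]) * (a[2] + a[3] * pop + a[4] * xbar)) # Eq. (A2)
  sigma <- b[1] * sigma_cl * sqrt(mu / mu_cl) + b[2] * md                        # Eq. (A3)
  k <- (mu / sigma)^2; theta <- sigma^2 / mu   # moments of the gamma part
  mean(mapply(crps_csgd, y, k, theta, MoreArgs = list(delta = delta_cl)))
}

# fit <- optim(par = c(0.5, 0.1, 0.5, 0.5, 1, 0.5), fn = obj, xbar = xbar,
#              pop = pop, md = md, y = y, mu_cl = mu_cl,
#              sigma_cl = sigma_cl, delta_cl = delta_cl)
```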