Improving the Potential Accuracy and Usability of EURO-CORDEX Estimates of Future Rainfall Climate using Mean Squared Error Model Averaging

Probabilities of future climate states can be estimated by fitting distributions to the members of an ensemble of climate model projections. The change in the ensemble mean can be used as an estimate of the unknown change in the mean of the distribution of the climate variable being predicted. However, the level of sampling uncertainty around the change in the ensemble mean varies from case to case and in some cases is large. We compare two model averaging methods that take the uncertainty in the change in the ensemble mean into account in the distribution fitting process. They both involve fitting distributions to the ensemble using an uncertainty-adjusted value for the ensemble mean in an attempt to increase predictive skill relative to using the unadjusted ensemble mean. We use the two methods to make projections of future rainfall based on a large dataset of high resolution EURO-CORDEX simulations for different seasons, rainfall variables, RCPs and points in time. Cross-validation within the ensemble using both point and probabilistic validation methods shows that in most cases predictions based on the adjusted ensemble means show higher potential accuracy than those based on the unadjusted ensemble mean. They also perform better than predictions based on conventional Akaike model averaging and statistical testing. The adjustments to the ensemble mean vary continuously between situations that are statistically significant and those that are not. Of the two methods we test, one is very simple, and the other is more complex and involves averaging using a Bayesian posterior. The simpler method performs nearly as well as the more complex method.

If required, distributions can then be fitted to these ensembles to produce probabilistic predictions. The probabilities in these predictions are conditional probabilities and depend on the assumptions behind the climate model projections, such as the choice of RCP (Moss, et al., 2010; Meinshausen, et al., 2011), and the choice of models and model resolution. Converting climate projection ensembles to probabilities in this way is helpful for those applications of climate projections for which the impact models can ingest probabilities more easily than they can ingest individual ensemble members. An example would be the catastrophe models used in the insurance industry, which quantify climate risk using simulated natural catastrophes embedded in many tens of thousands of simulated versions of one year (Kaczmarska, et al., 2018; Sassi, et al., 2019). Methodologies have been developed by which these catastrophe model ensembles can be adjusted to include climate change, based on probabilities derived from climate projections.
We will consider the case in which distributions are fitted to changes in climate model output, rather than to absolute values.
When fitting distributions to changes in climate model output, the change in the ensemble mean can be used as an estimate of the unknown change in the mean of the distribution of the real future climate. Model weights or bias corrections may be included at this point (Sanderson, et al. (2015b), Knutti, et al. (2017), Chen, et al. (2019)). However, because climate model ensembles are finite in size, the ensemble mean suffers from estimation uncertainty. A number of studies have investigated the various uncertainties in climate ensembles, including estimation uncertainty (such as Deser, et al. (2010), Thompson, et al. (2015)) and various methods have been developed for the post-processing of ensembles to help understand these uncertainties and take them into account, addressing issues such as how to break the uncertainty into components (Hawkins and Sutton (2009), Yip, et al. (2011), Hingray and Said (2014)), how to identify forced signals given the uncertainty (Frankcombe, et al. (2015), Sippel, et al. (2019), Barnes, et al. (2019) and Wills, et al. (2020)), and how quickly signals emerge from the noise given the uncertainty (Hawkins and Sutton (2012), Lehner, et al. (2017)).
In this article, we explore some of the implications of estimation uncertainty around the ensemble mean in more detail.
Ensemble mean estimation uncertainty varies by season, variable, projection, time and location. In the worst cases, the uncertainty may be larger than the change in the ensemble mean itself, and this makes the change in the ensemble mean, and distributions that have been fitted to the ensemble, potentially misleading. In these large uncertainty cases the change in the ensemble mean is dominated by the randomness of internal variability from the individual ensemble members, and it would be unfortunate if this randomness was allowed to influence adaptation decisions. A standard approach used to manage varying uncertainty in the change in the ensemble mean is to consider statistical significance of the changes (e.g., see shading of regions of statistical significance in climate reports such as the EEA report (European Environment Agency, 2017) or the IPCC 2014 report (Pachauri & Meyer, 2014)). Statistical significance testing involves calculating the signal-to-noise ratio (SNR) of the change in the ensemble mean, where the signal is the ensemble mean change, and the noise is the standard error of the ensemble mean. The SNR is then compared with a threshold value. If the SNR is greater than the threshold then the signal is declared statistically significant (Wilks, 2011).

https://doi.org/10.5194/npg-2021-12. Preprint. Discussion started: 19 March 2021. © Author(s) 2021. CC BY 4.0 License.
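As a concrete sketch of this significance-testing procedure (an illustration of ours, not code from the paper), the SNR and the significance decision for a single location might be computed as follows, assuming the ensemble changes are held in a NumPy array:

```python
import numpy as np

def snr(changes):
    """Signal-to-noise ratio of the ensemble mean change:
    signal = ensemble mean change, noise = its standard error."""
    changes = np.asarray(changes, dtype=float)
    n = len(changes)
    signal = abs(changes.mean())
    noise = changes.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    return signal / noise

def is_significant(changes, threshold=2.262):
    """Declare the change significant if the SNR exceeds the threshold.
    2.262 is the two-tailed 95% t value for df = 9 (ensemble of 10)."""
    return snr(changes) > threshold
```

The default threshold matches the ensemble size of 10 used later in the paper; for other ensemble sizes the appropriate t value would differ.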
Use of statistical significance to filter climate projections in this way is appropriate for scientific discovery. However, it is perhaps less appropriate in some practical applications of climate model projections. There are a number of reasons for this, which can be illustrated by considering the shortcomings of a system which applies statistical testing and sets regions of non-significant change in the ensemble mean to zero. The first problem with such a system is that analysis of the properties of predictions made using statistical testing shows that they have poor predictive skill. This is not surprising, since statistical testing was never designed as a methodology for creating predictions. The second problem is that statistical testing creates abrupt jumps of the climate change signal in space, between significant and non-significant regions, and between different RCPs and time points. These jumps are artefacts of the use of a method with a threshold, rather than an aspect of the climate change signal itself. This may lead to situations in which one location is reported to be affected by climate change, and an adjacent location not, simply because the significance level has shifted from e.g., 95.1% to 94.9%. From a practical perspective this may undermine the credibility of climate predictions in the perception of users, to whom no reasonable physical explanation can be given for such features of the projections. Finally, the almost universal use of a 95% significance threshold creates a skew towards avoiding false positives (type I errors) at the expense of false negatives (type II errors).
Depending on the application, this may not be appropriate. Accepting more false negatives in this way is particularly a problem for risk modelling, since risk models should attempt to capture all possibilities in some way, even those of low significance.

How, then, should those who wish to make practical application of climate model ensembles deal with the issue of varying uncertainty in the change in the ensemble mean, as captured by spatial variation of the SNR, in cases where the uncertainty is large but statistical testing is not appropriate? We describe and compare two model averaging procedures as possible answers to this question. The procedures work by using a bias-variance trade-off argument to reduce the change captured by the ensemble mean when it is uncertain. They are based on standard statistical ideas related to parameter shrinkage as a way of improving prediction performance when fitting distributions (see, e.g., Copas (1983)). We will call the methods Mean-squared-error Model Averaging (MMA). The averaging in MMA consists of averaging of the usual estimate for the mean with an alternative estimate of the change in the mean of zero. The averaging weights in MMA depend on the SNR and are designed to minimise the predictive root-mean-squared-error (PRMSE) of the adjusted ensemble mean. One of the two MMA procedures we describe uses a simple plug-in estimator, and we refer to this method as Simple MMA (SMMA). The other procedure involves integration over a Bayesian posterior, and we refer to this method as Bayesian MMA (BMMA). In regions where the SNR is large these methods make no material difference. In regions where the SNR is small, the changes in the ensemble mean are reduced by MMA, in accordance with the theory we present below, in such a way as to increase the accuracy of the predictions.
The MMA methods can be considered as continuous analogues of statistical testing, in which rather than setting the change in the ensemble mean to either 100% or 0% of the original value, we allow a continuous reduction that can take any value between 100% and 0% depending on the SNR. As a result, both methods avoid the abrupt jumps created by statistical testing.
We illustrate and test the SMMA and BMMA methods using a large dataset of high-resolution EURO-CORDEX ensemble projections of rainfall over Europe. We consider four seasons, three rainfall variables, two RCPs and three future time periods, giving 72 cases in all. In section 2 we describe the EURO-CORDEX data we will use. In section 3 we describe the MMA procedures, and present some results based on simulated data which elucidate the situations in which MMA is likely to perform well versus other methods, for both point and probabilistic predictions. In section 4 we present results for one of the 72 cases in detail. We use cross-validation within the ensemble to evaluate the potential prediction skill of MMA, again for both point and probabilistic predictions, and compare with the skill from using the unadjusted ensemble mean, statistical testing and a conventional model averaging scheme (small-sample Akaike Information Criterion model averaging (AICc) (Burnham & Anderson, 2010)). In section 5 we present aggregate results for all 72 cases using the same methods. In section 6 we summarize and conclude.

Data and Methodology
The data we use for our study is extracted from the data archive produced by the EURO-CORDEX program (Jacob, Petersen, & authors, 2014; Jacob, Teichmann, & authors, 2020), in which a number of different global climate model simulations were downscaled over Europe using regional models at 0.11-degree resolution (about 12 km). We use data from 10 models, each of which is a different combination of a global climate model and a regional climate model. The models are listed in Table 1. Further information on EURO-CORDEX and the models is given in the guidance report (Benestad, et al., 2017). We extract data for four meteorological seasons (DJF, MAM, JJA, SON), for three aspects of rainfall: changes in the total rainfall (RTOT), the 95th percentile of daily rainfall (R95) and the 99th percentile of daily rainfall (R99). We say 'rainfall' even though in some locations we may be including other kinds of precipitation. We consider two RCPs, RCP4.5 and RCP8.5, and three future time periods. As an initial illustration, Fig. 1 shows the case of winter (DJF) changes in RTOT under RCP4.5 for 2011-2040. This example was chosen as the first in the database, rather than for any particular properties it may possess. Figure 1a shows the ensemble mean change $\hat{\mu}$ (the mean change calculated from the 10 models in the ensemble) and Fig. 1b shows the standard deviation of the change $\hat{\sigma}$ (the standard deviation of the changes calculated from the 10 models in the ensemble). Fig. 1c shows the estimated SNR $\hat{S}$ calculated from the ensemble mean change and the standard deviation of change using the expression $\hat{S} = n^{1/2}|\hat{\mu}|/\hat{\sigma}$, where the $n^{1/2}$ term in this equation converts the standard deviation of change (a measure of the spread of the changes across the ensemble) to the standard error of the ensemble mean change (a measure of the uncertainty around the ensemble mean change). Finally, Fig. 1d shows the regions in which the changes in the ensemble mean are significant at the 95% level, assuming normally distributed changes.
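The four fields shown in Fig. 1 can be sketched as follows for a hypothetical array of ensemble changes with shape (n_models, ny, nx); the function and variable names are ours, not from any EURO-CORDEX tooling:

```python
import numpy as np

def ensemble_fields(changes, sig_threshold=2.262):
    """Per-grid-point ensemble mean change, standard deviation of change,
    SNR = sqrt(n)|mean|/std, and a 95% significance mask.
    `changes` has shape (n_models, ny, nx); the default threshold is the
    two-tailed 95% t value for df = 9, matching n_models = 10."""
    n = changes.shape[0]
    mu_hat = changes.mean(axis=0)
    sigma_hat = changes.std(axis=0, ddof=1)
    snr_hat = np.sqrt(n) * np.abs(mu_hat) / sigma_hat
    significant = snr_hat > sig_threshold
    return mu_hat, sigma_hat, snr_hat, significant
```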
We see that the ensemble mean change varies considerably in space, with notable increases in RTOT in the west of Ireland, the west of Great Britain, and in parts of France, Germany, Spain and Portugal. The standard deviation of change also varies considerably, with the largest values over Portugal, parts of Spain and the Alps. The SNR shows that many of the changes in the west of Ireland, and in France and Germany, have particularly high SNRs, while the changes in Southern Europe (Portugal, Spain, Italy and Greece) have lower values. Accordingly, the changes are statistically significant throughout most of Ireland, Great Britain, France, Germany, and Eastern Europe, but are mostly not statistically significant in Southern Europe. The other 71 cases show similar levels of variability of these four fields, but with different spatial patterns.

Figure 2 shows the spatial mean SNR for all 72 cases, sub-divided in four ways, with horizontal lines showing the mean in each sub-category. Figure 2a sub-divides by season: we see that there is a clear gradient from winter (DJF), which shows the highest values of the spatial mean SNR, to autumn (SON), which shows the lowest values of spatial mean SNR. Fig. 2b sub-divides by rainfall variable: in this case there is no obvious impact on the SNR values. Fig. 2c sub-divides by RCP.
RCP8.5 shows higher SNR values, as we might expect, since in the later years RCP8.5 is based on larger changes in external forcing. Fig. 2d sub-divides by time-period: there is a strong gradient in SNR from the first of the three time-periods to the last. This is also as expected since both RCP scenarios are based on increasing external forcing with time. We would expect these varying SNRs to influence the results from the MMA methods. This will be explored in the results we present below.

Model Averaging Methodologies
The model averaging methodologies we apply are based on standard bias-variance trade-off arguments and are used to average together uncertain projections of change with projections of no change, in such a way as to try and improve predictive skill. The derivations of the methods follow standard mathematical arguments and proceed as follows.

Assumptions
For each location within each of the 72 cases, we first make some assumptions about the variability of the climate model results, the variability of future reality, and the relationship between the climate model ensemble and future reality. All quantities are considered as changes from the 1981-2010 baseline. We assume that the actual future value is a sample from a distribution with unknown mean $\mu_r$ and variance $\sigma_r^2$. We assume that the $n$ climate model values are independent samples from a distribution with unknown mean $\mu_m$ and variance $\sigma_m^2$. For the BMMA method we will additionally assume that these distributions are normal distributions. With regards to the assumption of independence of samples, this is an approximation, since the models are not entirely independent. Issues related to model dependence and independence have been discussed in, for example, Knutti, et al. (2010) and DelSole, et al. (2013). We assume that methods to address model dependence have been applied before applying MMA. In terms of how the climate models and reality relate to each other, we assume that the climate model ensemble is realistic in the sense that the mean and variance parameters agree, and so $\mu_m = \mu_r = \mu$ and $\sigma_m^2 = \sigma_r^2 = \sigma^2$. This is a "perfect ensemble" assumption: the models themselves are not perfect, but the distribution they are sampled from perfectly accounts for their biases. This is not likely to be strictly correct, and real climate model ensembles do contain errors and biases, but it is a useful working assumption. We will write the future climate state as $x$, and the ensemble mean, estimated in the usual way from the ensemble, as $\hat{\mu}$. Since the usual estimator for the mean is unbiased, we can then say:

$E(\hat{\mu}) = \mu.$ (1)

If we write the ensemble variance, estimated using the usual unbiased estimator, as $\hat{\sigma}^2$, then we can say:

$E(\hat{\sigma}^2) = \sigma^2.$ (2)

Uncertainty around the estimate of the ensemble mean is given by:

$\mathrm{var}(\hat{\mu}) = \sigma^2/n.$ (3)

The Simple Mean-squared-error Model Averaging (SMMA) Methodology
In the SMMA method we make a new prediction of future climate in which we adjust the ensemble mean change using a multiplicative factor $k$. $k$ is an averaging weight such that the weight on the ensemble mean is $k$ and the weight on a change of zero is $1-k$. We write the new prediction $\hat{\mu}_k$ as:

$\hat{\mu}_k = k\hat{\mu},$

where the factor $k$, for which we derive an expression below, varies from 0 to 1 as a function of all the parameters of the prediction: season, variable, RCP, time-period and spatial location. The intuitive idea behind this prediction is that if for one particular set of prediction parameters the SNR in the ensemble is large, and the ensemble mean prediction $\hat{\mu}$ is statistically significant, then it makes sense to use the ensemble mean more or less as is, and $k$ should be close to 1. On the other hand, if the SNR in the change in the ensemble mean is small, and the change in the ensemble mean is far from statistically significant, then perhaps it is better to use a value of $k$ closer to zero. Statistical testing sets $k$ to either 1 or 0 depending on whether the change is significant or not: the SMMA method (and the BMMA method described later) allow it to vary continuously from 1 to 0.
The ensemble mean is the unique value that minimises MSE within the ensemble. However, when considering applications of ensembles, it is generally more appropriate to consider out of sample, or predictive, MSE (PMSE). We can calculate the statistical properties of the prediction errors for the prediction $\hat{\mu}_k$, and the PMSE, as follows:

$\mathrm{bias} = E(\hat{\mu}_k - x) = (k-1)\mu,$

$\mathrm{var}(\hat{\mu}_k - x) = k^2\sigma^2/n + \sigma^2,$

$\mathrm{PMSE}(k) = E\left[(\hat{\mu}_k - x)^2\right] = (1-k)^2\mu^2 + \sigma^2\left(1 + k^2/n\right).$

From the above equations we see that for $k=0$ the bias of the prediction $\hat{\mu}_k$ is $-\mu$ and the variance is $\sigma^2$, giving a PMSE of $\mu^2 + \sigma^2$. For $k=1$ the bias is 0 and the variance is $\sigma^2(1 + 1/n)$, giving a PMSE equal to the variance. We now seek to find the value of $k$ that minimizes the PMSE. The derivative of the PMSE with respect to $k$ is given by

$\frac{d\,\mathrm{PMSE}}{dk} = -2(1-k)\mu^2 + 2k\sigma^2/n.$

From this we find that the PMSE has a minimum at

$k = \frac{S^2}{1+S^2},$ (10)

where $S$ is the SNR $= n^{1/2}|\mu|/\sigma$. Equation (10) shows that the value of $k$ at the minimum always lies in the interval $[0,1]$.
We see from the above derivation that there is a value of $k$ between 0 and 1 which gives a lower PMSE than either the prediction for no change ($k=0$) or the unadjusted ensemble mean ($k=1$). Relative to the ensemble mean, the prediction based on this optimal value of $k$ has a higher bias, but a lower variance, which is why we refer to it as a bias-variance trade-off: in the expression for PMSE we have increased the bias squared term, in return for a bigger reduction in the variance term. The PMSE of this prediction is lower than the PMSE of the prediction based on the ensemble mean because of the reduction in the term $k^2\sigma^2/n$, which represents the contribution to PMSE of the estimation error of the ensemble mean. For an infinite size ensemble, this term would be zero, and the optimal value of $k$ would be 1. We can therefore see the prediction $\hat{\mu}_k$ as a small-sample correction to the ensemble mean, which compensates for the fact that the ensemble mean is partly affected by the variability across a finite ensemble.
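The location of this minimum is easy to check numerically. The sketch below evaluates the PMSE expression on a grid for illustrative parameter values of our choosing ($\mu = 2$, $\sigma = 1$, $n = 10$) and confirms that the grid minimum agrees with the closed-form optimum of Eq. (10):

```python
import numpy as np

# Illustrative parameter values (our choice, not from the paper).
mu, sigma, n = 2.0, 1.0, 10

# PMSE(k) = (1-k)^2 mu^2 + sigma^2 (1 + k^2/n), evaluated on a fine grid.
k = np.linspace(0.0, 1.0, 100001)
pmse = (1 - k) ** 2 * mu**2 + sigma**2 * (1 + k**2 / n)

snr = np.sqrt(n) * abs(mu) / sigma   # true SNR
k_opt = snr**2 / (1 + snr**2)        # closed-form minimiser, Eq. (10)

assert abs(k[np.argmin(pmse)] - k_opt) < 1e-3
```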
If we could determine the optimal value of $k$ then we could, without fail, produce predictions that would have a lower PMSE than the ensemble mean. However, the expression for $k$ given above depends on two unknown quantities, $\mu^2$ and $\sigma^2$, and the best we can do is to attempt to estimate $k$ based on the information we have. The most obvious estimator is that formed by simply plugging in the observed equivalents of $\mu^2$ and $\sigma^2$, calculated from the ensemble, which are $\hat{\mu}^2$ and $\hat{\sigma}^2$, giving the plug-in estimate for $k$:

$\hat{k} = \frac{\hat{S}^2}{1+\hat{S}^2}, \qquad \hat{S} = n^{1/2}|\hat{\mu}|/\hat{\sigma}.$

This is the estimate of $k$ that we will use in the SMMA method.
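The resulting SMMA point prediction can be sketched in a few lines; the function and variable names are ours:

```python
import numpy as np

def smma(changes):
    """Plug-in SMMA: shrink the ensemble mean change towards zero by
    k_hat = S_hat^2 / (1 + S_hat^2), S_hat = sqrt(n)|mu_hat|/sigma_hat.
    Returns the adjusted prediction and the weight k_hat."""
    changes = np.asarray(changes, dtype=float)
    n = len(changes)
    mu_hat = changes.mean()
    sigma_hat = changes.std(ddof=1)
    s_hat = np.sqrt(n) * abs(mu_hat) / sigma_hat
    k_hat = s_hat**2 / (1.0 + s_hat**2)
    return k_hat * mu_hat, k_hat
```

Since $\hat{k}$ always lies strictly below 1, the prediction is always shrunk at least slightly towards zero, with the shrinkage strongest when the estimated SNR is small.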

Relation to Statistical Significance 225
We can relate the value of $\hat{k}$ to the threshold for statistical significance, since statistical testing for changes in the mean of a normal distribution also uses the observed SNR, in which context it is known as the t-statistic. For a sample of size 10, two-tail significance at the 95% confidence level is achieved by signals with a SNR value of 2.262 or greater. This corresponds to a $\hat{k}$ value of 0.837. All situations with a $\hat{k}$ value greater than this are therefore statistically significant at the 95% level.
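This correspondence is easy to verify numerically:

```python
# SNR at the two-tailed 95% significance threshold for n = 10 (df = 9),
# converted to the corresponding SMMA weight k = S^2 / (1 + S^2).
s = 2.262
k = s**2 / (1 + s**2)
assert round(k, 3) == 0.837
```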

Generation of Probabilistic Predictions 230
Applying SMMA to a climate projection adjusts the mean. By making an assumption about the shape of the distribution of uncertainty, we can also derive a corresponding probabilistic forecast, as follows. We will assume that the distribution of uncertainty, for given values of the estimated mean and variance $\hat{\mu}$ and $\hat{\sigma}^2$, is a normal distribution. For the unadjusted ensemble mean, an appropriate predictive distribution can be derived using standard Bayesian methods, which widen and change the predictive distribution so as to take account of parameter uncertainty on the estimates of $\hat{\mu}$ and $\hat{\sigma}^2$. Bayesian methods require priors, and sometimes the choice of prior is difficult, but the normal distribution is one of the few statistical models that have a unique objective prior that is relevant in the context of making predictions (see, for example, Lee (1997)).
This prior has a number of attractive properties, including that the resulting predictions match with confidence limits. The predictions based on this prior are t distributions, in which the location parameter is given by the usual estimate for the mean, $\hat{\mu}$, the square of the scale parameter (which is not the variance for the t distribution) is given by the usual unbiased estimate for the variance, $\hat{\sigma}^2$, and the number of degrees of freedom is the ensemble size minus 1. This formulation gives us probabilistic predictions based on the unadjusted ensemble mean. We then modify it to create probabilistic predictions based on the MMA-adjusted ensemble means: the distribution remains a t distribution, the location parameter is given by the adjusted ensemble mean, and the scale parameter and degrees of freedom are unchanged.
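The predictive distribution described above can be sketched by returning its three parameters. We follow the paper's formulation, in which the squared scale is the unbiased variance estimate; setting the weight to 1 recovers the unadjusted case. The function name and signature are ours:

```python
import numpy as np

def predictive_t_params(changes, k_hat=1.0):
    """Parameters of the t predictive distribution: location is the
    (possibly MMA-adjusted) ensemble mean, scale^2 is the unbiased
    variance estimate, and df is the ensemble size minus 1."""
    changes = np.asarray(changes, dtype=float)
    n = len(changes)
    loc = k_hat * changes.mean()
    scale = changes.std(ddof=1)
    df = n - 1
    return loc, scale, df
```

These parameters could then be passed to any t-distribution implementation to evaluate quantiles or densities.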

Bayesian Mean-squared-error Model Averaging (BMMA) Methodology
The BMMA method is derived as an extension of the SMMA method as follows. Since the prediction in the SMMA method $\hat{\mu}_{\hat{k}}$ depends on $\hat{\mu}$ and $\hat{k}$, and $\hat{k}$ depends on $\hat{\mu}$ and $\hat{\sigma}$, we see that the prediction is affected by parameter estimation uncertainty on $\hat{\mu}$ and $\hat{\sigma}$. Since we only have 10 ensemble members with which to estimate these parameters, this uncertainty is large. Different values of $\mu$ and $\sigma$ from within the range of parameter estimation uncertainty would lead to different values of $\hat{k}$. We take a Bayesian approach to managing this parameter uncertainty and calculate the average value of $\hat{\mu}_{\hat{k}}$ across all possible parameter values weighted by their probability densities $p(\mu, \sigma)$. To calculate the probability densities we use the same objective prior as was used in Sect. 3.2.2 above. To calculate the average we simulate 250 independent pairs of values of $\mu$ and $\sigma$ for each location for each case, calculate the 250 corresponding values of $\hat{\mu}_{\hat{k}}$, and average them together to create our final prediction, which we write as $\hat{\mu}_B$. For purposes of comparison with the SMMA method we can then reverse-engineer an effective value of $k$, given by $\hat{k}_B = \hat{\mu}_B/\hat{\mu}$.
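The BMMA calculation can be sketched as follows. We assume the standard objective (Jeffreys-type) posterior for a normal sample, in which $\sigma^2$ is drawn from a scaled inverse chi-squared distribution and $\mu \mid \sigma$ from a normal; this is one concrete reading of the objective prior the paper refers to, and the function name is ours:

```python
import numpy as np

def bmma(changes, n_draws=250, seed=0):
    """Sketch of BMMA: draw (mu, sigma) pairs from the objective
    posterior, compute the SMMA weight k for each draw, and average
    the shrunk predictions k * mu_hat. Returns the prediction and the
    effective weight k_B = prediction / mu_hat."""
    rng = np.random.default_rng(seed)
    x = np.asarray(changes, dtype=float)
    n = len(x)
    mu_hat, s2 = x.mean(), x.var(ddof=1)
    # sigma^2 | data ~ (n-1) s^2 / chi^2_{n-1}
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    # mu | sigma^2, data ~ N(mu_hat, sigma^2 / n)
    mu = rng.normal(mu_hat, np.sqrt(sigma2 / n))
    snr2 = n * mu**2 / sigma2
    k = snr2 / (1.0 + snr2)
    k_eff = k.mean()
    return k_eff * mu_hat, k_eff
```

Because each drawn weight lies in (0, 1), the effective weight does too, so BMMA, like SMMA, always shrinks the ensemble mean towards zero.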

Simulation results
Given that $\hat{\mu}$ and $\hat{\sigma}$ are only estimated, there is no guarantee that the predictions from the SMMA and BMMA methods will actually have a lower PMSE than the ensemble mean, in spite of the derivation which implies that they should. Before we test SMMA and BMMA on actual climate model output, we can explore whether they may or may not give better predictions using simulations, as follows. We vary a SNR parameter from 0 to 7, in 100 steps. For each value, we simulate 1 million synthetic ensembles, each of 10 points from a normal distribution. For each ensemble we apply the two MMA methodologies, statistical testing and conventional AICc model averaging and compare the resulting predictions with the underlying known mean, which we know in this case because these are ensembles we have generated ourselves. We then calculate the PRMSE of each method relative to the PRMSE of the ensemble mean. Results are shown in Fig. 3a. The horizontal line shows the performance of the unadjusted ensemble mean, which is constant with SNR, and which is determined simply by the variance of the variable being predicted and the parameter uncertainty on the ensemble mean. Statistical testing performs poorly at intermediate SNR values, where the significance threshold has to be crossed before the ensemble mean is used. The green long-dashed line shows the performance of AICc model averaging, which shows results in between statistical testing and the MMA methods. Comparing the four methods, we see there is a trade-off whereby those methods that perform best for small and large SNR values perform the least well for intermediate values. The spatial average performance on a real data-set will then depend on the range of SNR values in that dataset. Although this graph gives us insight into the performance of the various methods, and suggests that, depending on the range of actual SNR values, they may all perform better than the ensemble mean in some cases, it cannot be used as a look-up table to determine which of the methods to use.
This is because the results are shown as a function of the actual SNR value (as opposed to the estimated SNR value), and in real cases this actual value is unknown. We can also use simulations to test whether MMA gives better probabilistic predictions. Fig. 3b follows Fig. 3a, but now shows validation of probabilistic predictions using Predictive Mean Log-Likelihood (PMLL), which evaluates the ability of a prediction method to give reasonable probabilities across the whole probability distribution. We show the PMLL values as minus one times PMLL, to highlight the similarities between the results in panels (a) and (b); panel (b) shows results for the ensemble mean, SMMA and BMMA. We see that the pattern of change in PMLL from using the two MMA methods is almost identical to the pattern of change in PRMSE: for small values of SNR, the MMA methods give better probabilistic predictions than the ensemble mean, while for large values of SNR, the MMA methods give worse probabilistic predictions than the ensemble mean. The relativity between SMMA and BMMA is also the same as for PRMSE.
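The point-prediction part of this simulation experiment can be sketched as follows, with smaller sample counts than the paper's 1 million, and validated against the known true mean rather than a fresh future draw (which preserves the ranking of the point methods, since the future-variability term is the same additive constant for all of them):

```python
import numpy as np

def rmse_ratios(snr_true, n=10, n_sims=20000, seed=0):
    """Simulate ensembles with known mean (sigma = 1), apply SMMA and
    statistical testing, and return each method's RMSE against the true
    mean, relative to the RMSE of the raw ensemble mean."""
    rng = np.random.default_rng(seed)
    mu_true = snr_true / np.sqrt(n)              # so that SNR = snr_true
    x = rng.normal(mu_true, 1.0, size=(n_sims, n))
    mu_hat = x.mean(axis=1)
    s_hat = np.sqrt(n) * np.abs(mu_hat) / x.std(axis=1, ddof=1)
    smma = s_hat**2 / (1 + s_hat**2) * mu_hat
    sig_test = np.where(s_hat > 2.262, mu_hat, 0.0)  # 95% two-tail, df = 9

    def rmse(pred):
        return np.sqrt(np.mean((pred - mu_true) ** 2))

    base = rmse(mu_hat)
    return rmse(smma) / base, rmse(sig_test) / base
```

For `snr_true = 0` both methods beat the raw ensemble mean, while for intermediate SNR values statistical testing degrades sharply, mirroring the qualitative behaviour in Fig. 3a.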
One of the implications of these simulation results is that for variables for which the impact of climate change is large and unambiguous, corresponding to large SNR, such as temperature or sea-level rise in most cases, there is little justification for applying the MMA methods, or statistical testing, or AICc, since they would likely make predictions slightly worse. However, for variables such as rainfall, where the impact of climate change is often highly uncertain, corresponding to low SNR, these simulation results suggest it is worth exploring these methods since, depending on the size of the SNR values, they should improve the predictions relative to using the ensemble mean.

Previous Literature
Similar methods to SMMA have been studied in the statistics literature (see e.g., Copas (1983)) and are sometimes known as parameter shrinkage or damping. The SMMA method is adapted from a method used in commercial applied meteorology, where the same principles of bias-variance trade-off were used to derive better methods for fitting trends to observed temperature data for the pricing of weather derivatives (Jewson & Penzer, 2006). The adaptation of the method to ensemble climate predictions is described in a non-peer-reviewed technical report (Jewson & Hawkins, 2009a). The BMMA method was described and tested using simulations in a second non-peer-reviewed technical report (Jewson & Hawkins, 2009b). The present study is, we believe, the first attempt at large-scale testing of these methods using real climate predictions to evaluate whether they really are likely to improve predictions in practice.

Results for RCP4.5, 2011-2040, RTOT, Winter
We now show results for the SMMA method for the single case that was previously illustrated in Fig. 1. For this case, Fig. 4 shows values of the reduction factor $\hat{k}$, the adjusted ensemble mean $\hat{k}\hat{\mu}$ and the difference between the ensemble mean and the adjusted ensemble mean $\hat{\mu}(1-\hat{k})$, as both absolute and relative values.
In Fig. 4a we see that in the regions where the ensemble mean is statistically significant (as shown in Fig. 1d), $\hat{k}$ is close to 1 and the SMMA method will have little effect. In the other regions it takes a range of values, and in some regions, e.g., parts of Spain, it is close to zero. These values of $\hat{k}$ lead to the prediction shown in Fig. 4b. The prediction does not, overall, look much different from the unadjusted prediction shown in Fig. 1a. The changes in the prediction are more clearly illustrated by the differences shown in Fig. 4c. We see the largest changes in the regions of low standard deviation and low SNR such as Portugal and the Alps. Fig. 4d shows that some of the changes are large relative to the ensemble mean (e.g., in parts of

Cross-validation
We can test whether the adjusted ensemble means created by the MMA methods are really likely to give more accurate predictions than the unadjusted ensemble mean, as the theory suggests they might, by using leave-one-out cross-validation within the ensemble. Cross-validation is commonly used for evaluating methods for processing climate model output in this way (see e.g., Raisanen and Ylhaisi (2010)). This only evaluates potential predictive skill, however, since as we are considering projections of future climate, it cannot involve observations. We apply the following steps:
• At each location, for each of the 72 cases, we cycle through the ten climate models, missing out each model in turn.
• We use the nine remaining climate models to estimate the reduction factors $\hat{k}$ and $\hat{k}_B$.
• We make five predictions using the ensemble mean, the SMMA method, the BMMA method, statistical significance testing and AICc model averaging.
• We compare each of the five predictions with the value from the model that was missed out.
• We calculate the PMSE over all ten models and all locations, for each of the predictions.
• We calculate the PMLL for the ensemble mean and the MMA methods.
• We calculate the ratio of the PRMSE for the adjusted ensemble mean and statistical significance predictions to the PRMSE of the unadjusted ensemble mean prediction, so that values less than one indicate a better prediction than the unadjusted ensemble mean prediction.
• We also calculate the corresponding ratio for the PMLL results.
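The cross-validation loop at a single location can be sketched as follows, for the ensemble mean and SMMA only; the function and variable names are ours:

```python
import numpy as np

def loo_prmse_ratio(ensemble):
    """Leave-one-out cross-validation at one location: train on the
    remaining members, predict the held-out member with the raw mean
    and with SMMA, and return PRMSE(SMMA) / PRMSE(ensemble mean)."""
    x = np.asarray(ensemble, dtype=float)
    n = len(x)
    se_mean, se_smma = [], []
    for i in range(n):
        train = np.delete(x, i)
        mu_hat = train.mean()
        s_hat = np.sqrt(n - 1) * abs(mu_hat) / train.std(ddof=1)
        k_hat = s_hat**2 / (1 + s_hat**2)
        se_mean.append((mu_hat - x[i]) ** 2)
        se_smma.append((k_hat * mu_hat - x[i]) ** 2)
    return np.sqrt(np.mean(se_smma)) / np.sqrt(np.mean(se_mean))
```

Values below one indicate that SMMA beats the raw ensemble mean at that location; for low-SNR (near-zero-mean) ensembles the ratio is reliably below one.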
For the case illustrated in Fig. 1 and Fig. 4, we find a value of the PRMSE ratio of 0.960 for the SMMA method, 0.930 for the BMMA method, 1.100 for significance testing and 0.964 for AICc. Since the SMMA, BMMA and AICc methods give values that are less than 1, we see that the adjusted ensemble means are, on average over the whole spatial field, giving predictions with a lower PMSE than the ensemble mean prediction. The predictions are 4%, 8% and 4% more accurate, respectively, as estimates of the unknown mean. Since statistical testing gives a value greater than one, we see that it is giving predictions with higher PMSE than the ensemble mean prediction. All these values are a combination of results from all locations across Europe. The PMSE values from the SMMA, BMMA and AICc methods are lower than those from the ensemble mean in the spatial average but are unlikely to be lower at every location. From the simulation results shown in Sect. 3.4 above we know that the MMA and AICc methods are likely giving better results than the unadjusted ensemble mean in regions where the SNR is low, but less good results where the SNR is high. The final average values given above are therefore in part a reflection of the relative sizes of the regions with low and high SNR.
The values of the PMLL ratio for SMMA and BMMA are 0.9983 and 0.9982, and we see that the probabilistic predictions based on the MMA-adjusted ensemble means are also improved relative to probabilistic predictions based on the unadjusted ensemble mean. The changes in PMLL are small, but our experience is that small changes are typical when using PMLL as a metric.
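As an illustration of how a predictive mean log-likelihood of this kind can be computed, the sketch below scores each held-out member under a normal distribution whose mean is the (possibly adjusted) mean of the remaining members and whose variance is their sample variance. The normal choice and the function names are our assumptions; the paper's actual fitted distributions may differ in detail.

```python
import numpy as np

def pmll(ensemble, adjusted_mean_fn):
    """Leave-one-out predictive mean log-likelihood under a fitted normal.

    ensemble: array of shape (n_models, n_locations) of projected changes.
    adjusted_mean_fn: hypothetical interface; maps the sub-ensemble to the
        (possibly adjusted) mean used as the centre of the fitted normal.
    """
    n = ensemble.shape[0]
    lls = []
    for i in range(n):
        held_out = ensemble[i]
        rest = np.delete(ensemble, i, axis=0)
        mu = adjusted_mean_fn(rest)
        var = rest.var(axis=0, ddof=1)  # sample variance of remaining members
        # normal log-density of the held-out value
        ll = -0.5 * np.log(2 * np.pi * var) - (held_out - mu) ** 2 / (2 * var)
        lls.append(ll)
    return np.mean(lls)  # average over models and locations; higher is better
```

A badly centred predictive distribution scores a lower PMLL, which is why improving the mean also tends to improve this probabilistic score.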

Results for 72 Cases
We now expand our cross-validation testing from one case to all 72 cases, across four seasons, three variables, two RCPs and three time-horizons. Fig. 6 shows the spatial means of the estimates of the adjustment parameter for both MMA methods for all these cases, stratified by season, RCP, variable and time horizon. The format of Fig. 6 follows the format of Fig. 2: each panel contains 72 black circles and 72 red crosses. Each black circle is the spatial mean of the estimates from the SMMA method for one of the 72 cases. Each red cross is the corresponding spatial mean estimate from the BMMA method. The horizontal lines show the means of the estimates within each sub-set. Figure 6a shows that the estimates from both methods decrease from DJF to SON. This is because of the decreasing SNR values shown in Fig. 2a. The BMMA method gives higher estimates than the SMMA method on average, and a lower spread of values. There is no clear impact of rainfall variable on the values (Fig. 6b). Figure 6c shows higher values for RCP8.5 than RCP4.5, reflecting the SNR values shown in Fig. 2c. Figure 6d shows values increasing with time into the future, reflecting the increasing SNR values shown in Fig. 2d.

Figure 7 shows corresponding spatial mean PRMSE results and includes results for significance testing (blue plus signs) and AICc (purple triangles). For the SMMA method the PRMSE reduces (relative to the PRMSE of the unadjusted ensemble mean) for 45 out of 72 cases, while for the BMMA method the PRMSE reduces for 51 out of 72 cases. Significance testing performs much worse than the other methods, and only reduces the PRMSE for 5 out of 72 cases. AICc reduces the PRMSE for 27 out of 72 cases and so performs better than statistical testing but less well than the unadjusted ensemble mean. Considering the relativities of the results between SMMA, BMMA, significance testing and AICc by subset: BMMA gives the best results overall and beats SMMA for 10 out of 12 of the subsets tested.
Significance testing gives the worst results and is beaten by SMMA, BMMA and AICc model averaging in every subset. Considering the results of SMMA, BMMA, significance testing and AICc relative to the unadjusted ensemble mean by subset: SMMA beats the ensemble mean for 11 out of 12 of the subsets tested, BMMA beats the ensemble mean for 12 out of 12 of the subsets tested, significance testing never beats the ensemble mean and AICc beats the ensemble mean for 2 out of 12 of the subsets tested. Considering the variation of PRMSE values by season (Fig. 7a), we see that SMMA, BMMA, significance testing and AICc all perform gradually better through the year, and best in SON, as the SNR reduces (see Fig. 2a). In SON the results for SMMA and BMMA for each of the 18 cases in that season are individually better than the ensemble mean. Considering the variation of PRMSE values by rainfall variable and RCP (Fig. 7b and Fig. 7c), we see little obvious pattern. Considering the variation of PRMSE values by time period, we see that SMMA and BMMA show the largest advantage over the unadjusted ensemble mean for the earliest time period, again because of the low SNR values (Fig. 2d).
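Tallies such as "the PRMSE reduces for 45 out of 72 cases" follow directly from the per-case ratios. The small helper below (the container format and names are our own) counts, for each method, the cases with a PRMSE ratio below one:

```python
import numpy as np

def win_counts(prmse_ratios):
    """Count, per method, how many cases beat the unadjusted ensemble mean.

    prmse_ratios: dict mapping method name -> array-like of PRMSE ratios over
    the cases (hypothetical layout). A ratio below one means the method's
    prediction beat the unadjusted ensemble mean for that case.
    """
    return {m: int(np.sum(np.asarray(r) < 1.0)) for m, r in prmse_ratios.items()}
```

The same tally, applied within each season, variable, RCP or time-horizon subset, gives the by-subset comparisons discussed above.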
Considering results over all 72 cases we find average PRMSE ratios of 0.956 and 0.946 for the SMMA and BMMA methods respectively, corresponding to estimates of the future mean climate that are roughly 4% and 5% more accurate than the predictions made using the unadjusted ensemble mean. For significance testing we find an average PRMSE ratio of 1.226, corresponding to estimates of the future mean climate that are roughly 23% less accurate than the predictions made using the unadjusted ensemble mean. For AICc we find an average PRMSE ratio of 1.02, corresponding to estimates of the future mean climate that are roughly 2% less accurate than those from the unadjusted ensemble mean.

Figure 8 follows the format of Fig. 7, but shows results for PMLL, i.e., it evaluates the performance of probabilistic predictions.
Given the poor performance of statistical testing and AICc in terms of PRMSE we do not show their results for PMLL. We see that the PMLL results are very similar to the PMSE results in Fig. 7, with BMMA showing the best results, followed by SMMA, followed by the unadjusted ensemble mean. For our EURO-CORDEX data, we conclude that making the mean of the prediction more accurate also makes the probabilistic prediction more accurate, which implies that the distribution shape being used in the probabilistic predictions is appropriate.

Further analysis

Figure 9 shows further analysis of these results. Figure 9a shows the mean values of the estimates of the adjustment parameter for the SMMA and BMMA methods, versus the SNR for all 72 cases. The connection between the mean SNR and the mean estimate is now very clear, with the mean estimate increasing with mean SNR. This panel also shows that the BMMA method gives higher values on average for all values of SNR, but particularly for low values of SNR. Figure 9b shows the reduction in the root mean squared size (as opposed to error) of the prediction, which is a measure of the impact of the model averaging. The two methods give very similar results, in which the impact is greatest for the cases with low SNR. These are average reductions over the whole of Europe: locally, the reduction takes values in the whole range from 0 to 1. Figure 9c shows the PRMSE, but now calculated from relative errors, relative to the spatially varying ensemble mean. Values less than one for all 72 cases indicate that the MMA methods perform better than the unadjusted ensemble mean more comprehensively by this measure. The difference between these results and the straight PRMSE results arises because the locations where the MMA methods improve predictions the most on a relative basis tend to be the ones with small signals, which tend to have small prediction errors. These locations do not contribute very much to the straight PRMSE but contribute more when the errors are expressed in a relative sense. Figure 9d shows a scatter plot of the PRMSE versus the SNR for the two methods for all 72 cases. There is a clear relation in which the MMA methods perform best for small SNR values. The relation is similar to that shown in the simulation experiment results shown in Fig. 2, but with the cross-over points (shown by vertical lines) shifted to the right, because these are now relations between averages over many cases with different underlying values for the unknown real SNR. We see that for every case in which the mean SNR is less than 2.81 the SMMA method performs better than the unadjusted ensemble mean on average, and for every case in which the mean SNR is less than 3.02 the BMMA method performs better than the unadjusted ensemble mean on average.

The results in Sect. 5 can be summarized as follows: for the EURO-CORDEX rainfall data, SMMA and BMMA give more accurate predictions on average, in both a point and probabilistic sense, than the unadjusted ensemble mean. BMMA gives more accurate results than SMMA. The MMA methods do well because the ensemble mean is uncertain and has low SNR values at many locations. The benefits of SMMA and BMMA are greatest in the cases with the lowest SNR values. Ensemble climate projections can be used to derive probability distributions for future climate, and the ensemble mean can be used as an estimate of the mean of the probability distribution. Because climate model ensembles are always finite in size, changes in the ensemble mean are always uncertain, relative to the changes in the ensemble mean that would be given by an infinite ensemble. The uncertainty varies in space.
In regions where the signal-to-noise ratio (SNR) of the change in the ensemble mean is high, the change in the ensemble mean gives a precise estimate of the change in the mean climate that would be estimated from the infinite ensemble. However, in regions where the SNR is low, the interpretation of the change in the ensemble mean is a little more difficult. For instance, when the SNR is very low, the change in the ensemble mean is little more than random noise generated by variability in the members of the ensemble, and cannot be taken as a precise estimate of the change in mean climate of the infinite ensemble. In these cases, it might be unfortunate if the ensemble mean were interpreted too literally, or were used to drive adaptation decisions.

We have presented two bias-variance trade-off model averaging algorithms that adjust the change in the ensemble mean as a function of the SNR in an attempt to improve predictive accuracy. We call the methods Mean-squared error Model Averaging (MMA). These methods are designed to be applied after bias correction and corrections for inter-model correlation have been applied. One method is very simple (simple MMA, SMMA), and the other is a more complex Bayesian extension (Bayesian MMA, BMMA). The methods can both be thought of as continuous generalisations of statistical testing, where instead of accepting or rejecting the change in the ensemble mean they apply a continuous adjustment. They can also be thought of as small-sample corrections to the estimate of the ensemble mean. When the SNR is large the ensemble mean is hardly changed by these methods, while when the SNR is small the change in the ensemble mean is reduced towards zero in an attempt to maximise the predictive skill of the resulting predictions.
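The limiting behaviour just described can be illustrated with a toy shrinkage rule. The factor used below is an assumption chosen only to reproduce that behaviour (almost no change at high SNR, shrinkage towards zero at low SNR); it is not the paper's SMMA or BMMA estimator.

```python
import numpy as np

def shrunk_change(ensemble_changes):
    """Illustrative SNR-dependent shrinkage of the ensemble-mean change.

    NOT the paper's estimator: a sketch of the qualitative behaviour only.
    ensemble_changes: array of shape (n_models, n_locations).
    """
    m = ensemble_changes.mean(axis=0)
    # standard error of the ensemble mean
    se = ensemble_changes.std(axis=0, ddof=1) / np.sqrt(ensemble_changes.shape[0])
    snr = np.abs(m) / se
    # illustrative shrinkage factor in (0, 1): near 1 at high SNR, near 0 at low SNR
    k = snr**2 / (1.0 + snr**2)
    return k * m
```

In contrast to a significance test, which would either keep `m` or set it to zero, the factor `k` varies continuously with the SNR, so the adjusted change varies smoothly in space and between scenarios.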
We have applied the MMA methods to a large set of rainfall projections from the EURO-CORDEX ensemble, for 72 different cases across four seasons, three different rainfall variables, two different RCPs and three future time periods during the 21st century. This data shows large variations in the SNR, which results in large variations of the extent to which the ensemble mean is adjusted by the methods.
We have used cross-validation within the ensemble to test whether the adjusted ensemble means achieve greater potential predictive skill for point predictions and probabilistic predictions. To assess point predictions we used predictive mean squared error (PMSE) and to assess probabilistic predictions we used predictive mean log-likelihood (PMLL). For both measures, we compared against results based on the unadjusted ensemble mean. For PMSE we have additionally compared against results based on statistical testing and small-sample Akaike Information Criterion model averaging (a standard method for model averaging). We emphasize that these calculations can only tell us about the potential accuracy of the methods, not the actual accuracy, since we cannot compare projections of future climate with observations. On average over all 72 cases and all locations, the MMA methods reduce the PMSE, corresponding to roughly a 5% increase in potential accuracy in the estimate of the future mean climate. For the SMMA method, the PMSE reduces for 45 of the 72 cases, while for the BMMA method the PMSE reduces for 51 out of 72 cases. Which cases show a reduction in PMSE and which do not depends strongly on the mean SNR within each case, in the sense that the MMA methods perform better when the SNR is low. For instance, the winter SNRs are high, and the average PMSE benefits of the MMA methods are marginal. The autumn SNRs are much lower, and the MMA methods beat the unadjusted ensemble mean in every case. Significance testing, by comparison, gives much worse PMSE values than the unadjusted ensemble mean, and AICc model averaging gives slightly worse PMSE values than the unadjusted ensemble mean. For PMLL we also found that the MMA methods beat the unadjusted ensemble mean.
The ensemble mean can be used as a standalone indication of the possible change in climate, or as the mean of a distribution of possible changes in a probabilistic analysis. We conclude that in both cases, when the ensemble mean is highly uncertain, the MMA-adjusted ensemble means described above can be used in place of the unadjusted ensemble mean. Applying MMA has various advantages: (a) it reduces the possibility of over-interpreting changes in the ensemble mean that are very uncertain, while not affecting more certain changes, (b) relative to significance testing, it avoids jumps in the ensemble mean change in space and between scenarios, and (c) when the SNR is low, it will likely produce more accurate predictions than predictions based on either the unadjusted ensemble mean or statistical testing. In addition to the above advantages, relative to statistical testing the MMA-adjusted ensemble mean reduces the likelihood of false negatives (i.e., missing a change due to climate change) and increases the likelihood of false positives (i.e., falsely identifying a change as being due to climate change). Whether this is an advantage or not depends on the application, but it is typically beneficial for risk modelling.

Author Contribution
SJ designed the study and the algorithms, wrote and ran the analysis code, produced the graphics and wrote the text. GB, PM and JM extracted the EURO-CORDEX data. MS wrote the code to read the EURO-CORDEX data. All the authors contributed to proof-reading.