Most available verification metrics for ensemble forecasts focus on univariate quantities. That is, they assess whether the ensemble provides an adequate representation of the forecast uncertainty about the quantity of interest at a particular location and time. For spatially indexed ensemble forecasts, however, it is also important that forecast fields reproduce the spatial structure of the observed field and represent the uncertainty about spatial properties such as the size of the area for which heavy precipitation, high winds, critical fire weather conditions, etc., are expected. In this article we study the properties of the fraction of threshold exceedance (FTE) histogram, a new diagnostic tool designed for spatially indexed ensemble forecast fields. Defined as the fraction of grid points where a prescribed threshold is exceeded, the FTE is calculated for the verification field and separately for each ensemble member. It yields a projection of a – possibly high-dimensional – multivariate quantity onto a univariate quantity that can be studied with standard tools like verification rank histograms. This projection is appealing since it reflects a spatial property that is intuitive and directly relevant in applications, though it is not obvious whether the FTE is sufficiently sensitive to misrepresentation of spatial structure in the ensemble. In a comprehensive simulation study we find that departures from uniformity of the FTE histograms can indeed be related to forecast ensembles with biased spatial variability and that these histograms detect shortcomings in the spatial structure of ensemble forecast fields that are not obvious by eye. For demonstration, FTE histograms are applied in the context of spatially downscaled ensemble precipitation forecast fields from NOAA's Global Ensemble Forecast System.

Ensemble prediction systems like the ECMWF ensemble

When entire forecast fields are considered, aspects beyond univariate calibration are important. For example, ensembles that yield reliable
probabilistic forecasts at each location may still over- or under-forecast regional minima/maxima if their members exhibit an inaccurate spatial
structure (e.g.,

Simulated verification field and three associated forecast fields (arbitrary color scale) in which the spatial correlation length is either the same as for the verification, 10 % miscalibrated, or 50 % miscalibrated. Can you tell which is correct?

There is an added difficulty for forecasters in that misrepresentation of the spatial structure of weather variables by ensemble forecast fields may not be discernible by eye. For example, consider the simulated fields in Fig.

Several multivariate generalizations of verification rank histograms, such as minimum spanning tree histograms

The projection underlying the verification metric studied here is based on threshold exceedances of the forecast and observation fields. This
binarization of continuous weather variables is common in spatial forecast verification (see

In Sect. 2, we describe the calculation of the FTE and the construction of the FTE histogram in detail. In Sect. 3, a simulation study is designed and implemented that allows us to analyze the discrimination capability of the FTE histograms with regard to spatial structures. In Sect. 4, we demonstrate the utility of FTE histograms in the context of spatially downscaled ensemble precipitation forecast fields from NOAA's Global Ensemble Forecast System. A discussion and concluding remarks are given in Sect. 5.

Let

Suppose we have a

Gathering ranks over
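Although the formal definitions in this section are abridged here, the computation they describe can be sketched in a few lines (an illustrative sketch; the function names are ours, and ties are broken at random in the spirit of the randomization described in the Appendix):

```python
import numpy as np

def fte(field, threshold):
    """Fraction of grid points at which the field exceeds the threshold."""
    return np.mean(field > threshold)

def verification_rank(verif_field, ensemble_fields, threshold, rng=None):
    """Rank of the verification FTE among the ensemble FTEs.

    Ties are broken at random so that, for a calibrated ensemble,
    the resulting ranks are uniformly distributed on 1..m+1.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = fte(verif_field, threshold)
    ens_ftes = np.array([fte(f, threshold) for f in ensemble_fields])
    below = np.sum(ens_ftes < v)
    ties = np.sum(ens_ftes == v)
    return below + rng.integers(0, ties + 1) + 1  # rank in 1..m+1
```

Collecting these ranks over many forecast cases and binning them yields the FTE histogram.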

If the marginal forecast distributions are miscalibrated, the resulting effects on the rank of the verification FTE are superimposed on those caused
by misrepresentation of spatial correlations. This complicates interpretation because it is often impossible to disentangle the different sources of
miscalibration (this loss of information is an inevitable consequence of projecting a multivariate quantity onto a univariate one), and it can even
happen that different effects cancel each other out. For example, ensemble forecast fields which are both under-dispersive and have too strong spatial
correlations may result in flat FTE histograms. This serves as a reminder that – as in the univariate case – a flat histogram is a necessary but not
sufficient condition for probabilistic calibration. It simply indicates that the verification and the ensemble are indistinguishable with regard to the particular aspect of the forecast fields (here: exceedance of a prespecified threshold) assessed by this metric. Systematic over- or
under-forecast biases can be accounted for by using different threshold values (chosen according to the respective climatology)

Characterization of FTE histogram shapes via

Skewness is exaggerated by high thresholds; see text for more detail.

While the FTE histogram is a useful visual diagnostic tool, a quantitative measure for studying departures from uniformity is desirable. Akin to

In summary, the FTE metric is composed of three steps: (1) calculate the FTE of each verification and ensemble forecast field, (2) construct an FTE
histogram over available instances of forecast and verification times, and (3) derive the
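Since the histogram shape is characterized via estimated beta distribution parameters, the final step can be sketched as a maximum-likelihood beta fit to the ranks mapped into the unit interval (an illustrative sketch; the mapping of ranks to bin midpoints is our assumption, and the exact convention in the paper may differ):

```python
import numpy as np
from scipy import stats

def fit_beta_to_ranks(ranks, n_bins):
    """Fit a beta distribution to verification ranks mapped to (0, 1).

    A flat FTE histogram corresponds to fitted parameters (a, b)
    close to (1, 1); a, b > 1 indicates a hump-shaped histogram and
    a, b < 1 a U-shaped one.
    """
    u = (np.asarray(ranks, dtype=float) - 0.5) / n_bins  # bin midpoints
    a, b, _, _ = stats.beta.fit(u, floc=0, fscale=1)
    return a, b
```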

In this section we consider an extensive simulation study to assess the ability of the proposed FTE histogram to diagnose deficiencies in the
representation of spatial variability by the ensemble forecast fields. Our simulations will be based on multivariate Gaussian processes where the
notion of “spatial variability” can be quantified in terms of a correlation length parameter. The various meteorological quantities of interest such
as precipitation and wind speeds can be quite heterogeneous and spatially nonstationary over the study domain. However, since we study the spatial
structure of threshold exceedances, a suitable choice of thresholds can mitigate these effects to the extent that multivariate, stationary Gaussian
processes can be viewed as a sufficiently flexible model for simulating realistic spatial fields. To see this, consider a strictly positive and
continuous variable
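The argument above is truncated, but the underlying point, that threshold exceedance patterns (and hence the FTE) are invariant under strictly increasing transformations, can be illustrated directly (a minimal sketch; the gamma marginals and log transform are arbitrary illustrative choices):

```python
import numpy as np

# A strictly positive, continuous field (gamma marginals, for example)
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, size=(16, 16))

# Any strictly increasing transform g leaves the exceedance pattern,
# and hence the FTE, unchanged once the threshold is transformed too.
g = np.log  # illustrative choice
t = 1.5
same_pattern = np.array_equal(x > t, g(x) > g(t))
same_fte = np.mean(x > t) == np.mean(g(x) > g(t))
```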

The main technical difficulty in setting up the simulation study lies in generating multiple, stationary Gaussian random fields that have different
correlation lengths while being correlated with each other. That is, we would like to generate

We call a vector of processes

There are many models for multivariate processes

Simultaneously simulating the verification field

The simulation setup follows a series of steps.

Generate

Generate 11 independent mean zero Gaussian random fields

The ensemble member fields

In this study, fields were constructed on a square grid over the domain
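The covariance model and cross-correlation construction are given above only in outline. As a self-contained illustration of the basic building block, a mean-zero stationary Gaussian field with a tunable correlation length can be simulated on a small grid via a Cholesky factorization of the covariance matrix (a sketch; the exponential correlation model and grid size here are our assumptions):

```python
import numpy as np

def simulate_gaussian_field(n, corr_length, rng):
    """Simulate a mean-zero stationary Gaussian field on an n x n grid
    with exponential correlation exp(-d / corr_length)."""
    x, y = np.meshgrid(np.arange(n), np.arange(n))
    pts = np.column_stack([x.ravel(), y.ravel()]).astype(float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    cov = np.exp(-d / corr_length)
    # small jitter for numerical stability of the factorization
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n * n))
    return (L @ rng.standard_normal(n * n)).reshape(n, n)
```

Larger values of `corr_length` produce smoother fields; miscalibrated ensembles can be mimicked by simulating members with a correlation length that differs from that of the verification field.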

The question of primary interest in this analysis is whether the FTE histogram accurately identifies miscalibration of ensemble correlation lengths.

First, we study the discrimination ability of the FTE histogram in something of an exaggerated setting, where the miscalibration is obvious. We choose
the median of the marginal distribution as the threshold (i.e.,

Example binary exceedance verification field and a subset of ensemble fields with representative FTE histogram for threshold

As Fig.

While the FTE histogram is able to correctly identify the obvious miscalibration of the ensemble for the scenario in Fig.

As Fig.

Of course, one may often want to use a threshold parameter other than the median of the marginal distributions. The choice of

Recall that we propose quantifying the shape of the FTE histogram with the

Estimated beta distribution parameters (top) and corresponding

Estimated

Another variable of interest in evaluating the FTE histograms is the size of the domain to which the metric is applied. In our simulation framework,
making the domain larger or smaller while keeping the correlation length constant is equivalent to keeping the domain size constant and varying the
correlation length of the verification field. That is, for a fixed domain size, a smaller correlation length mimics a “large domain” (with low
resolution) and a larger correlation length mimics a “small domain” (with high resolution). Analyzing the

We now turn attention back to our motivating figure (Fig.

Distributed hydrological models like NOAA's National Water Model (NWM) require meteorological inputs at a relatively high spatial resolution. At
shorter forecast lead times (typically up to 1 or 2 days ahead), limited-area NWP models provide such high-resolution forecasts, but for longer lead times only forecasts from global ensemble forecast systems like NOAA's GEFS are available. These come at a relatively coarse resolution and need
to be downscaled (statistically or dynamically) to the high-resolution output grid. Here, we use a combination of the statistical post-processing
algorithm proposed by

We consider 6 h precipitation accumulations over a region in the southeastern US between

Examples of different data fields for 6 h precipitation accumulation on 24 July 2004

In order to obtain calibrated ensemble precipitation forecasts at the CCPA grid resolution, we proceed in three steps. First, we apply the post-processing algorithm by

Before applying the FTE histogram to investigate whether the spatial disaggregation used in the downscaling method produces precipitation fields with
appropriate sub-grid-scale variability, we check the calibration of the univariate ensemble forecasts across all fine-scale grid points. We study (separately) the months January, April, July, and October in order to represent winter, spring, summer, and fall, respectively. Daily analyses and
corresponding ensemble forecasts from each of these months are pooled over the entire verification period and all grid points within the study area and are used to construct the verification rank histograms in Fig.
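The construction of these pooled univariate rank histograms, with randomized tie-breaking and fully tied cases (e.g., completely dry cases where the analysis and every member are zero) removed, can be sketched as follows (an illustrative implementation, not the authors' code):

```python
import numpy as np

def rank_histogram(obs, ens, rng=None):
    """Verification rank histogram for univariate forecasts.

    obs: (n_cases,) observations; ens: (n_cases, m) ensemble members.
    Returns counts over m+1 rank bins (index k = rank k+1).  Ties are
    broken at random; cases where the observation and all members are
    identical (fully tied) are discarded.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs, ens = np.asarray(obs), np.asarray(ens)
    counts = np.zeros(ens.shape[1] + 1, dtype=int)
    for o, e in zip(obs, ens):
        if np.all(e == o):
            continue  # fully tied case, e.g., no precipitation anywhere
        below = np.sum(e < o)
        ties = np.sum(e == o)
        counts[below + rng.integers(0, ties + 1)] += 1
    return counts
```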

Verification rank histograms (density) for downscaled fields at representative months with cases of fully tied ranks removed. Estimated

Ideally, the statistical post-processing and downscaling should yield calibrated ensemble forecasts and thus rank histograms that are approximately
uniform. Clearly, the rank histograms for the downscaled forecast fields shown in Fig.

In the remaining analysis, we employ FTE histograms to investigate the spatial properties of the ensemble forecast fields obtained by the downscaling
algorithm for the same representative months outlined above. Spatial variability of precipitation fields depends on whether precipitation is
stratiform or convective, and in the latter case also on the type of convection (local vs. synoptically forced). The frequency of occurrence of these categories has a seasonal cycle, and it is therefore interesting to study how well the downscaling methodology works in different seasons. The first
step in computing the FTE is deciding what value to use for the threshold. If the climatology varies strongly across the domain, it may be desirable
to use a variable threshold such as a climatology percentile. However, the southeastern US is a flat and relatively homogeneous region, meaning the precipitation accumulation patterns will not be affected as much by orography, and we therefore select a fixed threshold for constructing FTE
histograms. Another advantage of this approach is that a fixed threshold has a direct physical interpretation; here we use thresholds of 5, 10, and
20

FTE histograms for downscaled fields at different thresholds in representative months. Estimated

In Fig.

When forecasting meteorological variables on a spatial domain, it is important for many applications that not only the marginal forecast distributions but also the spatial (and/or temporal) correlation structure be represented adequately. In some instances, misrepresentation of spatial structure by
ensemble forecast fields may be visually obvious; otherwise, a quantitative verification metric is desired to objectively evaluate the ensemble
calibration. The FTE metric studied here is a projection of a multivariate quantity (i.e., a spatial field) to a univariate quantity and can be combined with the concept of a (univariate) verification rank histogram to analyze the spatial structure of ensemble forecast fields. This idea was first applied by

In this paper, we performed a systematic study in which we simulated ensemble forecast and verification fields with different correlation lengths to
understand how well a misspecification of the correlation length can be detected by the FTE metric. To this end, the metric was slightly extended and
is composed of three steps: (1) calculate the FTE of each verification and ensemble forecast field, (2) construct an FTE histogram over available
instances of forecast and verification times, and (3) derive the

The FTE metric is relatively simple and enjoys an easy and intuitive interpretation. In particular, the

The data vector of ranks

Let

Simulate a (continuous) uniform random variable:

Set

Let

Suppose

The simulation setup introduced in Sect.

Figure

A similar story is provided by Fig.

As Fig.

As Fig.

The code and data used for this study are available in the accompanying Zenodo repositories (code:

This study is based on the Master's work of JJ under supervision of WK and MS. The concept of this study was developed by MS and extended by all authors. JJ implemented the study and performed the analysis with guidance from WK and MS. JB provided the downscaled GEFS forecast fields. JJ, WK, and MS collaborated in discussing the results and composing the manuscript, with input from JB on Sect. 4.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Advances in post-processing and blending of deterministic and ensemble forecasts”. It is not associated with a conference.

Josh Jacobson was supported by NSF DMS-1407340. William Kleiber was supported by NSF DMS-1811294 and DMS-1923062. Michael Scheuerer and Joseph Bellier were supported by funding from the US NWS Office of Science & Technology Integration through the Meteorological Development Laboratory, project no. 720T8MWQML. This work utilized resources from the University of Colorado Boulder Research Computing Group, which is supported by the NSF (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University.

This research has been supported by the National Science Foundation (grant nos. DMS-1407340, DMS-1811294, and DMS-1923062) and by the US National Weather Service Office of Science & Technology Integration (grant no. 720T8MWQML).

This paper was edited by Maxime Taillardat and reviewed by two anonymous referees.