We give a simple description of the blessing of dimensionality with the main focus on the concentration phenomena. These phenomena imply that in high dimensions independent random vectors from the same distribution have almost the same length and that independent vectors are almost orthogonal. In the climate and atmospheric sciences we rely increasingly on ensemble modelling and face the challenge of analysing large samples of long time series and spatially extended fields. We show how the properties of high dimensions allow us to obtain analytical results for e.g. correlations between sample members and the behaviour of the sample mean when the size of the sample grows. We find that the properties of high dimensionality can be applied to climate data with reasonable success. This is the case although most climate data show strong anisotropy and both spatial and temporal dependence, resulting in effective dimensions around 25–100.

In many areas of geophysics we operate in high-dimensional spaces. Examples from the atmospheric and climate sciences include extended spatial fields, such as precipitation or near-surface temperature, and long time series of atmospheric variables, such as the global mean temperature. These fields and time series may be either observed or modelled. Over the last decades ensemble modelling has been generally accepted as a valuable tool to gauge the unpredictability and error originating from uncertain initial conditions or deficiencies in model physics. There is also an increased tendency for gridded observational products and reanalyses to apply ensemble techniques to represent the different uncertainties. We are therefore often in a situation where we need to analyse large samples of high-dimensional fields. These samples could consist of the individual ensemble members or just individual years of a spatial field.

This might seem a daunting challenge as the properties of high-dimensional space often appear counterintuitive to minds experienced
only in the low-dimensional world. However, the properties of high-dimensional space may sometimes simplify the analysis and allow us to
obtain rather general analytical results. A central result in this
respect is the concentration of measures which, with a quote from

These advantageous properties of high dimensionality – often referred
to as the blessing of dimensionality – have rarely been applied to
the atmospheric and climate sciences. Exceptions are our previous
papers on the subject. In

In Sect.

Results for a unit cube in

Here we give a brief overview of the properties of high-dimensional spaces. We begin in Sect.

The properties of high-dimensional spaces often defy our intuition based on two and three dimensions

The properties of high-dimensional spaces are sometimes called the curse and sometimes the blessing of dimensionality, depending on the considered problem. In the present context these properties turn out to be a blessing as they strongly simplify the analysis and make analytical results possible.

As a simple example we consider a cube in

Comparable to the number of atoms in 30

Consider now a sample of points drawn independently from the high-dimensional cube. For moderate sample size (

The beneficial properties of high dimensionality are recognized in many
areas of machine learning

We first look at a very simple example to describe the general idea of
concentration of measures. Consider

The considerations above are basically the rationale behind the law
of large numbers and are also closely related to the central limit
theorem which states that

Let us organize the random variables into an

Let us consider a multi-variate standard Gaussian distribution

The concentration of measures is the backbone of statistical mechanics.
As a simple example, we consider the canonical ensemble of weakly
interacting identical particles. This ensemble describes a system with a
constant number of particles,

Let us take a brief look at waist concentration. Consider two independent
unit vectors

The topic of concentration properties is an active mathematical
field with focus on probabilistic bounds on how quickly empirical means converge to the ensemble means for different classes of random variables,
including non-iid variables

The AgERA data set. Daily means with 5 d separation for June 1980–1990.

Like the central limit theorem (CLT), the concentration properties were
originally developed for iid variables. However, as with the CLT,
they can be extended to classes of dependent variables. Although
no general condition exists for the CLT

Here correlations generally refer to measures of the dependence, e.g. the distance between the joint distribution and the product of the marginal
distributions. Note that the decay of Pearson's correlation coefficient is not necessarily sufficient: a zero correlation coefficient does
not guarantee independence because it only gauges linear dependence. The
auto-regressive moving-average (ARMA) models which are often used in
geophysics are examples of mixing processes
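
The remark that a vanishing Pearson correlation does not imply independence can be illustrated with a minimal Python sketch. The construction (a standard Gaussian variable and its square) and all names are illustrative assumptions and are not taken from the analysis in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x**2  # y is a deterministic function of x, so the two are fully dependent

# Pearson's coefficient only gauges linear dependence and is close to zero here
r_linear = np.corrcoef(x, y)[0, 1]

# A nonlinear transform of x exposes the dependence
r_nonlinear = np.corrcoef(x**2, y)[0, 1]

print(f"corr(x, y)   = {r_linear:.3f}")
print(f"corr(x^2, y) = {r_nonlinear:.3f}")
```

This is why the decay of a dependence measure, rather than of the linear correlation alone, is what matters for the extension to dependent variables.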

The mixing and decay of correlations are closely related
to the concept of effective degrees of freedom also known
as the effective dimension

The situation is well known in the study of one-dimensional time series

In the case of two-dimensional fields, different methods exist to estimate the number of effective dimensions
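
As an illustration of one such method, the sketch below computes the eigenvalue participation ratio, (Σλ)²/Σλ², of the sample covariance matrix. This is a commonly used estimator and is shown here only as an assumption; it is not necessarily the estimator behind the effective dimensions quoted in this paper. The function name and the array layout (members × grid points) are hypothetical.

```python
import numpy as np

def effective_dimension(sample):
    """Participation ratio (sum lambda)^2 / sum lambda^2 of the covariance eigenvalues.

    sample: array of shape (n_members, n_points).
    """
    anomalies = sample - sample.mean(axis=0, keepdims=True)
    # Squared singular values of the anomalies are proportional to the
    # eigenvalues of the sample covariance matrix; the factor cancels in the ratio.
    singular_values = np.linalg.svd(anomalies, compute_uv=False)
    lam = singular_values**2
    return lam.sum() ** 2 / (lam**2).sum()

rng = np.random.default_rng(0)

# Isotropic case: 50 independent coordinates, so the estimate is close to 50
isotropic = rng.standard_normal((5000, 50))

# Fully dependent case: every member is a multiple of one fixed spatial pattern
pattern = rng.standard_normal(50)
rank_one = rng.standard_normal((5000, 1)) * pattern

dim_isotropic = effective_dimension(isotropic)
dim_rank_one = effective_dimension(rank_one)
print(dim_isotropic, dim_rank_one)
```

Note that the estimate cannot exceed the number of non-zero eigenvalues, which is at most the number of members minus one, so small samples bias it downwards.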

These numbers are of course small compared to Avogadro's number relevant for statistical mechanics, but they are still comparable to the dimensions
in Fig.

As we saw in Sect.

For initial condition ensembles consisting of experiments with the same
model but with different initial conditions, the different ensemble
members can be considered independent (considering anomalies with
respect to the ensemble centre as explained in the next subsection). For multi-model ensembles where experiments are performed with models
with different physical parameterizations (but the same external
forcings), the situation is more complicated

Another way to obtain independent samples from the same distribution is to consider a given variable at different times. For example, we could look at the spatial field of precipitation or temperature on different days or in different months. To ensure that the fields are drawn from the same distribution, we need to avoid or remove the annual cycle and – if longer periods are considered – to make sure that there is no external forcing. The sample times should also be sufficiently separated.
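
The preparation steps described above can be sketched as follows, assuming daily fields for a single calendar month stored in a hypothetical array of shape (years, days, grid points). The function name, the array layout and the synthetic data are assumptions. The sketch removes the mean over years for each calendar day and keeps every fifth day, but does not address external forcing or trends.

```python
import numpy as np

def build_sample(daily, step=5):
    """Turn daily fields into a sample of separated anomaly fields.

    daily: array of shape (n_years, n_days, n_points) for one calendar month
           (hypothetical layout).
    Returns an array of shape (n_samples, n_points).
    """
    # Remove the annual cycle: for each calendar day and grid point,
    # subtract the mean over all years
    anomalies = daily - daily.mean(axis=0, keepdims=True)
    # Keep every `step`-th day so that the sample times are separated
    separated = anomalies[:, ::step, :]
    return separated.reshape(-1, separated.shape[-1])

# Synthetic stand-in: 11 years, 30 days, 200 grid points, with a
# within-month drift that mimics the annual cycle
rng = np.random.default_rng(0)
drift = np.linspace(0.0, 3.0, 30)[None, :, None]
daily = rng.standard_normal((11, 30, 200)) + drift

sample = build_sample(daily)
print(sample.shape)  # (66, 200)
```

Whether a separation of five days is sufficient depends on the variable and should be checked against the decay of the temporal dependence.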

In the next sections we will consider the following geophysical
data sets. (1) Daily means of near-surface temperature and precipitation from AgERA for June in the period 1980–1990. The
AgERA provides daily surface meteorological data for agro-ecological
studies (doi: 10.24381/cds.6c68c9bb) based on ECMWF's ERA5 reanalysis

In addition to the geophysical data, we also include two simple samples of independent vectors. The first sample consists of independent vectors
drawn from an

The multi-model 45-member CMIP5 ensemble. Monthly climatology in TAS (K).

The normalized distances as a function of dimension

Summary of the different measures. Entries show mean/standard deviation.
Units are K for temperature and

In this subsection we directly investigate the distributions of the
lengths of the sample members and the distributions of the angles between
them. The results from this and the following subsections are summarized
in Table

We centre the sample,

We first consider the near-surface temperature and precipitation fields from the AgERA data set. Figure

Figure

Results for the MPI-GE 100-member initial condition ensemble are shown in Table

Reducing the spatial area decreases the effective dimension. As an
example we have included in Table

In the analysis above we centred the sample to the sample mean before calculating the lengths; i.e. we used

Distances between samples (cyan) and between sample and sample mean (blue).

Correlations as a function of dimension

If the sample members are drawn independently from the same distribution
in high dimensions, they have approximately the same length, and we can write

Therefore, the distance between two sample members is a factor of $\sqrt{2}$ larger than the distance between a sample member and the sample mean. The geometric interpretation is that the sample mean and any two
sample members form an isosceles right triangle with the right angle at
the sample mean
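
The triangle picture can be checked numerically with a minimal sketch for the idealized case of independent standard Gaussian vectors. The sample size and dimension are arbitrary assumptions, and the code is not the IDL code used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members, n_dim = 50, 1000
sample = rng.standard_normal((n_members, n_dim))

# Distances between each sample member and the sample mean
centred = sample - sample.mean(axis=0)
dist_to_mean = np.linalg.norm(centred, axis=1)

# Distances between all distinct pairs of sample members
i, j = np.triu_indices(n_members, k=1)
dist_pairs = np.linalg.norm(sample[i] - sample[j], axis=1)

# Angle at the sample mean between all distinct pairs of centred members
unit = centred / dist_to_mean[:, None]
cosines = np.clip((unit @ unit.T)[i, j], -1.0, 1.0)
mean_angle = np.degrees(np.arccos(cosines)).mean()

ratio = dist_pairs.mean() / dist_to_mean.mean()
print(f"distance ratio = {ratio:.3f} (sqrt(2) = {np.sqrt(2):.3f})")
print(f"mean angle at the sample mean = {mean_angle:.1f} degrees")
```

For a finite sample the ratio is slightly larger than the square root of 2 and the angle slightly larger than 90°, because centring on the sample mean introduces a weak negative correlation between the centred members.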

Figure

Figure

The indistinguishable interpretation claims that observations are
drawn from the same distribution as the ensemble members. With this
assumption and the considerations above,

The results in this subsection and Sects.

Correlations of pairs of sample differences

The length of the sample mean

Error correlations and correlations between model differences are
important when studying the structure of a model ensemble and when comparing
an ensemble to observations

We have in general

Replacing

Figure

The correlations for AgERA daily mean precipitation for June and for
the CMIP5 monthly climatology of near-surface temperature are shown in
Fig.

If we again assume that the observations are drawn from the same
distribution as the ensemble members – the indistinguishable
interpretation – the error correlation is

With

We now consider how the sample mean depends on the
sample size. The ensemble mean is often used to estimate
the forced response from initial condition and multi-model
ensembles
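
As a minimal sketch of the dependence on sample size, consider the idealized case of independent vectors drawn from a zero-mean standard Gaussian distribution in N dimensions, i.e. pure noise without a forced response. Each member has a length close to the square root of N, and the mean of n members has a length close to the square root of N/n. The function name and parameter values are assumptions for illustration.

```python
import numpy as np

def mean_length(n, n_dim, rng, trials=20):
    """Average length of the mean of n independent standard Gaussian vectors."""
    lengths = [
        np.linalg.norm(rng.standard_normal((n, n_dim)).mean(axis=0))
        for _ in range(trials)
    ]
    return float(np.mean(lengths))

rng = np.random.default_rng(0)
n_dim = 1000
for n in (1, 4, 16, 64):
    simulated = mean_length(n, n_dim, rng)
    expected = np.sqrt(n_dim / n)
    print(f"n = {n:3d}  simulated = {simulated:6.2f}  sqrt(N/n) = {expected:6.2f}")
```

In this idealized setting the noise part of the sample mean shrinks as the inverse square root of the sample size, which is what makes the sample mean useful as an estimate of a common signal.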

Letting

The practical way to estimate the effect of sample size is to apply a
bootstrap procedure to a large sample of size

The mean is shown as a function of

In the three previous subsections we studied the samples of daily
June temperatures and of monthly climatologies. In the former the

The right panel shows

It is well known that the number of samples necessary for a given coverage increases exponentially with the dimension. In this paper we have described other, less intuitive properties of high-dimensional space such as the concentration of measures and waist concentration. In loose terms these properties state that independent sample members from the same distribution have approximately the same length and that pairs of independent sample members are approximately orthogonal. While most results are derived for iid random variables, we discussed the extension to the non-iid situation and how the strength of the dependence is related to the effective dimension.
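
The two loosely stated properties can be illustrated with a minimal sketch for independent standard Gaussian vectors, showing how the relative spread of the lengths and the spread of the pairwise angles shrink as the dimension grows. The function name, sample size and dimensions are assumptions and are unrelated to the climate data sets analysed here.

```python
import numpy as np

def length_and_angle_spread(n_dim, n_members=200, seed=0):
    """Relative spread of lengths, and mean and spread of pairwise angles (degrees)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_members, n_dim))
    lengths = np.linalg.norm(x, axis=1)
    unit = x / lengths[:, None]
    # Angles between all distinct pairs of members
    i, j = np.triu_indices(n_members, k=1)
    cosines = np.clip((unit @ unit.T)[i, j], -1.0, 1.0)
    angles = np.degrees(np.arccos(cosines))
    return lengths.std() / lengths.mean(), angles.mean(), angles.std()

for n_dim in (3, 30, 300, 3000):
    rel_spread, angle_mean, angle_std = length_and_angle_spread(n_dim)
    print(
        f"N = {n_dim:4d}  length spread = {rel_spread:.3f}  "
        f"angle = {angle_mean:5.1f} +/- {angle_std:4.1f} degrees"
    )
```

The mean angle is close to 90° in all dimensions, and it is the narrowing of the distributions with increasing dimension that constitutes the concentration.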

We directly investigated to what extent these properties hold for typical climate fields and time series. Ensemble modelling provides an obvious source of samples, but samples can also be obtained by considering e.g. different days or years. We investigated the monthly climatology of both an initial condition ensemble and a multi-model ensemble. We also investigated fields of daily means from a reanalysis. While the nominal dimensions of such fields are high, the effective dimensions are typically of the order 25–100, and it is not obvious to what degree the properties of high-dimensional spaces apply to such fields.

We found that for the global-scale fields of near-surface temperature and precipitation, both the concentration of measures and the waist concentration hold to a reasonable degree. The lengths of the sample
members are rather narrowly distributed around the mean length, with widths (standard deviation) around

Regarding the model ensembles, the concentration properties are better fulfilled for the initial condition ensemble (MPI-GE) than for the multi-model ensemble (CMIP5). In the latter case the dependence of related models will result in these models being far from orthogonal.

Based on the concentration properties, we derived simple analytical results that hold for large dimensions. These analytical results
include (1) the distances between two sample members are a factor of

We conclude that in many cases the concentration properties allow us a deeper understanding of the behaviour of samples of climate fields. However, in each case it is important to investigate whether the conditions of high dimensionality and independence are fulfilled. Even for global fields there is a substantial spread around the values predicted for the high-dimensional limit.

We have only briefly mentioned the relation between observations and
models. The relation depends on whether we assume that observations
are drawn from the same distribution as the model ensemble
(the indistinguishable interpretation) or whether we assume
that the ensemble members are centred around the observations (truth-centred interpretation). In the former case the results for
individual model members also hold for observations, as we discussed in Sect.

The Interactive Data Language (IDL) code used in the analysis can be requested from the author.

The AgERA reanalysis was downloaded from

The MPI Grand Ensemble Project

The CMIP5 data were downloaded from

The author declares that there is no conflict of interest.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The author acknowledges the support from the NordForsk-funded Nordic Centre of Excellence and the European Union.

We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. For CMIP the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provided coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals.

This research has been supported by NordForsk (award no. 76654) and Horizon 2020 (EUCP, grant no. 776613).

This paper was edited by Stéphane Vannitsem and reviewed by Maarten Ambaum and one anonymous referee.