Accurate specification of the error statistics required for data assimilation remains an ongoing challenge, partly because their estimation is an underdetermined problem that requires statistical assumptions. Even with the common assumption that background and observation errors are uncorrelated, the problem remains underdetermined. A natural question is whether the increasing amount of overlapping observations and other datasets can reduce the total number of statistical assumptions required, or whether they simply introduce more statistical unknowns. To answer this question, this paper provides a conceptual view of the statistical error estimation problem for multiple collocated datasets, including a generalized mathematical formulation, an illustrative demonstration with synthetic data, and guidelines for setting up and solving the problem. It is demonstrated that the required number of statistical assumptions increases only linearly with the number of datasets, whereas the number of error statistics that can be estimated increases quadratically; for more than three datasets, this allows an increasing number of error cross-statistics between datasets to be estimated. The presented generalized estimation of full error covariance and cross-covariance matrices between datasets does not necessarily accumulate the uncertainties of the assumptions across the error estimates of multiple datasets.

Accurate specification of the error statistics used for data assimilation has been an ongoing challenge. It is known that the accuracy of both background
and observation error covariances has a strong impact on the performance of atmospheric data assimilation

Outside the field of data assimilation, two different methods have been developed that allow for a statistically optimal estimation of scalar error
variances for fully collocated datasets. Although similar, these two methods were developed independently of each other in different scientific
fields. One method, called the three-cornered hat (3-CH) method, is based on
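As a concrete illustration of the scalar 3-CH idea, the following sketch estimates three error variances from the pairwise difference variances of synthetic data. This is an illustration with hypothetical variable names, not code from the study, and it assumes mutually uncorrelated errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
truth = rng.standard_normal(n)

# Three collocated datasets with mutually independent errors of known variance
sig = [0.5, 1.0, 1.5]
x = [truth + s * rng.standard_normal(n) for s in sig]

# Pairwise difference variances: V_ij = Var(x_i - x_j)
V = {(i, j): np.var(x[i] - x[j]) for i in range(3) for j in range(i + 1, 3)}

# Three-cornered hat: Var(e_0) = (V_01 + V_02 - V_12) / 2, and cyclically
est = [
    0.5 * (V[(0, 1)] + V[(0, 2)] - V[(1, 2)]),
    0.5 * (V[(0, 1)] + V[(1, 2)] - V[(0, 2)]),
    0.5 * (V[(0, 2)] + V[(1, 2)] - V[(0, 1)]),
]
```

With a large enough sample, each estimate approaches the corresponding true error variance; no dataset needs to be singled out as a reference.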

Independent of these developments,

While the estimation of three error variances has been well established for decades, recent developments propose different approaches to extend the
method to a larger number of datasets. As observed by studies such as

This demonstrates that the different approaches available for more than three datasets provide only an incomplete picture of the problem, as each
approach is tailored to the specific conditions of the respective application. Aiming for a more general analysis, this paper approaches the problem
from a conceptual point of view. The main questions to be answered are as follows:

How many error statistics can be extracted from residual statistics between multiple collocated datasets?

How many statistics remain to be assumed?

How do inaccuracies in assumed error statistics affect different estimations of error statistics?

What are the general conditions to set up and solve the problem?

Suppose a system of

Thus, the total number

While error statistics with respect to the truth are usually unknown in real applications, residual covariances can be calculated from the residuals between
each pair of different datasets. The main idea now is to express the known residual statistics as functions of unknown error statistics
(Sect.

Relation between different numbers of statistics (covariances and cross-covariances) as a function of the number of datasets. Shown are

Because

In most applications of geophysical datasets, such as data assimilation, the estimation of error covariances is crucial, whereas their error
cross-covariances are usually assumed to be negligible. Given the greater need to estimate the

The relation between the number of datasets, residual covariances, and assumed and estimated error statistics is visualized in Fig.
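The bookkeeping behind this relation can be sketched in a few lines. This is an illustration under the convention described above (the knowns are the pairwise residual covariances; the unknowns are all error covariances plus one cross-statistic per dataset pair; the helper name is ours):

```python
def statistics_counts(n):
    """Counting of statistics for n fully collocated datasets."""
    residual_cov = n * (n - 1) // 2        # known: one residual covariance per pair
    unknowns = n + n * (n - 1) // 2        # n error covariances + one cross-statistic per pair
    assumptions = unknowns - residual_cov  # grows linearly: exactly n
    estimable = residual_cov               # grows quadratically
    return residual_cov, assumptions, estimable

for n in (3, 4, 10):
    print(n, statistics_counts(n))
```

For three datasets, all three estimable statistics are needed for the error covariances themselves; from four datasets onward, the quadratic surplus allows additional error cross-statistics to be estimated.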

For more than three datasets (

Note that almost all numbers presented above apply to the general case, in which any combination of error covariances and cross-covariances may be given
or assumed. While the interpretation of the numbers

This section gives the exact theoretical formulation of complete error covariance and cross-covariance matrices from
fully spatiotemporally collocated datasets. Similar to the 3-CH method, the errors are assumed to be random and independent among different realizations,
but with common error statistics for each dataset. The notation is introduced in Sect.

This first part of the mathematical theory includes the following new elements: (i) the separation of cross-statistics into a symmetric error dependency and an error asymmetry (Sect.

Suppose

Let

In the symmetric case, each element

Note that residual and error cross-covariance matrices are generally asymmetric in the non-scalar formulation presented here, but the following
relations hold for residual as well as (similarly) for error cross-covariance matrices:

The symmetric properties of residual and error covariances follow directly from their definition:

The sum of an (asymmetric) cross-covariance matrix and its transpose is denoted as

Although error cross-covariances may be asymmetric, the error dependency matrix is symmetric by definition:

Likewise, the sum of the residual cross-covariance matrices between

The difference between a cross-covariance matrix and its transpose is a measure of asymmetry in the cross-covariances and is, therefore, denoted as

Likewise, the difference between the residual cross-covariance matrices between

For real geophysical problems, the available statistical information comprises (i) the residual covariance matrices of each pair of datasets and (ii) the residual cross-covariance matrices between different residuals of datasets. The forward relations of residual covariances and residual cross-covariances as functions of error statistics are formulated in the following. For the estimation of error statistics, it is important to quantify the number of independent input statistics that determines the number of possible error estimations. Therefore, this section also includes an evaluation of the relation between residual cross-covariances and residual covariances in order to specify the additional information content of residual cross-covariances.

Each element

Thus, the complete residual covariance matrix of

Equation (

Note that, although the error dependency matrix is symmetric by definition, it is the sum of two error cross-covariances that are generally asymmetric
and, thus, differ in the non-scalar formulation. In the scalar case, the two error cross-covariances reduce to their common error cross-variance and the
residual covariance reduces to the scalar formulation of the variance, as shown in studies such as
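The forward relation between a residual covariance and the error statistics of the two datasets can be verified exactly at the population level by constructing errors as linear maps of a common standard normal vector. This sketch assumes the relation R_12 = P_1 + P_2 - (C_12 + C_12^T); the generator matrices G_i are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
m, q = 4, 6
G1, G2 = rng.standard_normal((m, q)), rng.standard_normal((m, q))

# Errors e_i = G_i @ z with z ~ N(0, I): population statistics follow exactly
P1, P2 = G1 @ G1.T, G2 @ G2.T        # error covariances
C12 = G1 @ G2.T                      # (asymmetric) error cross-covariance

# Residual e_1 - e_2 = (G1 - G2) @ z, hence its population covariance:
R12 = (G1 - G2) @ (G1 - G2).T

# Forward relation: R_12 = P_1 + P_2 - (C_12 + C_12^T)
assert np.allclose(R12, P1 + P2 - (C12 + C12.T))
```

Note that only the symmetric sum C12 + C12.T (the error dependency) enters the residual covariance, which is why residual covariances alone cannot constrain the asymmetric component.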

Each element

Equation (

In the following, it is demonstrated that combinations of residual cross-covariances contain the same statistical information as residual covariance matrices.

For

The relation between residual covariances and residual cross-covariances in Eq. (

In the general asymmetric case, Eq. (

Equation (

As an extension to previous work, this section provides generalized formulations of error covariances, cross-covariances, and dependencies in matrix
form. These formulations are based on the relations between residual and error statistics in Eqs. (

Equation (

Equation (

Given

Equation (

Because of changing signs, Eq. (

A formulation of each individual error dependency matrix as a function of the error covariances of the two datasets and their residual covariance
results directly from Eq. (

Being symmetric matrices, residual covariances can provide information neither on error asymmetries nor on the asymmetric components of error
cross-covariances. Only the symmetric component of error cross-covariances could be estimated from half the error dependency, which is equivalent to a
zero error asymmetry matrix:
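Continuing the population-level construction with errors e_i = G_i z, the dependency matrix and the symmetric-component approximation of the cross-covariance follow directly from the error and residual covariances. This is a sketch with hypothetical symbols:

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 4, 6
G1, G2 = rng.standard_normal((m, q)), rng.standard_normal((m, q))
P1, P2 = G1 @ G1.T, G2 @ G2.T
C12 = G1 @ G2.T
R12 = (G1 - G2) @ (G1 - G2).T        # population residual covariance

# Dependency recovered from the residual covariance:
D12 = P1 + P2 - R12
assert np.allclose(D12, C12 + C12.T)

# Residual covariances only constrain the symmetric part of C12:
C12_sym = 0.5 * D12
assert np.allclose(C12_sym, 0.5 * (C12 + C12.T))
```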

The general forward formulation of residual cross-covariances in Eq. (

The scalar formulation of Eq. (

Similarly to Eq. (

Two of the error cross-covariances in Eq. (

With this, Eq. (

Because the residual cross-covariances can be rewritten as

The forward formulation of residual cross-covariances does not allow for the elimination of any single error cross-covariance, even when multiple
equations are combined. One formulation of an error cross-covariance matrix as a function of residual cross-covariances results directly from the
forward relation:

Note that the third dataset

Any of the formulations of error cross-covariances can also be used for a formulation of the error dependency matrix

The equivalence demonstrates that, as they must be, the exact formulations of error statistics from residual covariances and cross-covariances are
consistent with each other. This consistency applies to the exact formulations of all symmetric error statistics (error covariances and dependencies)
and results from the consistent definitions of residual covariances and cross-covariances in Eqs. (

Based on the exact formulations in Sect.

In addition to the optimal extension to more than three datasets, this second part of the mathematical theory includes the following new elements:
(i) the analysis of differences between error estimates from residual covariances and cross-covariances (Sect.

As demonstrated in Sect.

The independence assumption resembles the innovation covariance consistency of data assimilation, where the residual covariance between background and
observation datasets – denoted as innovation covariance – is assumed to be equal to the sum of their error covariances in the formulation of the
analysis

Because all error cross-statistics need to be assumed in this setup, approximations of these cross-covariances and dependencies only reproduce the initially assumed statistics and do not provide any new information.

Assuming independent error statistics among all three datasets or, similarly, that error dependencies are negligible compared to residual covariances

In the scalar case, Eq. (

Under the assumption of independence among all three datasets

and, likewise,

As described in Sect.

Equations (

The three independent estimates of an error covariance matrix from the same pair of other datasets differ only with respect to their residual asymmetry. Thus,
differences between the estimates from Eqs. (

While the estimation from residual covariances remains symmetric by definition, the estimates of error covariances from residual cross-covariances may
become asymmetric. This asymmetry can be eliminated using the residual asymmetry matrix, which is also equivalent to averaging both formulations of
error covariances from residual cross-covariances:

All three estimates become equivalent if the residual cross-covariances and, thus, error cross-covariances are symmetric (
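The triangular estimate under the independence assumption can be sketched in matrix form at the population level, where R_ij = P_i + P_j holds exactly when cross-covariances vanish. All names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
m, q = 4, 6
G = [rng.standard_normal((m, q)) for _ in range(3)]
P = [Gi @ Gi.T for Gi in G]   # true error covariances of three mutually independent datasets

# Population residual covariances under independence: R_ij = P_i + P_j
R = {(i, j): P[i] + P[j] for i in range(3) for j in range(i + 1, 3)}

# Triangular estimates in matrix form, e.g., P_0 = (R_01 + R_02 - R_12) / 2
P_hat = [0.5 * (R[(0, 1)] + R[(0, 2)] - R[(1, 2)]),
         0.5 * (R[(0, 1)] + R[(1, 2)] - R[(0, 2)]),
         0.5 * (R[(0, 2)] + R[(1, 2)] - R[(0, 1)])]
```

When the independence assumption holds, each estimated matrix equals the true error covariance, including its spatial correlation structure.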

The independence assumption introduces the following absolute uncertainties

The absolute uncertainty in the estimates similarly depends on the (neglected) error cross-covariances or dependencies among the three datasets.
While the error dependencies to the two other datasets contribute positively, the dependency between the two others is subtracted. If these
dependencies cancel out (

Estimated error covariances might even contain negative values if error dependencies are large compared with the true error covariance of a dataset. If
the true error covariances differ significantly among highly correlated datasets, the neglected error dependency between two datasets might become
much larger than the smaller error covariance, e.g.,
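The following scalar, population-level example illustrates how a large neglected dependency can drive a triangular estimate negative. The numbers are ours, chosen so that dataset 2's error is exactly twice dataset 1's error:

```python
# Scalar illustration: dataset 2's error is a scaled copy of dataset 1's error,
# dataset 3's error is independent.
P1, P3 = 1.0, 1.0
P2 = 4.0 * P1                 # e_2 = 2 e_1, so Cov(e_1, e_2) = 2 P1 (dependency D_12 = 4 P1)

# Population residual variances (V_ij = P_i + P_j - D_ij):
V12 = P1 + P2 - 4.0 * P1      # = 1.0
V13 = P1 + P3                 # = 2.0
V23 = P2 + P3                 # = 5.0

# Triangular estimate under the (violated) independence assumption:
P1_hat = 0.5 * (V12 + V13 - V23)
print(P1_hat)   # -1.0: the neglected dependency drives the estimate negative
```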

While independence among all datasets is required to estimate the error covariances of three datasets (

As described in Sect.

For more than three datasets (

Similar to the estimation for three datasets (

Based on this, the remaining error covariances can be calculated sequentially. For each additional dataset

Similarly, each additional error covariance can be estimated from two residual cross-covariances with respect to its reference dataset

From the equivalence of residual statistics in Eq. (

Once the error covariances are estimated, the remaining residual covariances can be used to calculate the error dependencies to all other prior
datasets
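The sequential step can be sketched at the population level: once the error covariance of the reference dataset is available and the cross-statistic to the reference is assumed (here, zero), the new error covariance follows from one residual covariance, and the remaining residual covariances yield the dependencies to the other prior datasets. The construction with generator matrices is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
m, q = 3, 5
Z = np.zeros((m, q))
G1, G2, G3, G4 = (rng.standard_normal((m, q)) for _ in range(4))

# Stacked generator rows H_i so that e_i = H_i @ z, z ~ N(0, I):
H = [np.hstack([G1, Z, Z, Z]),                       # dataset 1 (reference)
     np.hstack([Z, G2, Z, Z]),                       # dataset 2
     np.hstack([Z, Z, G3, Z]),                       # dataset 3
     np.hstack([Z, 0.6 * G2, 0.6 * G3, G4])]         # dataset 4: correlated with 2 and 3,
                                                     #            independent of 1
P = [Hi @ Hi.T for Hi in H]                          # true error covariances
R = lambda i, j: (H[i] - H[j]) @ (H[i] - H[j]).T     # population residual covariances

# Sequential step: with P_1 already estimated and the dependency to the
# reference dataset 1 assumed zero, P_4 follows from one residual covariance:
P4_hat = R(3, 0) - P[0]
assert np.allclose(P4_hat, P[3])

# Remaining residual covariances then yield the dependencies to datasets 2 and 3:
D42_hat = P[1] + P4_hat - R(3, 1)
C42 = H[3] @ H[1].T
assert np.allclose(D42_hat, C42 + C42.T)
```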

In contrast to residual covariances, the asymmetric formulation of residual cross-covariances allows for an estimation of remaining error
cross-covariances, including their asymmetric components. The error cross-covariance to each other prior dataset

Based on this, the symmetric error dependencies can be estimated from their definition in Eq. (

Note that the error cross-covariances

As a generalization of Eq. (

Due to the changing sign of error dependencies along the series of datasets, the absolute uncertainty in the error covariance estimates does not
necessarily increase with the size of the polygon

The absolute uncertainty

The two sequential estimates of error covariances from residual covariances in Eq. (

With this, a series of reference datasets

According to Eq. (

Although Eq. (

The absolute uncertainties in the estimates of additional error cross-covariances based on residual cross-covariances can be determined recursively
using Eq. (

In contrast to error covariances, the uncertainties in the error cross-covariances accumulate over the two series of reference datasets. However, this sum is reduced by the two sums of uncertainties in the error covariances of these datasets, whose elements may partially cancel (not shown).

It can be shown that the sequential formulation of an error covariance from its reference dataset is consistent with the triangular formulation from
three independent datasets (in Sect.

This can also be generalized for the estimation of any error covariance

The consistency between direct and sequential error covariance estimates results directly from their common underlying definition of residual
covariances in Eq. (

Instead of using the sequential estimation for additional datasets

The sequential estimation of an error covariance becomes favorable if the error covariance estimate of its reference dataset is at least as accurate
as the assumed dependency between these two datasets

Note that the absolute uncertainties presented here only account for uncertainties due to the underlying assumptions regarding error cross-statistics and not
for those due to imperfect residual statistics arising, e.g., from finite sampling. A discussion of those effects for scalar problems can be found in

This section illustrates the capability to estimate full error covariance matrices for all datasets and some error dependencies. Three different
experiments are presented with four collocated datasets (

The error statistics of the four datasets consist of 10 matrices (

The three experiments differ with respect to the true error dependency between datasets

From the six residual covariances given, all four error covariances and two error dependencies can be estimated (

Based on this, the first setup uses a sequential estimation of the error covariance of the additional dataset 4 with respect to its reference dataset 1 from
Eq. (

In the following, the accuracy of the estimated error statistics from the two setups is evaluated for each experiment. In the first experiment in
Sect.

The plots in Figs. 2–4 are structured as follows: each subplot combines two covariance matrices – one shown in the upper-left part and the other in the lower-right part. Because all matrices involved are symmetric, it is sufficient to show only one-half of each matrix. The two matrices are separated by a thick diagonal gray bar and shifted off-diagonal so that diagonal variances are right above or below the gray bar, respectively. Statistics that might become negative are shown as absolute quantities in order to show them using the same color code. In each row, the upper-left parts are matrices that are usually unknown in real applications (as they require knowledge of the truth) and the lower-right parts are known/estimated matrices. The first row contains the error dependencies and residual covariances of each dataset pair. Here, gray asterisks in the upper-left subplot indicate that these error dependency matrices are assumed to be zero in the estimation. The second row contains the true and estimated error covariances and dependencies. The third row gives the absolute difference between the true and estimated matrices. Note that the lower-right part of each subplot in the third row does not contain any data.

Experiment 1: covariance matrices for four datasets (

Figure

By construction, the true error dependency matrices within the basic triangle – i.e., between

In contrast, the additional triangular estimate in the second setup assumes an additional independence between datasets

Experiment 2: covariance matrices for four datasets (

This experiment demonstrates the potential to accurately estimate complete error covariances and some dependencies for more than three datasets if the underlying assumptions are sufficiently fulfilled. Note that this accurate estimation is independent of the complexity of the statistics, such as spatial variations or correlations. It also shows that an inaccurate independence assumption in an error covariance – here, in the additional triangular estimation – may introduce uncertainties into all subsequent estimates of error covariances and dependencies, in accordance with the theoretical formulations above. The comparison of the two setups demonstrates the advantage of the sequential estimation for more than three datasets compared with using triangular estimations only.

Experiment 3: covariance matrices for four datasets (

Figures

Because both setups use the independent triangle

The two setups differ with respect to the estimation of the error covariance of dataset 4, which affects the estimated dependencies

For both setups, the uncertainties in the two estimated error dependencies

Consequently, the sequential estimation of the additional dataset 4 is more accurate in this experiment because the uncertainties in the basic
triangle

This changes in the third experiment in Fig.

The same holds for the error covariance estimates in the basic triangle

The more accurate error covariance estimate of dataset 4 with the second setup also leads to more accurate estimates of the two error
dependencies

The large variation in the uncertainties in the error estimates from the two setups among the different experiments demonstrates the importance of selecting
an appropriate setup for the error estimation problem, which will be discussed in Sect.

This section provides a summary of the statistical error estimation method proposed in this study, with focus on its technical application.
Section

This section provides a conceptual discussion of different conditions that need to be fulfilled in order to be able to solve the error estimation problem. The discussion is based on the previous sections, but it is formulated in a qualitative way without providing mathematical details.

For error statistics that need to be assumed, the specific formulation may take different forms. The simplest and most common assumption is to set
their error correlations and, thus, the error cross-covariances and dependencies to zero. This assumption (used in Sect.

The number of error statistics that can be estimated for a given number of datasets (

In the first step, some error covariances need to be estimated directly “from scratch”, i.e., with no other error covariances available. Given the
basic formulation of residual covariances in Eq. (

However, the resulting equation, which involves a closed series of residuals, cannot always be solved for the initial error covariance. For fewer than
three residuals involved (

In addition, the error cross-covariances or dependencies between each involved dataset pair have to be assumed in order to close the estimation
problem. Thus, the initial error covariance can only be estimated from a closed series of

In the second step, all remaining error covariances can be estimated sequentially from their residual to a prior dataset – denoted as the reference
dataset – with previously estimated error covariance (see Sect.

Based on this, two general rules for the setup of datasets can be formulated that ensure the solvability of the problem in the case that all error
covariances and as many error cross-statistics as possible (cross-covariances or dependencies) are estimated:

all error cross-statistics along a closed series of dataset pairs, for which the number of involved datasets is odd and at least three, are needed (this closed series of datasets is called the “basic polygon” or, in the case of three datasets, the “basic triangle”) and

at least one error cross-statistic of each additional dataset to any prior datasets is needed (this prior dataset is called the “reference dataset” of the additional dataset).
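One way to encode these two rules is as a graph check: treat each assumed cross-statistic as an edge between datasets; the setup is then solvable when the graph is connected (every dataset reaches a reference) and contains an odd cycle (the basic polygon). The helper below is hypothetical and only restates the rules above:

```python
from collections import deque

def solvable(n, assumed_pairs):
    """Check whether a set of assumed cross-statistics allows all n error
    covariances to be estimated: the graph of assumed pairs must be
    connected and contain an odd cycle (the basic polygon)."""
    adj = {i: set() for i in range(n)}
    for i, j in assumed_pairs:
        adj[i].add(j)
        adj[j].add(i)
    # BFS two-coloring: connectivity and bipartiteness in one pass
    color = {0: 0}
    queue, odd_cycle = deque([0]), False
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in color:
                color[v] = 1 - color[u]
                queue.append(v)
            elif color[v] == color[u]:
                odd_cycle = True   # same color on both ends: an odd cycle exists
    connected = len(color) == n
    return connected and odd_cycle
```

For example, a basic triangle plus one reference edge per additional dataset satisfies both rules, whereas an even cycle (no basic polygon) or a dataset without any assumed cross-statistic does not.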

Previously,

An illustrative example of assumed dependencies for

Independence tree: illustrative example of assumed error dependencies (gray lines) between 10 datasets (colored dots). The assumed dependencies in the basic triangle (1; 2; 3) are indicated by thicker lines. An alternative setup with a basic pentagon is indicated using the dotted line (3; 6) instead of the lighter gray line (1; 3).

The general rules given in Sect.

Improved independence tree: the same as Fig.

The relative accuracy of an error covariance estimate is proportional to the ratio between the residual covariance

The maximal residual-to-dependency ratio is equivalent to the minimal uncertainty in the normalized error correlations

While uncertainties in the basic polygon only contribute half to the subsequent uncertainties, they affect the estimations of error statistics of all subsequent datasets (see Sect.

Furthermore, it is also possible to average the estimated error statistics of a dataset from multiple pairwise independent polygons, similar to an
application of the N-cornered hat method

Despite the generalized matrix formulation, the main features of the presented approach are (i) its generality defining a flexible setup for any
number of datasets according to the specific application, (ii) its optimality with respect to a minimal number of assumptions required, and (iii) its
suitability to include expected nonzero dependencies between any pair of datasets. In contrast, the scalar N-CH method (N-cornered hat method)
averages all estimates of each dataset, which is equivalent to assuming that the independence assumption among each dataset triplet is fulfilled with
the same accuracy. However, this is not the case for most applications to geophysical datasets. For example,

An important application of the presented method is expected to be numerical weather prediction (NWP), where short-term forecasts from multiple national centers can be used to estimate the error statistics required for data assimilation. In contrast to previous statistical methods, potential dependencies among the forecasts, i.e., due to the assimilation of similar observations, can be considered in the error estimation and even explicitly quantified. Future work will show how this statistical approach compares to state-of-the-art background error estimates based on computationally expensive Monte Carlo-based or ensemble-based methods. While the presented method can be formulated to provide symmetric error covariances, a risk remains that negative values might occur for real applications due to inaccurate assumptions or sampling uncertainties.

In comparison to a posteriori methods that statistically estimate optimal error covariances for data assimilation, an a priori error estimation of
collocated datasets has three main advantages: (i) optimal error statistics are calculated analytically without requiring an iterative minimization
including multiple executions of the assimilation, (ii) complete covariance matrices provide spatially resolved fields of error statistics at each
collocated location including spatial- and cross-species correlations, and (iii) error statistics of all datasets are estimated without selecting one
dataset as a reference. This enables the consideration of more than two datasets in the assimilation. Given sufficiently accurate error statistics,
the final analysis with respect to all datasets will be closer to the truth than any analysis between two datasets only. Thus, the rapidly increasing
number of geophysical observations and model forecasts enables improved analyses due to increasingly overlapping datasets, and the optimal error
statistics can be calculated, for example, with the method presented here. Specifically, the possibility of estimating optimal error cross-covariances
between datasets provides important information for data assimilation, in which the violation of the independence assumption remains a major challenge

However, current data assimilation schemes are not suited for multiple overlapping datasets, and cross-errors between datasets are assumed to be negligible. In contrast, the statistical error estimation method presented in this study is explicitly tailored to multiple datasets that cannot be assumed to be independent. Thus, the estimated error covariances are not consistent with assimilation algorithms assuming (two) independent datasets. If the estimated error dependencies among all assimilated datasets are small, the independence assumption may be regarded as sufficiently fulfilled. The error estimation method then provides error covariances for assimilation and information on the accuracy of the independence assumption. Otherwise, generalized assimilation schemes need to be developed for a proper use of this additional statistical information in data assimilation. Although this increases complexity, such generalized assimilation schemes enable fundamental improvements in terms of an optimal analysis from multiple datasets with respect to their error covariances and cross-statistics.

The general estimation procedure of error statistics for

Iterative calculation of error covariances and dependencies for

Iterative calculation of error covariances and cross-covariances for

Algorithm

The equations relate to the general exact formulations, which require some error dependencies or cross-covariances to be given (see Sect.
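The overall procedure can be sketched end to end for four datasets at the population level. This sketch assumes an independent basic triangle (datasets 0, 1, 2 in zero-based indexing) and independence of the additional dataset 3 from its reference dataset 0; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
m, q = 3, 5
Z = np.zeros((m, q))
G1, G2, G3, G4 = (rng.standard_normal((m, q)) for _ in range(4))

# Errors e_i = H_i @ z: datasets 0-2 mutually independent (basic triangle),
# dataset 3 independent of its reference dataset 0 but correlated with 1 and 2.
H = [np.hstack([G1, Z, Z, Z]),
     np.hstack([Z, G2, Z, Z]),
     np.hstack([Z, Z, G3, Z]),
     np.hstack([Z, 0.5 * G2, 0.5 * G3, G4])]
P_true = [Hi @ Hi.T for Hi in H]
R = {(i, j): (H[i] - H[j]) @ (H[i] - H[j]).T
     for i in range(4) for j in range(i + 1, 4)}   # the six known residual covariances

# Step 1: triangular estimates within the basic triangle
P_hat = [0.5 * (R[(0, 1)] + R[(0, 2)] - R[(1, 2)]),
         0.5 * (R[(0, 1)] + R[(1, 2)] - R[(0, 2)]),
         0.5 * (R[(0, 2)] + R[(1, 2)] - R[(0, 1)])]

# Step 2: sequential estimate for dataset 3 from its reference dataset 0
P_hat.append(R[(0, 3)] - P_hat[0])

# Step 3: dependencies of dataset 3 to the remaining prior datasets
D_hat = {(1, 3): P_hat[1] + P_hat[3] - R[(1, 3)],
         (2, 3): P_hat[2] + P_hat[3] - R[(2, 3)]}
```

All four error covariance matrices and the two remaining dependencies are recovered exactly here because the underlying independence assumptions are fulfilled by construction.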

The data from the three synthetic experiments are available at

AV developed the approach, derived the theory, performed the experiments, and wrote the manuscript. RM supervised the work and revised the manuscript.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors thank the editor, Olivier Talagrand; Ricardo Todling; and the two anonymous reviewers for their exceptionally thoughtful and valuable feedback on the manuscript.

This paper was edited by Olivier Talagrand and reviewed by Ricardo Todling and two anonymous referees.