Revising the stochastic iterative ensemble smoother

Ensemble randomized maximum likelihood (EnRML) is an iterative (stochastic) ensemble smoother, used for large and nonlinear inverse problems, such as history matching and data assimilation. Its current formulation is overly complicated and has issues with computational costs, noise, and covariance localization, even causing some practitioners to omit crucial prior information. This paper resolves these difficulties and streamlines the algorithm, without changing its output. These simplifications are achieved through the careful treatment of the linearizations and subspaces. For example, it is shown (a) how ensemble linearizations relate to average sensitivity, and (b) that the ensemble does not loose rank during updates. The paper also draws significantly on the theory of the (deterministic) iterative ensemble Kalman smoother (IEnKS). Comparative benchmarks are obtained with the Lorenz-96 model with these two smoothers and the ensemble smoother using multiple data assimilation (ES-MDA).

Consider the problem of estimating an unknown, high-dimensional parameter vector x ∈ R M , given the observation y ∈ R P .
It is assumed that where the (generic) forward/observation model, M, is known and typically nonlinear, and the observation error, δ, is random noise, giving rise to a likelihood, p(y|x).
In the Bayesian paradigm, prior information is quantified as a probability density function (pdf) called the prior, denoted p(x), and the truth, x, is considered a draw thereof. The inverse problem then consists of computing and representing the 5 posterior which, in principle, is given by pointwise multiplication: quantifying the updated estimation of x. Due to the noted high-dimensionality and nonlinearity, this can be challenging, necessitating approximate solutions.
The prior is assumed Gaussian, with mean µ x and covariance C x , i.e.
For now, the prior covariance matrix, C x , is assumed invertible such that the corresponding norm, x 2 Cx = x T C −1 x x, is defined. Note that vectors are taken to have column orientation, and that x T denotes the transpose.
The observation error, δ, is assumed drawn from: whose covariance, C δ , will always be assumed invertible. Then, assuming δ and x are independent and recalling equation (1),

The algorithm
The Monte-Carlo approach offers a convenient representation of distributions as samples. Here, the prior is represented by 20 the "prior ensemble", {x n } N n=1 , whose members (sample points) are assumed independently drawn from it. RML is a relatively efficient method to approximately "condition" (i.e. implement (2) on) the prior ensemble, using optimization. Firstly, an ensemble of perturbed observations, {y n } N n=1 , is generated as y n = y + δ n , where δ n is independently drawn according to equation (4).
Then, the n-th "randomized log-posterior", J x,n , is defined by Bayes' rule (2), except with the prior mean and the observation 25 replaced by the n-th members of the prior and observation ensembles: The two terms are referred to as the model mismatch (log-prior) and data mismatch (log-likelihood), respectively.
Finally, these log-posteriors are minimized. Using the Gauss-Newton iterative scheme (for example) requires (7a) its gradient and (7b) a first-order approximation to its Hessian, both evaluated at the current iterate, labelled x n,i for each member n and 30 iteration i. To simplify the notation, define x • = x n,i . Objects evaluated at x • are similarly denoted; for instance, M • = M (x • ) denotes the Jacobian of M evaluated at x • , and Application of the Gauss-Newton scheme yields: where the prior (or model) and likelihood (or data) increments are respectively given by: which can be called the "precision matrix" form.

40
Alternatively, by corollaries of the well known Woodbury identity, the increments can be written in the "Kalman gain" form: where K • is the gain matrix: with As the subscript suggests, C y may be identified as the prior covariance of the observation, equation (1). Note that if P M , then the inversion of C y for the Kalman gain form is significantly cheaper than the inversion to compute C • .

EnRML
Ensemble-RML (EnRML) is an approximation of RML where the ensemble is used in its own update, by estimating C x and M • . This section derives EnRML, and gradually introduces the new improvements.
Computationally, compared to RML, EnRML offers the simultaneous benefits of working with low-rank representations of covariances, and not requiring a tangent-linear (or adjoint) model. Both advantages will be further exploited in the new 55 formulation of EnRML.
Concerning their sampling properties, a few points can be made. Firstly (due to the ensemble covariance), EnRML is biased for finite N , even for a linear-Gaussian problem, for which RML will sample the posterior correctly. This bias arises for the same reasons as in the ensemble Kalman filter (EnKF, van Leeuwen, 1999;Sacher and Bartello, 2008). Secondly (due to the ensemble linearization), EnRML effectively smoothes the likelihood. It is therefore less prone to getting trapped in 60 local maxima of the posterior (Chen and Oliver, 2012). Sakov et al. (2017) explain this by drawing an analogy to the secant method, as compared to the Newton method. Hence, it may reasonably be expected that EnRML yield constructive results if the probability mass of the exact posterior is concentrated around its global maximum. Although this regularity condition is rather vague, it would require that the model be "not too nonlinear" in this neighbourhood. Conversely, EnRML is wholly inept at reflecting multimodality introduced through the likelihood, and so RML may be better suited when local modes feature prominently, as is quite common in problems of subsurface flow (Oliver and Chen, 2011). However, while RML has the ability to sample multiple modes, it is difficult to predict to what extent their relative proportions will be correct (without the costly use of Metropolis-Hastings). Further comparison of the sampling properties of RML and EnRML was done by Evensen (2018b).

Ensemble preliminaries
For convenience, define the concatenations: which are known as the "ensemble matrix" and the "perturbation matrix", respectively.
Projections sometimes appear through the use of linear regression. We therefore recall (Trefethen and Bau, 1997) that a (square) matrix Π is an orthogonal projector if For any matrix A, let Π A denote the projector whose image is the column space of A, implying that The (Moore-Penrose) pseudo-inverse, A + , may be used to express the projector: Here, the second equality follows from the first by equation (15) and (A + ) T = (A T ) + . The formulae simplify further in terms of the SVD of A.
Now, denote 1 ∈ R N the (column) vector of ones, and let I N be the N -by-N identity matrix. The matrix of anomalies, X, is defined and computed by subtracting the ensemble mean, x = E1/N , from each column of E. It should be appreciated that 85 this amounts to the projection: where Π ⊥ 1 = I N − Π 1 , with Π 1 = 11 T /N .
Definition 1 (The ensemble subspace). The flat (i.e. affine subspace) given by: Similarly to section 2, iteration index (i > 0) subscripting on E, X, and other objects, is used to indicate that they are 90 conditional (i.e. posterior). The iterations are initialized with the prior ensemble: x n,0 = x n .

The constituent estimates
The ensemble estimates of C x and M • are the building blocks of the EnRML algorithm. The canonical estimators are used, namely the sample covariance (19a), and the least-squares linear regression coefficients (19b). They are denoted with the overhead bar: The anomalies at iteration i are again given by Π ⊥ 1 X + i = X + i , which follows from Π(AΠ) + = (AΠ) + (valid for any matrix A and projector Π Maciejewski and Klein, 1985).
Note that the linearization (previously M • , now M i ) no longer depends on the ensemble index, n. Indeed, it has been called "average sensitivity" since the work of Zafari et al. (2005); Reynolds et al. (2006); Gu et al. (2007). The formula (19b) for M i is sometimes arrived at via a Taylor expansion of M around x i , but this requires further, indeterminate approximations 10 to obtain any other interpretation than M (x i ): the Jacobian evaluated at the ensemble mean. Instead, the "average sensitivity/derivative/gradient" description suggest that where the subscript i has been temporarily dropped for clarity. However, equation (20) does not appear to have been spelled out in the literature, and the sense in which it holds has not yet been established; this is accomplished by Theorem 1.

15
Theorem 1 (Regression coefficients versus derivatives). Let x be drawn from the distribution of the ensemble (e.g., the prior or posterior of any iteration): a Gaussian. Then with "almost sure" convergence, and expectation (E) in x. Regularity conditions and proof in appendix A. Note: the expectation could also be defined using the ensemble itself, since

20
Note that the generality of Theorem 1 is restricted by its Gaussianity assumption. Thus, for generality and precision, M i should simply be labelled "the least-squares (linear) fit" of M, based on E i .
Finally, note that the computation (19b) of M i seemingly requires calculating a new pseudo-inverse, X + i , at each iteration, i; this is addressed in section 3.6.
The prior covariance estimate (previously C x , now C x ) is not assumed invertible, in contrast to section 2. It is then not 25 possible to employ the precision matrix forms (9) because C −1 x is not defined. Using the C + x in its stead is flawed and damaging because it is zero in the directions orthogonal to the ensemble subspace, so that its use would imply that the prior is assumed infinitely uncertain (i.e. flat) as opposed to infinitely certain (like a delta function) in those directions. Instead, as shown in the following, one should employ ensemble subspace formulae, or equivalently, the Kalman gain form. The ensemble estimates (19) are now substituted into the Kalman gain form of the update, equation (10) to (12). The ensemble estimate of the gain matrix, denoted K i , thus becomes: where Y i has been defined as the prior (i.e. unconditioned) anomalies, under the action of the i-th iterate linearization:

Estimating the Kalman gain
A Woodbury corollary (again, no implicit pseudo-inverting), can be used to express K i as: with 40 The reason for labelling this matrix with the subscript w is revealed later. For now, note that, in the common case of N P , the inversion in equation (25) is significantly cheaper than the inversion in equation (22). Another computational benefit is that C w,i is non-dimensional, meaning that data with small magnitude will not be "perceived" as noise by numerical decomposition routines.
In conclusion, the likelihood increment (10b) is now estimated as: This is efficient because M i does not explicitly appear in K i (neither in formula (22) nor (24)), even though it is implicitly present through Y i (23), where it multiplies X. This absence (a) is reassuring, as the product Y i constitutes a less noisy estimate than just M i alone (Chen and Oliver, 2012;Emerick and Reynolds, 2013b, figures 2 and 27, resp.); (b) constitutes a computational advantage, as will be shown in section 3.6; (c) enables leaving the type of linearization made for M unspecified, 50 as is usually the case in EnKF literature.

Estimating the prior increment
In contrast to the likelihood increment (10b), the Kalman gain form of the prior increment (10a) explicitly contains the sensitivity matrix, M • . In response, consider the change of variables:
Denote w • the control vector such that x(w • ) = x • , and note that x(e n ) = x n , where e n is the n-th column of the identity , and the prior increment (10a) with the ensemble estimates becomes:

60
where there is no explicit M i , which only appears implicitly through Y i = M i X, as defined in equation (23) Alternatively, applying the subspace formula (24) and using I N = C w,i (C w,i ) −1 yields: 3.5 Justifying the change of variables Lemma 1 may be proven by noting that X is the leftmost factor in K i , and using induction on equations (10a) and (10b).
Alternatively, it can be deduced (Raanes et al., 2018) as a consequence of the implicit assumption on the prior that x ∼ N (x, C x ). A stronger result, namely col(X i ) = col(X), is conjectured in appendix A, but Lemma 1 is sufficient for the present purposes: it implies that there exists w • ∈ R N such that x(w • ) = x • for any ensemble member and any iteration. Thus, 70 the lemma justifies the change of variables (27).
Moreover, using the ensemble control vector (w) is theoretically advantageous as it inherently embodies the restriction to the ensemble subspace. A practical advantage is that w is relatively low-dimensional compared to x, which lowers storage and accessing expenses.

Simplifying the regression 75
Recall the definition of equation (23): Avoiding the explicit computation of M i used in this product between the iteration-i estimate M i and the initial (prior) X was the motivation behind the modification C x ← C x,i by Chen and Oliver (2013b). Here, instead, by simplifying the expression of the regression, it is shown how to compute Y i without first computing M i .

The transform matrix 80
Inserting the regression M i (19b) into the definition (23), where T + i = X + i X has been defined, apparently requiring the pseudo-inversion of X i for each i. But, as shown in appendix A2, which only requires the one-time pseudo-inversion of the prior anomalies, X. Then, since the pseudo-inversion of is a relatively small calculation, this saves computational time.
The symbol T has been chosen in reference to deterministic, square-root EnKFs. Indeed, pre-multiplying equation (31) by X and recalling equation (17) and Lemma 1 produces X i = XT i . Therefore, the "transform matrix", T i , describes the conditioning of the anomalies (and covariance).
Inversely, equation (30) can be seen as the "de-conditioning" of the posterior observation anomalies. This interpretation of Y i 90 should be contrasted to its definition (23), which presents it as the prior parameter anomalies "propagated" by the linearization of iteration i. The two approaches are known to be "mainly equivalent" in the deterministic case . To our knowledge, however, it has not been exploited for EnRML before now, possibly because the proofs (appendix A2) are a little more complicated in this stochastic case.

From the ensemble controls
The ensemble matrix of iteration i can be written: where the columns of W i ∈ R N ×N are the ensemble control vectors (27). Post-multiplying equation (32) by Π ⊥ 1 to get the anomalies produces: This seems to indicate that W i Π ⊥ 1 is the transform matrix, T i , discussed in the previous subsection. However, they are not 10 fully equal: inserting X i from (33) into (31) yields: i.e. they are distinguished by Π X T = X + X: the projection onto the row space of X.
Appendix A3 shows that, in most conditions, this pesky projection matrix vanishes when T i is used in equation (30):

15
In other words, the projection Π X T can be omitted unless M is nonlinear and the ensemble is larger than the unknown parameter's dimensionality.
A well known result of Reynolds et al. (2006) is that the first step of the EnRML algorithm (with W 0 = I N ) is equivalent to the EnKF. However, the standard definition of the EnKF uses cross-covariances rather than an explicit M 0 to define the Kalman gain, and this corresponds to a Y 0 that never contains Π X T . The following section explains why it should be so for 20 EnRML too.

Linearization chaining
Consider applying the change of variables (27) to w at the very beginning of the derivation of EnRML. Since X1 = 0, there is a redundant degree of freedom in w, meaning that there is a choice to be made in deriving its density from the original one, given by J x,n (x) in equation (6). The simplest choice (Bocquet et al., 2015) results in the log-posterior: Application of the Gauss-Newton scheme with the gradients and Hessian of J w,n , followed by a reversion to x, produces the EnRML algorithm as developed above.
The derivation summarized in the previous paragraph is arguably simpler than that of the last few pages. Notably, (a Therefore, the definition Numerical experiments, as in section 4 but not shown, indicate no statistically significant advantage for either version. This corroborates similar findings by Sakov et al. (2012) for the deterministic flavour. Nevertheless, there is a practical advantage: avoiding the computation of Π X T .

Inverting the transform
In square-root ensemble filters, the transform matrix should have 1 as an eigenvector (Sakov and Oke, 2008;Livings et al., 2008). By construction, this also holds true for W i Π ⊥ 1 , with eigenvalue 0. Now, consider adding 0 = XΠ 1 to equation (33), yielding another valid transformation:

50
The matrix Ω i , in contrast to W i Π ⊥ 1 and T i , has eigenvalue 1 for 1, and can be shown to be invertible (Lemma 2, appendix A3). This is convenient for proving equation (35), as is done in appendix A3, where Y i is initially expressed in terms of Ω −1 i . Note, however, that this version requires centring M(E i ) before post-multiplying by Ω −1 i . In real applications it is commonplace to use a stable linear solver in place of any inversion. Reflecting this, Algorithm 1 persists with (W i Π ⊥ 1 ) + rather than Ω −1 i . However, in this pseudo-inversion, all N −1 non-zero singular values should be 55 retained, and no truncation threshold should be used, because all components of W i are equally important (unlike the old EnRML algorithm, where a X and/or X i was scaled decomposed, and truncated). If, by extreme chance (cf. the conjecture in appendix A) combined with poor numerical precision or subroutines, the rank of W i or its pseudo-inversion is lower, this will invalidate J w,n and the algorithm unless compensated for by pre-multiplying the prior increment on line 8 by that same projection.

Algorithm
To summarize, Algorithm 1 provides pseudo-code for the new EnRML formulation. The increments ∆ lklhd (26) and ∆ prior (29) can be recognized by pre-multiplying line 10 by X. For aesthetics, the sign of the gradients has been reversed. Note that there is no need for an explicit iteration index. Nor is there an ensemble index, n, since all N columns are stacked into the matrix W.
However, in case M is large, Y may be computed column-by-column to avoid storing E. The product WΠ ⊥ 1 is computed by 65 subtracting the column mean of W. Its pseudo-inverse on line 6 should retain all N −1 non-zero singular values, as discussed in section 3.6.4. Line 9 may be computed using a reduced or truncated SVD of C −1/2 δ Y, which is relatively fast for N both larger and smaller than P . Alternatively, the Kalman gain forms could be used. The new EnRML algorithm produces results that are identical to the old formulation, at least up to round-off and truncation errors, and for N − 1 ≤ M . Therefore, since there is already a large number of studies of EnRML with reservoir cases (e.g., Chen and Oliver, 2013a;Emerick and Reynolds, 2013b), adding to this does not seem necessary.
However, there does not appear to be any studies of EnRML with the Lorenz-96 system (Lorenz, 1996) in a data assimilation setting. The advantages of this case are numerous: (a) the model is a surrogate of weather dynamics, and as such holds relevance 80 in geoscience; (b) the problem is (exhaustively) sampled from the system's invariant measure, rather than being selected by the experimenter; (c) the sequential nature of data assimilation inherently tests prediction skill, which helps avoid the pitfalls of point measure assessment, such as overfitting; (d) its simplicity enhances reliability and reproducibility, and has made it a literature standard, thus facilitating comparative studies.
Comparison of the benchmark performance of EnRML will be made to the IEnKS, and ensemble multiple data assimilation 85 (ES-MDA) 1 , both its stochastic and deterministic (square-root) flavour. Not included in the benchmark comparisons is the version of EnRML where the prior increment is dropped (cf. section 1.1). This is because the chaotic, sequential nature of this case makes it practically impossible to achieve good results without propagating prior information. Similarly, as they lack a dynamic prior, this precludes "regularizing, iterative ensemble smoothers" (Iglesias, 2015), (Luo et al., 2015), 2 (Mandel et al., 2016) 3 , even if their background is well-tuned, and their stopping condition judicious. Because they require the tangent-linear 90 model, M • , RML and EDA/En4DVar (Tian et al., 2008;Bonavita et al., 2012;Jardak and Talagrand, 2018) are not included.
For simplicity, localization will not be used, nor covariance hybridization. Other, related methods may be found in the reviews of Bannister (2016); Carrassi et al. (2018).

Setup
The performances of the iterative ensemble smoother methods are benchmarked with "twin experiments", using the Lorenz-96 dynamical system, which is configured with standard settings (e.g., Ott et al., 2004;Bocquet and Sakov, 2014), detailed below.
10 1 Note that this is MDA in the sense of Emerick and Reynolds (2013a), where the annealing itself yields iterations, and not in the sense of quasi-static assimilation (Pires et al., 1996;Bocquet and Sakov, 2014;Fillion et al., 2018), where it is used as an auxiliary technique. 2 Their Lorenz-96 experiment only concerns the initial conditions. 3 Their Lorenz-96 experiment seems to have failed completely, with most of the benchmark scores (their Figure 5) indicating divergence, which makes it pointless to compare benchmarks. Also, when reproducing their experiment, we obtain much lower scores than they report for the EnKF. One possible explanation is that we include, and tune, inflation.

13
Nonlin. Processes Geophys. Discuss., https://doi.org/10.5194/npg-2019-10 Manuscript under review for journal Nonlin. Processes Geophys. The iterative smoothers are employed for the filtering problem, aiming to estimate x(t) as soon as y(t) comes in. In so doing, they also tackle the smoothing problem for x(t − L), where the data assimilation window has been fixed at L = 0.4, which is near optimal (cf. Bocquet et al., 2013, Figures 3 and 4). This window is shifted by 1 × ∆t obs each time a new observation becomes available. A post-analysis inflation factor is tuned for optimal performance for each smoother and each ensemble size, N . Also, random rotations are used to generate the ensembles for the square-root variants. The number of iterations is fixed, 15 either at 3 or 10. No tuning of the step length is undertaken: it is 1/3 or 1/10 for ES-MDA, and 1 for EnRML and the IEnKS.
The smoothers are assessed by their accuracy, as measured by root-mean squared error:  Figure 7a).

25
As we expect, Figure 1 shows that the performance of all of the smoothers improve with increasing N , which needs to be at least 15 for tolerable performance, corresponding to the rank of the unstable subspace of the dynamics plus one (Bocquet and Carrassi, 2017). Of course, all of the scores are lower for the left pane where ∆t obs = 0.2, compared to the right pane where ∆t obs = 0.4. The deterministic (square-root) IEnKS and ES-MDA score noticeably lower RMSE averages than the stochastic IEnKS 30 (i.e. EnRML) and ES-MDA, which require N closer to 30 for tolerable performance. Also tested (not shown) was the firstorder-approximate deterministic flavour of ES-MDA (Emerick, 2018), which systematically performed slightly worse than the square-root flavour.
It appears that 3 iterations is largely sufficient, since its markers are rarely significantly higher than those of 10 iterations, the exceptions all occurring when the ensemble size is close to the lower limit of the tolerable performance range.

35
Between the two stochastic smoothers (EnRML and stochastic ES-MDA) there is no clear-cut advantage. Among the deterministic smoothers, the IEnKS performs slightly better than ES-MDA, though this is hardly significant. This result came as a surprise because, in contrast with EnRML/IEnKS which can iterate indefinitely, we thought that ES-MDA would suffer from occasionally not "reaching" the optimum.
One explanation could be that EnRML/IEnKS need a lowering of the step lengths, possibly as a function of the iteration 40 number, to avoid causing "unphysical" states, and to avoid "bouncing around" near the optimum. Along with the related MDAinflation parameter, tuning of the step length has been subject of several studies (Chen and Oliver, 2012;  the reason why most expositions of the EnKF do not go the length of declaring that its implicit linearization of M is that of least-squares linear regression. Section 3.6.3 showed that the projection is merely the result of using the chain rule for indirect regression to the ensemble space, and argued that it is preferable to use the direct regression of the standard EnKF. The other focus of the derivation was rank issues, with C x not assumed invertible. Using the Woodbury matrix lemma, 60 and avoiding implicit pseudo-inversions and premature insertion of SVDs, it was shown that the rank deficiency invalidates the Hessian form of the RML update, which should be restricted to the ensemble subspace. On the other hand, the subspace form and Kalman gain form of the update remain equivalent and valid. Furthermore, Theorem 2 of appendix A prove that the ensemble does not lose rank during the updates of EnRML (or EnKF).
The paper has also drawn significantly on the theory of the deterministic counterpart to EnRML: the iterative ensemble Kalman smoother (IEnKS). Comparative benchmarks using the Lorenz-96 model with these two and the ensemble multiple data assimilation (ES-MDA) smoother were shown in section 4. As in the non-iterative case (e.g., Sakov and Oke, 2008), the deterministic smoothers achieved better accuracy than the stochastic methods. Surprisingly, there was is little performance difference between ES-MDA and EnRML/IEnKS. x , almost surely. The equality to E[M (x)] follows directly from "Stein's lemma" (Liu, 1994). 75 Theorem 2 (EnKF rank preservation). The posterior ensemble's covariance, obtained using the EnKF, has the same rank as the prior's, almost surely (a.s.).
Proof. The updated anomalies, both for the square-root and the stochastic EnKF, can be written X a = XT a for some T a ∈ R N ×N .
For the stochastic EnKF, equations (24) and (26) may be used to show that T a = (N −1)C w ΥΠ ⊥ 1 , with Υ = I N + Y T C −1 δ D/(N −1). Hence, for rank preservation, it will suffice to show that Υ is a.s. full rank. We begin by writing Υ more compactly: From equations (4), (14) and (A1) it can be seen that column n of Z follows the law z n ∼ N (0, I P /(N −1)). Hence, column n of Υ follows υ n ∼ N (e n , S T S/(N −1)), and has sample space: Since υ n is absolutely continuous with sampling space S n , equation (A3) means that the probability that υ n ∈ col([Υ :n−1 , I n: ]) is zero. This implies H n a.s., establishing the induction. Identifying the final hypothesis (H N ) with rank(Υ) = N concludes the proof for the EnKF.

5
A corollary of Theorem 2 and Lemma 1 is that the ensemble subspace is also unchanged by the EnKF update. Note that both the prior ensemble and the model (involved through Y) are arbitrary in Theorem 2. However, C δ is assumed invertible. The result is therefore quite different from the topic discussed by Kepert (2004); Evensen (2004), where rank deficiency arises due to a reduced-rank C δ .
Conjecture 1. The rank of the ensemble is preserved by the EnRML update (a.s.) and W i is invertible. 10 We were not able to prove Conjecture 1, but it seems a logical extension of Theorem 2, and is supported by numerical trials.
The following proofs utilize Conjecture 1, without which some projections will not vanish. Yet, even if Conjecture 1 should not hold (due to bugs, truncation, or really bad luck), Algorithm 1 is still valid and optimal, as discussed in sections 3.6.3 and 3.6.4.

15
Proof. Let T = X + X i and S = X + i X. The following shows that S satisfies the four properties of the Moore-Penrose characterization of the pseudo-inverse of T: 4. The symmetry of ST is shown as for point 3.
This proof was heavily inspired by appendix A of Sakov et al. (2012). However, our developments apply for EnRML (rather than the deterministic, square-root IEnKS). This means that T i is not symmetric, which complicates the proof in that the focus must be on X + X i rather than X + i alone. Our result also shows the equivalence of S + and T in general, while the additional 25 result of the vanishing projection matrix in the case of N − 1 ≤ M is treated as a corollary, shown in the following.

30
Recall that equation (34) was obtained by inserting X i in the expression (31) for T i . The following uses the alternative of inserting X in the expression (30) for T + i . By equation (36) and Lemma 2, X = X i Ω −1 i and so T + i = Π X T i Ω −1 i . We now re-introduce Π ⊥ 1 , which was omitted for equation (19b), by prepending it to T + i ; this does not change its value. In summary, equation (30) becomes: Next, it is shown that, under certain conditions, the projection matrix Π X T i vanishes: Thereafter, appendix A4 can be used to write Ω −1 i in terms of (W i Π ⊥ 1 ) + .
The case of N −1 ≤ M In the case of N −1 ≤ M , the null space of X is the range of 1 (with probability 1, Muirhead, 1982, Theorem 3.1.4). By 40 Lemma 2, the same applies for X i , and so Π X T i in equation (A4) reduces to Π ⊥ 1 .

The case of linearity
Let M be the matrix of the observation model M, here assumed linear: M(E i ) = ME i . By equation (A4), A4 The pseudo-inverse version 45 The results of this section do not depend on whether the projection Π X is included in Y i or not. Either way, Y i 1 = 0, and so C −1 w,i 1 = 1 = C w,i 1 , where C w,i is defined in equation (25), and the second equality follows from the first. Similarly, the following identities are valid also when W i and W −1 i are swapped.
Equation (A7) is proven inductively (in i) using (A6) in line 10 of Algorithm 1. It enables showing (A8), using Π ⊥ 1 = I N −Π 1 . This enables showing (A9), similarly to Theorem 3. These identities can then be used to verify (by multiplying with Ω i ) that Substituting this formula for Ω −1 i into equation (A5) then reduces it to the pseudo-inverse version (35). As for equation (19b), the projection Π ⊥ 1 can again be omitted.