A Bayesian Approach to Multivariate Adaptive Localization in Ensemble-Based Data Assimilation with Time-Dependent Extensions

Ever since its inception, the Ensemble Kalman Filter has elicited many heuristic methods that seek to correct its deficiencies. One such method is localization: the idea that `nearby' variables should be highly correlated, while `far away' variables should not. Recognizing that correlation is a time-dependent property, adaptive localization is a natural extension of these heuristics. We propose a Bayesian approach to adaptive Schur-product localization for the DEnKF, and extend it to support multiple radii of influence. We test both the empirical validity of (multivariate) adaptive localization and of our approach. We consider a simple toy problem (Lorenz'96), extend it to a multivariate model, and study a more realistic geophysical problem (1.5-layer quasi-geostrophic). We show that the multivariate approach has great promise on the toy problem, and that the univariate approach leads to improved filter performance on the realistic geophysical problem.


Introduction
Data assimilation (Asch et al., 2016; Law et al., 2015; Evensen, 2009; Reich and Cotter, 2015) fuses information from the model forecast states (1) and observations (3) in order to obtain an improved estimate of the truth at any given point in time. Data assimilation approaches include the Ensemble Kalman Filters (EnKF) (Evensen, 1994, 2009; Constantinescu et al., 2007b) that rely on Gaussian assumptions, particle filters for non-Gaussian distributions (Van Leeuwen et al., 2015; Attia et al., 2017; Attia and Sandu, 2015), and variational approaches rooted in control theory (Le Dimet and Talagrand, 1986; Sandu et al., 2005; Carmichael et al., 2008).
The EnKF is an important family of data assimilation techniques that propagate both the mean and covariance of the state uncertainty (2) through the model (1) using a Monte Carlo approach. While large dynamical systems of interest have a large number of modes along which errors can grow, the number of ensemble members used to characterize uncertainty remains relatively small due to computational costs. As a result, inaccurate correlation estimates obtained through Monte Carlo sampling can profoundly affect the filter results. Techniques such as covariance localization and inflation have been developed to alleviate these problems (Petrie and Dance, 2010).
Localization techniques take advantage of the fundamental property of geophysical systems that correlations between variables decrease with spatial distance (Kalnay, 2003; Asch et al., 2016). This prior knowledge is used to scale ensemble-estimated covariances between distant variables so as to reduce inaccurate, spurious correlations. The prototypical approach to localization is a Schur-product-based tapering of the covariance matrix (Bickel et al., 2008); theoretical results ensure that covariance matrices estimated using small ensembles sampled from a multivariate normal probability distribution, upon tapering, quickly approach the true covariance matrix. A formal theory of localization has been attempted (Flowerdew, 2015), though a true multivariate theory is still out of reach. Practical implementations of localization rely on restricting the information flow, either in state space or in observation space, to take place within a given "radius of influence" (Hunt et al., 2007). Some variants of the EnKF, like the Maximum Likelihood Ensemble Filter (MLEF) (Zupanski, 2005), reduce the need for localization, while others use localization in order to efficiently parallelize the analysis cycles in space (Nino-Ruiz et al., 2015).
The performance of EnKF algorithms critically depends on the correct choice of localization radii (a.k.a. the decorrelation distances): values that are too large fail to correct for spurious correlations, while values that are too small discard important correlation information. However, the physical values of the spatial decorrelation scales are not known a priori, and they change with the temporal and spatial location. At the very least, the decorrelation scales depend on the current atmospheric flow. In atmospheric chemistry systems, because of the drastic differences in reactivity, each chemical species has its own individual localization radius (Constantinescu et al., 2007a). Multi-scale localization schemes (Buehner and Shlyaeva, 2015) are an immediate necessity. Clearly, the widely used approach of estimating decorrelation distances from historical ensembles of simulations, and then using a constant averaged value throughout the space and time domain, leads to suboptimal performance of ensemble Kalman filters.
Adaptive localization schemes seek to estimate decorrelation distances from the data, so as to optimize the filter performance according to some criterion. One approach to adaptive localization utilizes an ensemble of ensembles to detect and mitigate spurious correlations (Anderson, 2007). Relying purely on model dynamics and foregoing reliance on spatial properties of the model, the method is very effective for small-scale systems, but its applicability to large-scale geophysical problems is unclear. There is, however, evidence that optimal localization depends more on ensemble size and observation properties than on model dynamics (Kirchgessner et al., 2014), and that adaptive approaches whose correlation functions follow these dynamics do not show a significant improvement over conventional static localization (Bishop and Hodyss, 2011). Methods such as sampling error correction (Anderson, 2012) take advantage of these properties to build correction factors and apply them as an ordinary localization scheme. A different approach uses the inherent properties of correlation matrices to construct Smoothed ENsemble COrrelations Raised to a Power (SENCORP) (Bishop and Hodyss, 2007) matrices, which in the limiting case remove all spurious correlations. This method relies purely on the statistical properties of correlation matrices, and ignores the model dynamics and the spatial and temporal dependencies they define. A more theoretically sound approach (Ménétrier et al., 2015) constructs a localization matrix such that the end result better approximates the asymptotic infinite-ensemble covariance matrix. A recent approach uses machine learning algorithms to capture hidden properties of the propagation model that affect the localization parameters (Moosavi et al., 2018).
This work develops a Bayesian framework to dynamically learn the parameters of Schur-product-based localization from the ensemble of model states and the observations during data assimilation in geophysical systems. Specifically, the localization radii are considered random variables described by parametrized distributions, and are retrieved as part of the assimilation step together with the analysis states. One of the primary goals of this paper is to develop ways in which such an approach can be extended to both multivariate and time-dependent, 4D-like cases. We demonstrate the approach's empirical validity through a type of idealized variance minimization that has access to the true solution (which we call an oracle). We then show that the approach provides a more stable result with a much larger initial radius guess. The idea is explored on several test problems: the Lorenz'96 problem, a multivariate variant thereof introduced specifically for this paper, and a more realistic quasi-geostrophic model that showcases the applicability of the method to scenarios more in line with operational ones.
The paper is organized as follows. Sect. 2 reviews background material for Ensemble Kalman filtering and Schur-product localization. Sect. 4 describes the proposed theoretical framework for localization and the resulting optimization problems for maximum likelihood solutions. Numerical results with three test problems, reported in Sect. 5, provide empirical validation of the proposed approach.

Background
We consider a computational model that approximates the evolution of a physical dynamical system such as the atmosphere. The (finite-dimensional) state of the model x_i ∈ R^n at time t_i approximates the (finite-dimensional projection of the) physical true state x^t_i ∈ R^n. The computational model (1) is inexact, and we assume the model error to be additive and normally distributed. The initial state of the model is also not precisely known; to model this uncertainty we consider it a random variable drawn from a specified probability distribution (2). Consider also observations (3) of the truth, corrupted by normal observation errors η_{i+1} ∼ N(0, R_{i+1}). We consider here the case of a linear observation operator, H(·) := H.
Consider an ensemble of N states X ∈ R^{n×N} sampling the probability density that describes the uncertainty in the state at a given time moment (time subscripts are omitted for simplicity of notation). The ensemble mean, ensemble anomalies, and ensemble covariance are

  x̄ = (1/N) Σ_{k=1}^{N} x_k,   X = [x_1 − x̄, ..., x_N − x̄],   P = (1/(N−1)) X X^T,   (4)

respectively. Typically N is smaller than the number of positive Lyapunov exponents of the dynamical system, and much smaller than the number of states (Bergemann and Reich, 2010). Consequently, the ensemble statistics (4) are marred by considerable sampling errors. The elimination of sampling errors, manifested as spurious correlations in the covariance matrix (Evensen, 2009), leads to the need for localization.

Kalman analysis
The mean and the covariance are first propagated through the forecast step. Specifically, each ensemble member is advanced to the current time using the model (1) to obtain the ensemble forecast x^f (with mean x̄^f and covariance P^f) at the current time.
A covariance inflation step X^f ← αX^f, α > 1, can be applied to prevent filter divergence (Anderson, 2001). The mean and covariance are then updated in the analysis step, which fuses information from the forecast mean and covariance and from the observations (3) to provide an analysis ensemble x^a (with mean x̄^a and covariance P^a) at the same time. Here we consider the deterministic EnKF (DEnKF) (Sakov and Oke, 2008), which obtains the analysis as follows:

  x̄^a = x̄^f + K d,   (5a)
  X^a = X^f − (1/2) K H X^f,   (5b)

where K = P^f H^T (H P^f H^T + R)^{−1} is the Kalman gain matrix and d = y − H x̄^f is the vector of innovations. The DEnKF is chosen for simplicity of implementation and ease of amendment to support Schur-product-based localization. However, the adaptive localization techniques discussed herein are general: they do not depend on this choice and are equally applicable to any EnKF algorithm.
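For concreteness, the ensemble statistics (4) and the DEnKF analysis step (5) can be sketched as follows. This is a minimal NumPy illustration with our own function and variable names, not the implementation used for the experiments; localization and inflation are omitted.

```python
import numpy as np

def denkf_analysis(Xf, y, H, R):
    """One DEnKF analysis step (5), without localization or inflation.

    Xf : (n, N) forecast ensemble; y : (m,) observation vector;
    H  : (m, n) linear observation operator; R : (m, m) obs. error covariance.
    """
    n, N = Xf.shape
    xbar = Xf.mean(axis=1)            # forecast mean
    A = Xf - xbar[:, None]            # forecast anomalies
    Pf = A @ A.T / (N - 1)            # ensemble covariance, as in (4)
    S = H @ Pf @ H.T + R              # innovation covariance
    K = np.linalg.solve(S, H @ Pf).T  # Kalman gain  Pf H^T S^{-1}
    d = y - H @ xbar                  # innovations
    xbar_a = xbar + K @ d             # mean update (5a)
    Aa = A - 0.5 * K @ (H @ A)        # deterministic anomaly update (5b)
    return xbar_a[:, None] + Aa       # analysis ensemble
```

Because the mean update (5a) and anomaly update (5b) are decoupled, the localized covariance can later be substituted for Pf without perturbing observations.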

Schur product localization
Covariance localization involves the Schur (element-wise) product between a symmetric positive semi-definite matrix ρ and the ensemble forecast covariance:

  P^f ← ρ ∘ P^f.   (6)

By the Schur product theorem (Schur, 1911), if ρ and P^f are positive semi-definite, then their Schur product is positive semi-definite. The matrix ρ is chosen such that it reduces the sampling errors and brings the ensemble covariance closer to the true covariance matrix. We note that one can apply the localization in ways that mitigate the problem of storing full covariance matrices (Houtekamer and Mitchell, 2001). Efficient implementation aspects are not discussed further, as they do not impact the adaptive localization approaches developed herein.
We seek to generate the entries of the localization matrix ρ from a localization function ℓ : R_{≥0} → [0, 1] that represents the weight applied to each individual covariance entry. The function ℓ can be thought of as a regulator that brings the ensemble correlations in line with the physical correlations, which are often compactly supported functions (Gneiting, 2002). The metric d quantifies the physical distance between model variables, such that d(i, j) represents the spatial distance between the state-space variables x_i and x_j. The "radius of influence" parameter r represents the correlation spatial scale; the smaller r is, the faster variables decorrelate with increasing distance.
If the spatial discretization is time-invariant, and ℓ and r are fixed, then the matrix ρ is also time-invariant. The goal of the adaptive localization approach is to learn the best value of r dynamically from the ensemble.

A common localization function used in production software is due to Gaspari and Cohn (Gaspari and Cohn, 1999; Gneiting, 2002; Petrie, 2008). Here we use the Gaussian function, ℓ(d; r) = exp(−d²/(2r²)), to illustrate the adaptive localization strategy. However, our approach is general and can be used with any suitable localization function.
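As an illustration, the Gaussian localization function and the resulting matrix ρ for a one-dimensional cyclic domain (such as the Lorenz'96 ring used later) can be computed as follows; the helper names are our own.

```python
import numpy as np

def gaussian_loc(d, r):
    # Gaussian localization function ell(d; r), taking values in (0, 1]
    return np.exp(-d**2 / (2.0 * r**2))

def loc_matrix_cyclic(n, r):
    """Localization matrix rho for n variables on a ring, unit grid spacing."""
    i = np.arange(n)
    d = np.abs(i[:, None] - i[None, :])
    d = np.minimum(d, n - d)          # shortest cyclic distance
    return gaussian_loc(d, r)
```

The matrix is symmetric with a unit diagonal, and its entries decay monotonically with the cyclic distance between variables.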

Extension to multivariate localization functions
It is intuitively clear that different physical effects propagate spatially at different rates, leading to different correlation distances. Consequently, different state-space variables should be analyzed using different radii of influence. This raises the additional question of how to localize the covariance of two variables when each of them is characterized by a different radius of influence. One approach (Roh et al., 2015) is to use a multivariate compactly supported function (Askey, 1973; Porcu et al., 2013) to calculate the modified error statistics. We, however, wish to take advantage of already developed univariate compactly supported functions.
We define a mapping operator that assigns each of the n state-space components to one of g groups. All state variables assigned to the same group are analyzed using the same localization radius. Assuming a fixed mapping, the localization parameters are the radii values for each group, υ ∈ R^g. These values can be tuned independently during the adaptive algorithm. The corresponding prolongation operator p : R^g → R^n assigns one of the g radii values to each of the n state-space components. Setting g = 1 and p(υ) = υ 1_n recovers the univariate localization approach.
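The grouping and prolongation described above amount to a simple indexing operation; a sketch (with our own names) follows.

```python
import numpy as np

def prolongate(groups, upsilon):
    """Prolongation operator p: map g group radii to n per-variable radii.

    groups  : (n,) integer array with groups[i] in {0, ..., g-1}
    upsilon : (g,) localization radii, one per group
    """
    return np.asarray(upsilon)[np.asarray(groups)]
```

With a single group this reduces to the constant-radius univariate case, as noted in the text.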
Petrie (Petrie, 2008) showed that true square-root filters such as the LETKF (Hunt et al., 2007) are not amenable to Schur-product-based localization, and therefore need to rely on sequentially assimilating every observation-space variable. Here we wish to combine both the ease of use of Schur-product-based localization and the utility of multivariate localization techniques.
To this end, we introduce a commutative, idempotent, binary operation, m : R_{≥0} × R_{≥0} → R_{≥0}, that computes a "mean localization function value" so as to harmonize the effects of different values of the localization radii. More explicitly, given a ≤ b, we also impose the additional common-sense property that a ≤ m(a, b) ≤ b. We consider here several simple mean functions m, such as the minimum m_min(a, b) = min(a, b), the maximum m_max(a, b) = max(a, b), and the arithmetic mean m_mean(a, b) = (a + b)/2.   (10)

Assume that the variables x_i and x_j have the localization radii r_i and r_j, respectively. We extend the definition of the localization matrix ρ to account for multiple localization radii associated with different variables as follows:

  ρ_{i,j} = m( ℓ(d(i, j); r_i), ℓ(d(i, j); r_j) ).   (11)

The localization sub-matrix of the state-space variables x_i and x_j is a symmetric matrix for any mean function (10). When ℓ is a univariate compactly supported function, our approach implicitly defines its m-based multivariate counterpart.
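Assuming, as described above, that m combines the two univariate localization weights computed with each variable's own radius, the multivariate localization matrix can be sketched as follows; the Gaussian ℓ and all names are our own choices for illustration.

```python
import numpy as np

def ell_gauss(d, r):
    # Gaussian localization function; any suitable ell(d; r) could be used
    return np.exp(-d**2 / (2.0 * r**2))

def multivariate_loc_matrix(dist, radii, m):
    """rho[i, j] = m(ell(d(i,j); r_i), ell(d(i,j); r_j)).

    dist  : (n, n) pairwise distance matrix
    radii : (n,) per-variable localization radii
    m     : commutative mean function acting element-wise on two arrays
    """
    a = ell_gauss(dist, np.asarray(radii)[:, None])  # row variable's radius
    b = ell_gauss(dist, np.asarray(radii)[None, :])  # column variable's radius
    return m(a, b)

m_mean = lambda a, b: 0.5 * (a + b)
```

Commutativity of m guarantees symmetry of ρ, and idempotence guarantees that equal radii reduce to the univariate matrix.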

One of the common criticisms of a distance-based assumption on the correlations of geophysical systems is that two variables in close proximity to each other might have very weak correlations. For example, in a model that tracks both the temperature and the concentration of stationary cars at any given location, the two distinct types of information might not be correlated with each other at all. The physical distance between the two, however, is zero, and thus any single correlation function will take the value one and will not remove any spurious correlations. One can mitigate this problem by considering univariate localization functions for each pair of components, ℓ_{i,j} = ℓ_{j,i}, and extending the localization matrix accordingly. We will not analyze multiple localization functions in this paper.

Bayesian Approach
We denote the analysis step of the filter by

  x^a(υ) = A(x^f, y; υ),   (14)

where υ are the variable localization and inflation parameters. In this paper we look solely at varying localization, and keep inflation constant.
In the Bayesian framework, we consider the localization parameters to be random variables with an associated prior probability distribution. Specifically, we assume that each of the radii υ^{(j)} is distributed according to a univariate gamma distribution. The gamma probability density,

  π(υ) = (β^α / Γ(α)) υ^{α−1} e^{−βυ},

has mean ῡ = α/β and variance Var(υ) = α/β². We have chosen the gamma distribution because it is the maximum entropy probability distribution on (0, ∞) given fixed mean statistics (Singh et al., 1986), i.e., the best distribution one can choose without additional information. It is supported on the interval (0, ∞) and has exactly two free parameters (allowing control over both the mean and the variance).
The assimilation algorithm computes a posterior (analysis) probability density over the state space, taking into account the probabilities of observations and parameters. We start with Bayes' identity:

  π(x, υ | y) = π(y | x, υ) π(x | υ) π(υ) / π(y).   (16)

Note that π(y) is a constant scaling factor. Here π(υ) represents the (prior) uncertainty in the parameters, π(x | y, υ) represents the uncertainty in the state for a specific value of the parameters, and π(y | x, υ) represents the likelihood of the observations with respect to both state and parameters. Under Gaussian assumptions on the state and observation errors, equation (16) can be written out explicitly as (17), with f_P(υ) given by equation (15). Note that, once the analysis scheme (14) has been chosen, the analysis state becomes a function of υ, and the conditional probability in (17) represents the posterior probability of υ.
The negative log-likelihood of the posterior probability (17) yields the cost function (18). Through liberal use of the chain rule and the properties of symmetric positive semi-definite matrices, the gradient of the cost function with respect to the individual localization parameters can be derived, where we take advantage of the properties of symmetric matrices. Note that, without the assumption that the parameters are independent, the gradient would involve higher-order tensors.

Solving the optimization problem
Under the assumption that the analysis function (14) is based on the DEnKF (5), and under the (not-quite-correct) assumption that all ensemble members are i.i.d., we obtain (20) and (21). Combining (18), (20), and (21) leads to a cost function form (22) that only requires the computation of the background covariance matrix H P^f(υ) H^T in observation space, thereby reducing the amount of computation and storage. The equivalent manipulations of the gradient lead to expressions involving ∂ρ(υ)/∂υ^{(j)} ∘ P^f, so that calculation of the gradient (23) likewise requires computations only in observation space. One will note that the form of our cost function (22) is similar to that of other hybrid approaches to data assimilation (Bannister, 2017), such as 3DEnVar (Hamill and Snyder, 2000). Since much of the computation is shared, this technique could potentially be used as a preprocessing step in a larger hybrid data assimilation scheme.
The choice of the DEnKF is somewhat arbitrary. From the above, however, it is evident that a method that decouples the anomaly updates from the mean updates is likely to be more advantageous. A perturbed-observation EnKF does not have this property, and would thus incur significantly more computational effort to optimize the cost function. Extending this idea to a true square-root filter, like the ETKF, would require significant algebraic manipulation and heuristics that are outside the scope of this paper.
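To illustrate the structure of an observation-space cost of this kind, the following sketch evaluates a Gaussian innovation likelihood whose covariance is the localized background covariance projected into observation space, plus independent gamma negative log priors on the radii. This is our own hedged reconstruction of the general shape of (22), not its exact terms, and all names are illustrative.

```python
import numpy as np

def neg_log_posterior(upsilon, d, S_fn, R, alpha, beta):
    """Sketch: 0.5*(log det S + d^T S^{-1} d) plus Gamma(alpha, beta)
    negative log priors on each radius (up to additive constants).

    S_fn(upsilon) must return H (rho(upsilon) o P^f) H^T, i.e. the localized
    background covariance in observation space; d is the innovation vector.
    """
    S = S_fn(upsilon) + R
    _, logdet = np.linalg.slogdet(S)
    quad = d @ np.linalg.solve(S, d)
    prior = np.sum(-(alpha - 1.0) * np.log(upsilon) + beta * upsilon)
    return 0.5 * (logdet + quad) + prior
```

All linear algebra happens on m×m observation-space matrices, which is the computational saving the text emphasizes.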

Adaptive localization via a time-distributed cost function
We now seek to extend the 3D-Var-like cost function (18) to a time-dependent, 4D-Var-like version. As the ensemble is essentially a reduced-order model, we do not expect the accuracy of the forecast to hold over the long term as we might in traditional 4D-Var. We therefore propose a limited extension of the cost function that includes a small number K of additional future times.
Assuming that we are currently at the i-th time moment, the extended cost function reads as in (25). One will notice that in the 4D part of the cost function the future forecasts now depend on υ, as they are obtained by running the model from x^a_i(υ). Similar to some 4D ensemble approaches, the gradient computations can be approximated with the tangent linear model, the adjoint not being required. In fact, all that is required is matrix-vector products, which can be approximated with finite differences (Tranquilli et al., 2017).
Various 4D-type approximation strategies are also applicable to this cost function extension, though they are outside the scope of this paper.

Algorithm
In practice, instead of dealing with the gamma distribution parameters α and β, we use the parameters ῡ and Var(υ), such that α = ῡ²/Var(υ) and β = ῡ/Var(υ). For the sake of simplicity we assume that all υ are identically distributed, though this is not required for the algorithm to function. The initial guess for our minimization procedure is the vector of means. After minimizing the cost function, the radii for different components will differ. These radii, along with the corresponding localization function ℓ and mean function m, are used to build the localization matrix ρ. An outline is presented in Algorithm 1.
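The conversion between the moment parameters and the gamma shape/rate parameters is a one-liner:

```python
def gamma_from_moments(mean, var):
    """Convert (mean, variance) to Gamma(alpha, beta) shape/rate parameters:
    alpha = mean^2 / var, beta = mean / var, so that alpha/beta = mean
    and alpha/beta^2 = var."""
    alpha = mean**2 / var
    beta = mean / var
    return alpha, beta
```

This lets the user specify an initial radius guess and its uncertainty directly, as done in the experiments.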

Numerical Experiments and Results
In order to validate our methodology we carry out twin experiments under the assumption of identical, perfect dynamical systems for both the truth and the model. The analysis accuracy is measured by the spatio-temporally averaged root mean square error

  RMSE = (1/n_t) Σ_{i=1}^{n_t} sqrt( (1/n) ||x̄^a_i − x^t_i||² ),

where n_t is the number of data assimilation cycles (the number of analysis steps).
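The error measure above can be sketched as follows (array layout is our own choice):

```python
import numpy as np

def spatio_temporal_rmse(analyses, truths):
    """Average over n_t cycles of the spatial RMSE at each analysis time.

    analyses, truths : (n_t, n) arrays of analysis means and true states.
    """
    per_cycle = np.sqrt(np.mean((analyses - truths)**2, axis=1))
    return per_cycle.mean()
```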
For each of the models we repeat the experiments with different values of the inflation constant α and report the best (minimal RMSE) results.
All initial value problems used as test cases were independently implemented.

Oracles
We will make use of oracles to empirically evaluate the performance of the multivariate approach to Schur product localization.
An oracle is an idealized procedure that produces close-to-optimal results by making use of all the available information, some of which is unknown to the data assimilation system. In our case the oracle minimizes cost functions involving the true model state. Specifically, in an ideal filtering scenario one seeks to minimize the error of the analysis with respect to the truth, i.e., the cost function J(x^a) = RMSE(x^a − x^t). Our oracle picks the best parameters, in this case the radii, that minimize the ideal cost function J(υ) = RMSE(x^a(υ) − x^t). This can be viewed as an idealized variance minimization of the state space in parameter space.
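A minimal sketch of the oracle as a parameter search follows; the names are illustrative, and in an actual twin experiment `analysis_fn` wraps one full (localized) DEnKF analysis step.

```python
import numpy as np

def oracle_radius(candidate_radii, analysis_fn, x_truth):
    """Pick the radius minimizing the RMSE of the analysis against the truth.

    analysis_fn(r) -> analysis mean state obtained with localization radius r.
    The truth x_truth is, of course, unavailable outside twin experiments.
    """
    def rmse(x):
        return np.sqrt(np.mean((x - x_truth)**2))
    errors = [rmse(analysis_fn(r)) for r in candidate_radii]
    return candidate_radii[int(np.argmin(errors))]
```

The oracle is a diagnostic tool only: it bounds what any adaptive scheme could achieve, since no operational filter has access to x_truth.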

Lorenz'96
The 40-variable Lorenz model (Lorenz, 1996; Lorenz and Emanuel, 1998) is the first standard test employed. This problem is widely used in the testing of data assimilation algorithms.

Model Setup
The Lorenz'96 model equations

  dx_k/dt = (x_{k+1} − x_{k−2}) x_{k−1} − x_k + F,   k = 1, ..., n,   (27)

are obtained through a coarse discretization of a forced Burgers' equation with Newtonian damping (Reich and Cotter, 2015). We impose the cyclic boundary conditions x_0 = x_n, x_{−1} = x_{n−1}, and x_{n+1} = x_1, where n = 40. We take the canonical value of the external forcing factor, F = 8. Using known techniques for dynamical systems (Parker and Chua, 2012) one can calculate that this particular system has 13 positive Lyapunov exponents (Strogatz, 2014) and one zero exponent, with a fractal dimension of approximately 27.1.

The initial conditions used for the experiments are obtained by starting from a perturbed state and integrating (27) forward for one time unit in order to reach the attractor.
The physical distance between x_i and x_j is the shortest cyclic distance between the two state variables,

  d(i, j) = min(|i − j|, n − |i − j|),

where the distance between two neighbors is one.
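A sketch of the model tendency (27) and of the cyclic distance (function names are ours):

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Tendency of the canonical Lorenz'96 model (27) with cyclic indexing.
    np.roll(x, 1) gives x_{k-1}; np.roll(x, -1) gives x_{k+1}."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def cyclic_distance(i, j, n=40):
    """Shortest cyclic distance between variables i and j on the ring."""
    return min(abs(i - j), n - abs(i - j))
```

Note that x = F · 1_n is an (unstable) equilibrium, which is why a perturbation of it is integrated forward to reach the attractor.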
For the numerical experiments we consider a perfect model and noisy observations, and take a six-hour assimilation window.

Oracle Results
Figure 1 shows a visualization of multivariate oracle runs for the Lorenz'96 test problem using the best constant radius, the univariate oracle, and multivariate oracles utilizing the m-functions from (10). As one can see, the univariate oracle performs no better than a constant radius, while several of the multivariate approaches provide much better results. This indicates that the problem is better suited for multivariate localization. The worst m-function results are given by m_min. The other functions perform similarly, and in further experiments we will only consider m_mean as a representative and easy-to-implement option. We note that for Lorenz'96 with the DEnKF and our experimental setup, no 3D univariate adaptive localization scheme beats the best constant localization radius.
We also test both the validity of arbitrarily grouping the radii and the validity of using a time-distributed cost function (25). Figure 2 presents results for arbitrary radii groupings and for a limited-run 4D approach. There is significant benefit in using more radii groupings, but marginal benefit from the 4D approach.

Adaptive localization results for Lorenz'96 are shown in Figure 3. The optimal constant localization radius was found, and a search of possible input mean and variance values was performed around it for the adaptive case. As accurately predicted by the oracle results, a univariate adaptive approach for this model cannot do better than the best univariate radius (Figure 1), as no meaningful reduction in error was detected.

Multivariate Lorenz'96
The canonical Lorenz'96 model is ill-suited for multivariate adaptive localization, as each variable in the problem behaves identically to all the others. This means that for any univariate localization scheme a constant radius is close to optimal.

Model Setup
We modify the problem in such a way that the average behavior remains very similar to that of the original model, but the instantaneous behavior requires different localization radii. To accomplish this we use a time-dependent forcing function (30) that is different for each variable, where ω = 2π (in the context of Lorenz'96 the equivalent period is 5 days), i is the variable index, and q is an integer factor of n. Here we set q = 4. The different forcing values create differences between the dynamics of the different variables.
For each individual variable the forcing value cycles between 4 and 12, with an average value of 8, just as in the canonical Lorenz'96 formulation. If taken to be constant, a forcing factor of 4 makes the equation weakly chaotic with only one positive Lyapunov exponent, while a constant value of 12 gives the dynamics about 15 positive Lyapunov exponents. Our modified system still has the same average behavior, with 13 positive Lyapunov exponents. The mean doubling times of the two problems are also extremely similar, at around 0.36. This is the ideal desired behavior. Figure 4 shows a comparison of numerically computed covariance matrices for this modified problem, and showcases that an adaptive approach to the instantaneous covariance is required.
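The exact forcing function used in the paper is not recoverable from the text above; one concrete choice consistent with the stated properties (oscillation between 4 and 12, mean 8, ω = 2π, and q phase groups of variables) is the following purely illustrative sketch.

```python
import numpy as np

def forcing(i, t, q=4, omega=2 * np.pi):
    """HYPOTHETICAL forcing F_i(t): cycles between 4 and 12 with mean 8,
    with variables falling into q phase groups. The actual eq. (30) may
    assign phases differently; this only demonstrates the stated behavior."""
    phase = 2 * np.pi * (i % q) / q
    return 8.0 + 4.0 * np.sin(omega * t + phase)
```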
Data assimilation is carried out over the interval [500, 5500]; the rest of the problem setup is identical to that of the canonical Lorenz'96. Figure 5 shows results with the multivariate Lorenz'96 model, with g = q = 4 radii groups and the 4D parameter set to K = 1. As before, the results for the constant radius were optimal for each given inflation value, while the adaptive results were obtained through a search of possible means and variances around that value. The maximal improvement in error is only about 8%; however, this is far larger than that of the univariate Lorenz'96. In the canonical Lorenz'96 there is no meaningful choice of grouping other than arbitrary, but in this case the groups were chosen such that all related variables have the same forcing from equation (30). The results show a significant improvement over the univariate case, especially for low inflation values. This gives us insight into a potential way of choosing multivariate localization groups: based on some measure of the observability of any given state-space variable, similarly 'observable' state-space variables should have similar radii. We note that the filter spin-up takes significantly longer for the adaptive localization case than for the constant univariate case; consequently, the assimilation cycles are chosen in the time interval [500, 5500]. An idea to mitigate this might be to run the filter with a constant radius for a few assimilation cycles before switching to the adaptive localization strategy, so as to allow the filter to quickly catch up with the shadowing attractor trajectory.

Tightly coupled models like the multivariate Lorenz'96 have rapidly diverging solutions, and constraining them requires more information about the underlying dynamics. Incorporating future observations and adding degrees of freedom to the cost function increase the performance of our analysis. In the limiting case of one radius per variable and general information from the future, one approaches a variant of 4DEnVar, which is in principle superior to any pure filtering method.

Quasi-Geostrophic Model
The 1.5-layer quasi-geostrophic model of Sakov and Oke (Sakov and Oke, 2008), obtained by non-dimensionalizing potential vorticity, is given by the equations

  q_t = ψ_x − ε J(ψ, q) − A Δ³ψ + 2π sin(2πy),   q = Δψ − Fψ,   (31)

where J(ψ, q) = ψ_x q_y − ψ_y q_x. The variable ψ can be thought of as the stream function, and the spatial domain is the square (x, y) ∈ [0, 1]². The constants are F = 1600, ε = 10⁻⁵, and A = 2 × 10⁻¹¹. We use homogeneous Dirichlet boundary conditions.
A second-order central finite difference discretization of the Laplacian operator is performed over the interior of a 129×129 grid, leading to a model dimension n = 127² = 16,129. The time discretization is the canonical fourth-order explicit Runge-Kutta method with a timestep of one time unit. The Helmholtz equation on the right-hand side of (31) is solved by an offline pivoted sparse Cholesky decomposition. The Jacobian term J is calculated via the canonical Arakawa approximation (Arakawa, 1966; Ferguson, 2008). The Δ³ operator is implemented by repeated application of our discrete Laplacian.
The time between consecutive observations is 5 time units, and the model is run for 3300 such cycles. The first 300 cycles, corresponding to the filter spin-up, are discarded; the assimilation interval is therefore [300, 3300] time units. Observations are performed with a standard 300-component observation operator, as shown in Figure 7. An observation error variance of 4 is taken for each component. The physical distance between two components is defined as

  d(i, j) = sqrt( (i_x − j_x)² + (i_y − j_y)² ),

where (i_x, i_y) and (j_x, j_y) are the spatial grid coordinates of the state-space variables i and j, respectively.

Our rough estimate of the number of positive Lyapunov exponents of this model is 1,451, with a fractal dimension estimate of 6573.4; we therefore take a conservative 25 ensemble members, whose initial states are derived from a random sampling of a long run of the model. This model has been tested extensively with both the DEnKF and various localization techniques (Sakov and Bertino, 2011; Bergemann and Reich, 2010; Moosavi et al., 2018).

Adaptive Localization Results
The adaptive localization results for the quasi-geostrophic problem are shown in Figure 8. As before, the best constant univariate localization radius was calculated for each inflation value, and was used as a seed for varying the parameters in the adaptive case.
An improvement in RMSE of up to about 33% can be seen for certain inflation values.
One can readily see that for certain values of the inflation factor the adaptive localization procedure results in significant reductions in analysis error, while for other values no significant benefits are observed.
The empirical utility of the adaptive localization technique is further analyzed in Figure 9, which compares the error results of a suboptimal constant radius with those of an adaptive run with the mean parameter set to the same value as the constant one.
The adaptive results are, except in a few cases of filter divergence, always as good as or better than their constant localization counterparts, with an improvement in error as large as 33% when the mean radius equals the constant radius, and as much as 50% with different radii. Even in the case where the adaptive filter diverged, the constant localization filter diverged as well.
This indicates that our localization scheme is significantly better than a corresponding suboptimal constant scheme with the same parameters, which is the typical configuration in real-world production codes. This opens up the possibility of adapting existing systems that use a conservative suboptimal constant localization to an adaptive localization scheme. Figure 10 shows a sample of the radii obtained by the adaptive algorithm. The initial period of algorithm spin-up is clearly visible. One notes that the adaptive scheme can discern, on average, whether the observations or the model are to be trusted more.

Conclusions
This paper proposes a novel Bayesian approach to adaptive Schur product-based localization. A multivariate approach is developed, where multiple radii corresponding to different types of variables are taken into account. The Bayesian approach is solved by constructing 3D and 4D cost functions, and minimizing them to obtain the maximum a posteriori estimates of the localization radii. We show that in the case of the DEnKF these cost functions and their gradients are computationally inexpensive to evaluate, and can be relatively easily implemented within existing frameworks. We provide a new approach for assessing the performance of adaptive localization approaches through the use of restricted cost function oracles.
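The MAP machinery can be illustrated with a one-radius sketch: a Gaussian prior on the radius plus a Gaussian innovation likelihood under a Schur-localized forecast covariance, minimized with `scipy.optimize.minimize_scalar`. The Gaussian taper, the identity observation operator, the prior moments, and all the synthetic data below are stand-ins for the paper's actual cost functions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, n_ens = 40, 25
X = rng.standard_normal((n, n_ens))      # synthetic forecast anomalies
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # 1-D distances
y_innov = rng.standard_normal(n)         # innovations; H = I for brevity
R = np.eye(n)                            # observation-error covariance
r_mean, r_var = 6.0, 4.0                 # Gaussian prior on the radius

def cost(r):
    """Negative log-posterior of the radius: Gaussian prior term plus a
    Gaussian innovation likelihood under the Schur-localized covariance."""
    rho = np.exp(-0.5 * (d / r) ** 2)    # Gaussian taper stands in for the
                                         # paper's localization function
    S = rho * (X @ X.T) / (n_ens - 1) + R  # localized innovation covariance
    _, logdet = np.linalg.slogdet(S)
    return (0.5 * (r - r_mean) ** 2 / r_var
            + 0.5 * logdet
            + 0.5 * y_innov @ np.linalg.solve(S, y_innov))

r_map = minimize_scalar(cost, bounds=(1.0, 40.0), method="bounded").x
```

Because the cost depends on the radius only through the taper, its evaluation is cheap, which is what makes minimizing it inside each assimilation cycle practical.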
The adaptive localization approach is tested using the Lorenz'96 and the Quasi-Geostrophic models. Somewhat surprisingly, the adaptivity produces better results for the larger Quasi-Geostrophic problem. This may be due to the ensemble analysis anomaly independence assumption made in Sect. 4.1, an assumption that holds better for a large system with sparse observations than for a small tightly coupled system with dense observations. The performance of the adaptive approach on the small, coupled Lorenz'96 system is improved by using multivariate and 4D extensions of the cost function.
We believe that the algorithm presented herein has strong potential to improve existing geophysical data assimilation systems that use ensemble-based filters such as the DEnKF. In order to avoid filter divergence in the long term, such systems typically rely on conservative, suboptimal constant localization, which, as noted above, can be adapted to our scheme with only modest changes.

(Figure caption.) The localization scheme used is an adaptive oracle search that minimizes the error of the Schur-product localized analysis with respect to the truth. Each of the forty variables is given an independent radius in the multivariate case. The schemes (10) that closely mirror an unbiased mean, namely mmean, msqrt, mrms, and mharm, yield the best results, while the conservative scheme mmin performs no better than a univariate approach.

(Figure caption; groups of the model state components.) The assimilation cycle interval is [100, 1100]; a fixed inflation value of α = 1.02 and the function mmean are used. As can be seen, there is significant benefit in using more radii groupings, but marginal benefit from the 4D approach. The constant localization radius varies over [5, 16] in the constant case. The minimal errors are shown in the graph, and the corresponding radii are used as mean inputs into our adaptive algorithm. For the adaptive case we choose four arbitrary groupings of radii (g = 4) with the mean function mmean. The parameters ῡ take one of three possible values: the optimal constant radius −1, +0, or +1. The parameters Var(υ) take one value from {1/4, 1, 4}.
The 4D variable is set to K = 1 to look at one additional observation in the future.

Figure 6. For the configuration in Figure 5 with α = 1.02, ῡ = 6, and Var(υ) = 1. The axes show the assimilation step against the radius. Each unique marker type represents a unique radius grouping: red for the first group, green for the second, blue for the third, and black for the fourth. Groups two and four contain variables that were observed fully, and groups one and three contain variables that were observed partially. One will note the tendency of the fully observed radii to be less conservative in their deviation from the mean. The graph has been truncated to show only radii from 3 to 9 in order to better capture the general tendencies.

(Figure caption.) The inflation factor α is held constant, and we test α values from 1.02 to 1.16 (represented on the x-axis). The constant localization radius varies in increments of 5 over the range [5, 45]. Only the best results are plotted, and are used as the mean seeds for the adaptive algorithm. For the adaptive case we vary the parameter ῡ by taking one of three possible values, differing from the optimal constant radius by −5, 0, and +5. The adaptive Var(υ) takes one of the values 1/2, 1, 2, and 4. As can be seen, for higher values of inflation the adaptive localization approach performs remarkably well.

Figure 9. QUASI-GEOSTROPHIC MODEL ADAPTIVE LOCALIZATION RAW RMSE. A better way to assess how well the adaptive localization scheme works is to examine its consistency. The green line represents the same optimal constant localization radius as in Figure 8.
The red line represents the error for the radius 5 units below, and the blue line, had it not experienced filter divergence, would have represented the radius 5 units above. The correspondingly colored areas represent the ranges of error of the adaptive localization scheme obtained by fixing the mean, ῡ, to that of the constant scheme and ranging over the variance. As can be seen, the adaptive scheme generally outperformed the constant scheme for almost all ranges of the variance, and even managed to avoid filter divergence for some values of the variance in the +5 case, providing evidence that an overestimation of the variance, coupled with a useful estimate of the mean, may produce more useful results than the corresponding constant counterpart.

(Figure caption.) For the configuration in Figure 8 with α = 1.08, ῡ = 25, and Var(υ) = 4. Each dot represents a radius, with the line representing the mean. One will notice that during the spin-up time the algorithm is much more conservative with the radius of influence, signaling an over-reliance on the observations instead of the model prior. After the spin-up time, however, the algorithm tends to select radii greater than the mean, signifying a greater confidence in the observations.