The ensemble Kalman filter (EnKF) is a widely used ensemble-based assimilation method, which estimates the forecast error covariance matrix using a Monte Carlo approach that involves an ensemble of short-term forecasts. While the accuracy of the forecast error covariance matrix is crucial for achieving accurate forecasts, the estimate given by the EnKF needs to be improved using inflation techniques. Otherwise, the sampling covariance matrix of perturbed forecast states will underestimate the true forecast error covariance matrix because of the limited ensemble size and large model errors, which may eventually result in the divergence of the filter.
In this study, the forecast error covariance inflation factor is estimated using a generalized cross-validation technique. The improved EnKF assimilation scheme is tested on the atmosphere-like Lorenz-96 model with spatially correlated observations, and is shown to reduce the analysis error and increase the sensitivity of the analysis to the observations.
For state variables in geophysical research fields, a common assumption is that systems have “true” underlying states. Data assimilation is a powerful mechanism for estimating the true trajectory based on the effective combination of a dynamic forecast system (such as a numerical model) and observations (Miller et al., 1994). Data assimilation provides an analysis state that is usually a better estimate of the state variable because it considers all of the information provided by the model forecasts and observations. In fact, the analysis state can generally be treated as the weighted average of the model forecasts and observations, where the weights are approximately proportional to the inverse of the corresponding covariance matrices (Talagrand, 1997). Therefore, the performance of a data assimilation method relies significantly on whether the error covariance matrices are estimated accurately. If they are, the assimilation itself is computationally feasible thanks to the rapid development of supercomputers (Reichle, 2008), although finding the appropriate analysis state is a much more difficult problem when the models are nonlinear.
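The weighted-average view of the analysis state can be illustrated in the scalar case. The sketch below is only an illustration: the function name `scalar_analysis` and all numerical values are assumptions, and a single directly observed state variable stands in for the full vector problem.

```python
def scalar_analysis(x_f, var_f, y_o, var_o):
    """Inverse-variance weighted average of a scalar forecast and observation."""
    w_f = (1.0 / var_f) / (1.0 / var_f + 1.0 / var_o)  # weight on the forecast
    w_o = 1.0 - w_f                                     # weight on the observation
    x_a = w_f * x_f + w_o * y_o                         # analysis state
    var_a = 1.0 / (1.0 / var_f + 1.0 / var_o)           # analysis error variance
    return x_a, var_a, w_o


# Illustrative numbers only: the forecast error variance is four times
# the observation error variance, so the observation gets most of the weight.
x_a, var_a, w_o = scalar_analysis(x_f=1.0, var_f=4.0, y_o=2.0, var_o=1.0)
print(x_a, var_a, w_o)
```

If the forecast error variance is underestimated, `w_o` shrinks and the analysis moves toward the forecast, which is the failure mode that covariance inflation is meant to counter.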
The ensemble Kalman filter (EnKF) is a practical ensemble-based assimilation scheme that estimates the forecast error covariance matrix using a Monte Carlo method with the short-term ensemble forecast states (Burgers et al., 1998; Evensen, 1994). Because of the limited ensemble size and large model errors, the sampling covariance matrix of the ensemble forecast states usually underestimates the true forecast error covariance matrix. As a result, the filter relies too heavily on the model forecasts and effectively excludes the observations, which can eventually result in the divergence of the filter (Anderson and Anderson, 1999; Constantinescu et al., 2007; Wu et al., 2014).
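The link between an underestimated sampling variance and an over-reliance on the forecast can be sketched with a scalar perturbed-observation EnKF update. This is a minimal sketch, not the scheme used in this study: the function `enkf_update_scalar`, the directly observed scalar state and all numbers are assumptions made for illustration.

```python
import math
import random


def enkf_update_scalar(forecast, y_o, var_o, rng):
    """One perturbed-observation EnKF update for a scalar state observed directly."""
    m = len(forecast)
    mean_f = sum(forecast) / m
    # Sampling (ensemble) estimate of the forecast error variance.
    var_f = sum((x - mean_f) ** 2 for x in forecast) / (m - 1)
    gain = var_f / (var_f + var_o)  # scalar Kalman gain
    # Each member assimilates its own perturbed copy of the observation.
    analysis = [
        x + gain * (y_o + rng.gauss(0.0, math.sqrt(var_o)) - x) for x in forecast
    ]
    return analysis, gain


rng = random.Random(0)
wide = [rng.gauss(0.0, 1.0) for _ in range(30)]  # well-spread ensemble
narrow = [0.1 * x for x in wide]                 # same members, collapsed spread
_, gain_wide = enkf_update_scalar(wide, y_o=1.0, var_o=1.0, rng=rng)
_, gain_narrow = enkf_update_scalar(narrow, y_o=1.0, var_o=1.0, rng=rng)
print(gain_wide, gain_narrow)
```

The collapsed ensemble yields a much smaller gain, so the observation barely changes the analysis.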
The covariance inflation technique is used to mitigate filter divergence by inflating the empirical covariance in the EnKF, and it can increase the weight of the observations in the analysis state (Xu et al., 2013). In effect, this method perturbs the subspace spanned by the ensemble vectors and better captures the sub-growing directions that may not have been captured by the original ensemble (Yang et al., 2015). Therefore, using the inflation technique to improve the accuracy of the estimated forecast error covariance matrix is increasingly important.
A widely used inflation technique involves multiplying the forecast error covariance matrix by an inflation factor, which must be chosen appropriately. In early studies, researchers usually tuned the inflation factor by repeated assimilation experiments and selected it according to their experience and prior knowledge (Anderson and Anderson, 1999). However, such methods are very empirical and subjective, and it is not appropriate to use the same inflation factor throughout the assimilation procedure. An inflation factor that is too small or too large will cause the analysis state to rely too heavily on the model forecasts or on the observations, respectively, and can seriously undermine the accuracy and stability of the filter.
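Multiplying the sample covariance by a factor is equivalent to scaling the ensemble perturbations about their mean by the square root of that factor. The sketch below illustrates this under stated assumptions: the helper names `inflate_ensemble` and `sample_variance` are hypothetical, the ensemble is a plain list of state vectors, and the numbers are arbitrary.

```python
import math


def inflate_ensemble(members, lam):
    """Scale perturbations by sqrt(lam) so the sample covariance scales by lam."""
    n = len(members)
    dim = len(members[0])
    mean = [sum(m[k] for m in members) / n for k in range(dim)]
    scale = math.sqrt(lam)
    return [[mean[k] + scale * (m[k] - mean[k]) for k in range(dim)] for m in members]


def sample_variance(members, k):
    """Unbiased sample variance of component k across the ensemble."""
    n = len(members)
    mean_k = sum(m[k] for m in members) / n
    return sum((m[k] - mean_k) ** 2 for m in members) / (n - 1)


# Four members of a two-variable state; values are arbitrary.
ensemble = [[0.8, 2.1], [1.1, 1.9], [0.9, 2.4], [1.3, 2.0]]
inflated = inflate_ensemble(ensemble, lam=1.88)
ratio = sample_variance(inflated, 0) / sample_variance(ensemble, 0)
print(ratio)  # close to the chosen factor
```

The ensemble mean is left unchanged, so only the estimated uncertainty, and hence the weight given to the observations, is affected.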
In later studies, the inflation factor is estimated online based on the innovation statistic (observation-minus-forecast; Dee, 1995; Dee and Silva, 1999) under different conditions. Moment estimation can facilitate the calculation by solving an equation of the innovation statistic and its realization (Li et al., 2009; Miyoshi, 2011; Wang and Bishop, 2003). The maximum likelihood approach can obtain a better estimate of the inflation factor than the moment approach, although it must calculate a high-dimensional matrix determinant (Liang et al., 2012; Zheng, 2009). The Bayesian approach assumes a prior distribution for the inflation factor but is limited to spatially independent observational errors (Anderson, 2007, 2009). This study seeks to address the estimation of the inflation factor from the perspective of cross-validation (CV).
The concept of CV was first introduced for linear regressions (Allen, 1974) and spline smoothing (Wahba and Wold, 1975), and it represents a common approach that can be applied to estimate tuning parameters in generalized additive models, nonparametric regressions and kernel smoothing (Eubank, 1999; Gentle et al., 2004; Green and Silverman, 1994; Wand and Jones, 1995). Usually, the data are divided into subsets, some of which are used for modeling and analysis while the others are used for verification and validation. The most widely used technique removes only one data point and uses the remainder to estimate the value at this point to test the estimation accuracy; this is called leave-one-out cross-validation (Gu and Wahba, 1991).
The basic motivation behind CV is to minimize the prediction error at the sampling points. Generalized cross-validation (GCV) is a modified form of ordinary CV that has been found to possess several favorable properties and is more popular for selecting tuning parameters (Craven and Wahba, 1979). For instance, Gu and Wahba (1991) applied Newton's method to optimize the GCV score with multiple smoothing parameters in a smoothing spline model. Wahba et al. (1995) briefly reviewed the properties of the GCV and conducted an experiment to choose smoothing parameters in the context of variational data assimilation schemes with numerical weather prediction models. Zheng and Basher (1995) also applied the GCV in a thin-plate smoothing spline model of spatial climate data to deal with South Pacific rainfall.
The GCV criterion is based on a predictive mean-square-error criterion that attempts to obtain a best estimate (Wahba et al., 1995). It is invariant under orthogonal transformations (rotations) of the observations and is a consistent estimate of the relative loss (Gu, 2002). For inverse problems in fields such as meteorological data assimilation, the GCV method can choose parameters systematically by minimizing a given objective function, which improves the assimilation results. In particular, it can select parameters that reflect not only measurement accuracies from different sources but also model capability (Krakauer et al., 2004).
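How a GCV score might be used to pick an inflation factor can be sketched in a deliberately simplified setting. The sketch assumes that every state variable is observed directly and that the forecast and observation error covariances are diagonal, so the influence matrix is diagonal as well; the score is the generic Craven and Wahba form applied to normalized residuals, not necessarily the exact GCV function of this study. The names `gcv_score` and `estimate_inflation`, the candidate grid and the synthetic innovations are all assumptions.

```python
import random


def gcv_score(lam, innov, var_f, var_o):
    """GCV score for inflation factor lam (H = I, diagonal covariances assumed)."""
    n = len(innov)
    resid_sq = 0.0
    trace_term = 0.0
    for d, pf, r in zip(innov, var_f, var_o):
        one_minus_a = r / (lam * pf + r)        # 1 - A_ii, A being the influence matrix
        resid_sq += (one_minus_a * d) ** 2 / r  # squared normalized analysis residual
        trace_term += one_minus_a               # contribution to trace(I - A)
    return (resid_sq / n) / (trace_term / n) ** 2


def estimate_inflation(innov, var_f, var_o, grid):
    """Return the candidate factor on the grid with the smallest GCV score."""
    return min(grid, key=lambda lam: gcv_score(lam, innov, var_f, var_o))


rng = random.Random(1)
n = 40
var_f = [0.2 + 0.05 * k for k in range(n)]  # heterogeneous ensemble variances (made up)
var_o = [1.0] * n
lam_true = 2.0                              # synthetic inflation used to draw innovations
innov = [rng.gauss(0.0, (lam_true * pf + r) ** 0.5) for pf, r in zip(var_f, var_o)]
grid = [0.1 * j for j in range(1, 101)]     # candidate factors 0.1, 0.2, ..., 10.0
lam_hat = estimate_inflation(innov, var_f, var_o, grid)
print(lam_hat)
```

With a single realization of 40 innovations the estimate is noisy, which is in line with an online estimate that fluctuates from one assimilation step to the next. In this diagonal setting the score is flat in the inflation factor when the ratio of forecast to observation variance is the same for every variable, which is why heterogeneous variances are used here.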
This study proposes a new method for choosing the inflation factor using the GCV method. The suitability of this choice is assessed using a statistic known as the analysis sensitivity, which apportions uncertainty in the output to different sources of uncertainty in the input (Saltelli et al., 2004, 2008). In the context of statistical data assimilation, this quantity describes the sensitivity of the analysis to the observations, which is complementary to the sensitivity of the analysis to model forecasts (Cardinali et al., 2004; Liu et al., 2009).
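The effect of inflation on the analysis sensitivity can be sketched in the same simplified diagonal setting, where each observation's self-sensitivity is the corresponding diagonal entry of the influence matrix. The averaged quantity below is taken as the trace of the influence matrix divided by the number of observations; this definition, the function name `average_observation_influence` and the variances are assumptions for illustration.

```python
def average_observation_influence(lam, var_f, var_o):
    """Mean diagonal of the influence matrix for H = I and diagonal covariances."""
    diag = [lam * pf / (lam * pf + r) for pf, r in zip(var_f, var_o)]
    return sum(diag) / len(diag)


# Made-up variances: a collapsed ensemble (0.1) against unit observation error.
var_f = [0.1] * 40
var_o = [1.0] * 40
without_inflation = average_observation_influence(1.0, var_f, var_o)
with_inflation = average_observation_influence(1.88, var_f, var_o)
print(without_inflation, with_inflation)
```

Each diagonal entry lies between 0 and 1, and one minus that entry is the complementary sensitivity to the forecast, so a larger inflation factor shifts influence from the forecast to the observations.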
This study focuses on a methodology that can potentially be applied to geophysical applications of data assimilation in the near future. The remainder of this paper is organized as follows. The conventional EnKF scheme is summarized and the improved EnKF with the GCV inflation scheme is proposed in Sect. 2; the verification and validation processes are conducted on an idealized model in Sect. 3; the discussion is presented in Sect. 4; and conclusions are given in Sect. 5.
For consistency, a nonlinear discrete-time dynamical forecast model and
linear observation system can be expressed as follows (Ide et al., 1997):
Suppose the perturbed analysis state at a previous time step
The perturbed forecast states are generated by running dynamical model
forward:
The analysis state is estimated by minimizing the following cost function:
The forecast error inflation procedure should be added to any ensemble-based
assimilation scheme to prevent the filter from diverging (Anderson and
Anderson, 1999; Constantinescu et al., 2007). Multiplicative inflation is one
of the commonly used inflation techniques, and it adjusts the initially
estimated forecast error covariance matrix
In this study, a new procedure for estimating multiplicative inflation
factors
Figure 1. Flowchart of the proposed assimilation scheme.
The inflation factor
In the EnKF, the analysis state (Eq. 6) is a weighted average of the
observation and forecast. That is
The elements of the matrix
In fact, the sensitivity matrix
In the conventional EnKF, the forecast error covariance matrix
The spread of the forecast ensemble at the
In the following experiments, the “true” state
The proposed data assimilation scheme was tested using the Lorenz-96 model (Lorenz, 1996) with model errors and a linear observation system as a test bed. The performances of the assimilation schemes described in Sect. 2 were evaluated via the following experiments.
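The test bed can be sketched with the standard cyclic Lorenz-96 tendency and a fourth-order Runge–Kutta step. This is a minimal sketch rather than the code used in this study: the function names are hypothetical, 40 variables are used to match the 40-observation experiments, and the forcing value of 8 is a commonly used choice that is assumed here and is not taken from the text.

```python
def lorenz96_tendency(x, forcing):
    """dX_k/dt = (X_{k+1} - X_{k-2}) X_{k-1} - X_k + F with cyclic indices."""
    n = len(x)
    # Negative indices wrap around in Python, which gives the cyclic boundary.
    return [(x[(k + 1) % n] - x[k - 2]) * x[k - 1] - x[k] + forcing for k in range(n)]


def rk4_step(x, dt, forcing):
    """Advance the state by one fourth-order Runge-Kutta step of length dt."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], forcing)
    k3 = lorenz96_tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], forcing)
    k4 = lorenz96_tendency([xi + dt * ki for xi, ki in zip(x, k3)], forcing)
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


FORCING = 8.0          # assumed value, not taken from the text
DT = 0.05              # non-dimensional time step
state = [FORCING] * 40
state[19] += 0.01      # small perturbation of the steady state X_k = F
for _ in range(100):
    state = rk4_step(state, DT, FORCING)
print(state[:3])
```

Starting from a slightly perturbed steady state is one common way to spin the model up before generating a "true" trajectory.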
The Lorenz-96 model (Lorenz, 1996) is a quadratic nonlinear dynamical system
that has properties relevant to realistic forecast problems and is governed
by the equation
The true state is derived by a fourth-order Runge–Kutta time integration
scheme (Butcher, 2003). The time step for generating the numerical solution
was set at 0.05 non-dimensional units, which is roughly equivalent to 6 h in
real time, assuming that the characteristic timescale of the dissipation in
the atmosphere is 5 days (Lorenz, 1996). The forcing term was set as
In this study, the synthetic observations were generated by
adding random noise that was multivariate normally distributed with mean
zero and covariance matrix
Because model errors are inevitable in practical dynamical forecast models,
it is reasonable to add model errors to the Lorenz-96 model in the
assimilation process. The Lorenz-96 model is a forced dissipative model with
a parameter
The Lorenz-96 model was run for 2000 time steps, which is equivalent to approximately 500 days in realistic problems. The synthetic observations were assimilated at every grid point and every 4 time steps using the conventional EnKF, the constant inflated EnKF and the improved EnKF schemes for comparison. The time series of estimated inflation factors is shown in Fig. 2. The estimated inflation factors vary between 1 and 6 in most instances, although values smaller than 1 are estimated at several assimilation time steps. The median of the estimated inflation factors was 1.88, which was used as the inflation factor in the constant inflated EnKF scheme. Since the median is a robust statistic of the central tendency, this choice ensures a relatively fair comparison between the constant inflated EnKF and the improved EnKF schemes.
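The experiment length and the choice of the constant factor can be checked with a few lines of arithmetic. The series `factors` below is hypothetical and only shows how a median would be taken; it is not the estimated series of this study.

```python
import statistics

n_steps = 2000
hours_per_step = 6      # one step of 0.05 time units is taken as roughly 6 h
assim_interval = 4      # observations are assimilated every 4 time steps

days = n_steps * hours_per_step / 24    # simulated length in days
n_cycles = n_steps // assim_interval    # number of assimilation time steps
print(days, n_cycles)

# Hypothetical factor series, only to show how the constant factor is chosen.
factors = [0.8, 1.5, 2.0, 3.0, 6.0]
constant_factor = statistics.median(factors)
print(constant_factor)
```

The median is insensitive to the occasional very large or very small estimate, which is the reason it is preferred here over the mean.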
Figure 2. Time series of the inflation factors estimated by minimizing the GCV function. The median of the estimated inflation factors is 1.88.
The forecast ensemble spreads of the conventional EnKF, constant inflated EnKF and improved EnKF are plotted in Fig. 3. For the conventional EnKF, because the forecast ensemble members tend to collapse toward one another, the forecast ensemble spread was quite small, with a mean value of 0.36. The mean spread of the improved EnKF was 3.32, which was larger than that of the constant inflated EnKF (3.25). These findings illustrate that the underestimation of the forecast ensemble spread can be effectively compensated for by the two EnKF schemes with forecast error inflation and that the improved EnKF is more effective than the constant inflated EnKF.
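One common way to compute an ensemble spread is the square root of the ensemble variance averaged over the state variables. The sketch below uses that definition as an assumption, since the spread formula of this study is not reproduced here; the function name `ensemble_spread` and the tiny ensemble are illustrative.

```python
import math


def ensemble_spread(members):
    """Square root of the ensemble variance averaged over all state variables."""
    n = len(members)
    dim = len(members[0])
    total_var = 0.0
    for k in range(dim):
        mean_k = sum(m[k] for m in members) / n
        total_var += sum((m[k] - mean_k) ** 2 for m in members) / (n - 1)
    return math.sqrt(total_var / dim)


# Two members of a two-variable state; values are arbitrary.
members = [[0.0, 0.0], [2.0, 2.0]]
spread = ensemble_spread(members)
print(spread)
```

Under this definition, inflating the covariance by a factor multiplies the spread by the square root of that factor.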
Figure 3. Forecast ensemble spread of the conventional EnKF (black line), the constant inflated EnKF (red line) and the improved EnKF (blue line) for the Lorenz-96 experiment with 40 observations and 30 ensemble members. The constant multiplicative inflation factor is set to 1.88.
To evaluate the analysis sensitivity, the GAI statistics (Eq. 16) were calculated, and the results are plotted in Fig. 4. The GAI value increases from about 10 % for the conventional EnKF to about 30 % for the improved EnKF, indicating that the latter relies more on the observations. This is important because the observations can then play a significant role when they are combined with the model forecasts to generate the analysis state. Apart from small fluctuations, the GAI of the constant inflated EnKF had a mean value of 27.80 %, which was smaller than that of the improved EnKF.
Figure 4. GAI statistics of the conventional EnKF (black line), the constant inflated EnKF (red line) and the improved EnKF (blue line) for the Lorenz-96 experiment with 40 observations and 30 ensemble members. The constant multiplicative inflation factor is set to 1.88.
To evaluate the accuracy of the analysis estimate, the analysis RMSE (Eq. 18) and the corresponding values of the GCV functions (Eq. 9) were calculated and plotted in Figs. 5 and 6, respectively. The results illustrate that the analysis RMSE and the values of the GCV functions decrease sharply for the two EnKF schemes with forecast error inflation. Moreover, the GCV function and the RMSE values of the improved EnKF were about 15 % smaller than those of the constant inflated EnKF, indicating that the online estimation method performs better than simple multiplicative inflation with a constant value. The correlation coefficient between the analysis RMSE and the value of the GCV function at the assimilation time steps was approximately 0.76, which indicates that the GCV function is a good criterion for estimating the inflation factor.
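The two diagnostics used here can be sketched as follows, assuming the usual definitions: the RMSE is taken over the state variables at one time step, and the correlation is the Pearson coefficient between two time series. The function names and the short series are hypothetical and are not results from this study.

```python
import math


def rmse(analysis, truth):
    """Root-mean-square error between an analysis state and the true state."""
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(analysis, truth)) / len(truth))


def pearson(u, v):
    """Pearson correlation coefficient between two equally long series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Made-up numbers for illustration only.
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
rmse_series = [0.9, 1.4, 1.1, 2.0, 1.6]
gcv_series = [1.1, 1.9, 1.2, 2.4, 2.5]
corr = pearson(rmse_series, gcv_series)
print(err, corr)
```

A correlation close to 1 would mean that time steps with a low GCV value also tend to have a low analysis error, which is the property that makes the GCV function usable as a selection criterion when the true state is unknown.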
Figure 5. Analysis RMSE of the conventional EnKF (black line), the constant inflated EnKF (red line) and the improved EnKF (blue line) for the Lorenz-96 experiment with 40 observations and 30 ensemble members. The constant multiplicative inflation factor is set to 1.88.
Figure 6. GCV function values of the conventional EnKF (black line), the constant inflated EnKF (red line) and the improved EnKF (blue line) for the Lorenz-96 experiment with 40 observations and 30 ensemble members. The constant multiplicative inflation factor is set to 1.88.
The ensemble analysis state members of the conventional EnKF, constant inflated EnKF and improved EnKF are shown in Fig. 7, together with the true trajectory obtained by the numerical solution; the scatter of the members indicates, to some extent, the uncertainty of the analysis state. The figure shows that the difference between the true trajectory and the ensemble analysis state members was larger for the conventional EnKF than for the improved EnKF and the constant inflated EnKF. In addition, the analysis state was more consistent with the true trajectory for the improved EnKF than for the constant inflated EnKF. Therefore, the GCV inflation can lead to a more accurate analysis state than the simple constant inflation.
Figure 7. Ensemble analysis state members of the conventional EnKF (black line), the constant inflated EnKF (red line) and the improved EnKF (blue line) for the Lorenz-96 experiment with 40 observations and 30 ensemble members. The constant multiplicative inflation factor is set to 1.88. The green line refers to the true trajectory obtained by the numerical solution.
The time-mean values of the forecast ensemble spread, the GAI statistics, the GCV functions and the analysis RMSE over 2000 time steps are listed in Table 1. These results illustrate that forecast error inflation using the GCV function performs better than the constant inflated EnKF: it increases the analysis sensitivity to the observations and reduces the analysis RMSE.
Table 1. Time-mean values of the forecast ensemble spread, GAI statistics, GCV functions and analysis RMSE over 2000 time steps, as well as the running times (in seconds) for different assimilation schemes. The number of observations is 40 and the ensemble size is 10, 30 or 50.
Intuitively, for any ensemble-based assimilation scheme, a large ensemble size will lead to small analysis errors; however, the computational costs are high for practical problems. The ensemble size in practical land surface assimilation problems is usually several tens of members (Kirchgessner et al., 2014). The performance of the proposed inflation method and the constant inflation method with respect to different ensemble sizes (10, 30 and 50) was evaluated, and the results are listed in Table 1. For each scheme, using a 10-member ensemble produced a 3-fold increase in the analysis RMSE, while using a 50-member ensemble reduced the analysis RMSE by 20 % relative to the analysis RMSE obtained using a 30-member ensemble. The forecast ensemble spread increased slightly from a 10-member ensemble to a 50-member ensemble. The GAI and GCV function values changed sharply from a 10-member ensemble to a 30-member ensemble, and they became relatively stable from a 30-member ensemble to a 50-member ensemble. Ensembles with fewer than 10 members were unstable, and no significant changes occurred for ensembles with more than 50 members. Considering the computational costs for practical problems, a 30-member ensemble may be needed for the Lorenz-96 model to yield statistically robust results. In realistic problems, a system in which the errors grow in multiple directions will need more ensemble members to produce statistically robust results.
To evaluate the performance of the inflation method with respect to different numbers of observations, synthetic observations were generated at every other grid point and every 4 time steps. Hence, a total of 20 observations were available at each observation step in this case. The assimilation results with ensemble sizes of 10, 30 and 50 are listed in Table 2, which shows that the GAI values were larger than those with 40 observations in all assimilation schemes. This finding may be related to the relatively small denominator of the GAI statistic (Eq. 16) in the 20-observation experiments. The forecast ensemble spread does not change much, but the GCV function and the RMSE values increase greatly in the 20-observation experiments with respect to those in the 40-observation experiments, which illustrates that more observations lead to a smaller analysis error.
Table 2. Same as Table 1 but for 20 observations.
Accurate estimates of the forecast error covariance matrix are crucial to the success of any data assimilation scheme. In the conventional EnKF assimilation scheme, the forecast error covariance matrix is estimated as the sampling covariance matrix of the ensemble forecast states. However, limited ensemble size and large model errors often cause the matrix to be underestimated, which produces an analysis state that relies too heavily on the forecast and effectively excludes the observations. This can eventually cause the filter to diverge. Therefore, forecast error inflation with a properly chosen inflation factor is increasingly important.
The use of multiplicative covariance inflation techniques can mitigate this problem to some extent. Several methods have been proposed in the literature, and each has different assumptions. For instance, the moment approach can be easily conducted based on the moment estimation of the innovation statistic. The maximum likelihood approach can obtain a more accurate inflation factor than the moment approach, but requires computing high-dimensional matrix determinants. The Bayesian approach assumes a prior distribution for the inflation factor but is limited to spatially independent observational errors. In this study, the inflation factor was estimated based on cross-validation and the analysis sensitivity was examined. The inflation factor estimated by minimizing the GCV function is not affected by the observation units and can optimize the analysis sensitivity to the observations.
In fact, the GCV method is a widely used statistical method for evaluating and comparing learning algorithms. It can be applied to inverse problems in fields such as meteorological data assimilation (Wahba et al., 1995). Specifically, GCV provides a well-characterized, rotation-invariant method for selecting a regularization parameter by minimizing the predictive data errors in a least-squares solution (MacCarthy et al., 2011). In data assimilation research, observation data such as in situ observations and remote sensing data usually come from different sources. GCV is particularly useful for choosing relative parameters that reflect not only measurement accuracies from different sources but also model capability (Krakauer et al., 2004). However, the GCV method requires calculating the trace of a large matrix, which may be computationally prohibitive for large inverse problems (MacCarthy et al., 2011).
In this study, the GCV concept was adopted for the inflation factor estimation in the improved EnKF assimilation scheme and was validated with the Lorenz-96 model. The assimilation results showed that inflating the conventional EnKF using the factor estimated by minimizing the GCV function can indeed reduce the analysis RMSE. Therefore, the GCV function can accurately quantify the goodness of fit of the error covariance matrix. The values of the GCV function clearly decreased in the proposed approach compared with the conventional EnKF and constant inflated EnKF schemes. The analysis RMSE of the proposed approach was also much smaller than those of the conventional EnKF and constant inflated EnKF schemes, which suggests that the GCV criterion works well for estimating the inflation factor.
The analysis sensitivities in the proposed approach and in the conventional EnKF scheme were also investigated in this study. The time-averaged GAI statistic increases from about 10 % in the conventional EnKF scheme to about 30 % using the proposed inflation method. This illustrates that the inflation mitigates the problem of the analysis depending excessively on the forecast and excluding the observations. As a result, the relationship of the analysis state to the forecast state and the observations is more reasonable.
The highest computational cost when minimizing the GCV function is related to
calculating the influence matrix
For the Lorenz-96 experiments in this study, the conventional EnKF, constant inflated EnKF and proposed improved EnKF assimilation schemes were run using the R language on a computer with an Intel Core i5 CPU and 8 GB RAM. The running times with different observation numbers and ensemble sizes are listed in Tables 1 and 2. For each assimilation scheme, the computational cost increases as the ensemble size grows. For a fixed observation number and ensemble size, the conventional EnKF, which does not involve forecast error inflation, has the shortest running time but at the cost of assimilation accuracy. The proposed EnKF scheme yields an analysis RMSE about 15 % smaller than that of the constant inflated EnKF scheme, while its running time is only about 5 % longer. For operational meteorological and ocean models, most of the computational cost lies in the ensemble model integrations (Ravazzani et al., 2016). Therefore, the proposed EnKF scheme does not significantly increase the computational cost.
It is worth noting that the inflation factor is assumed to be constant in space in this study, which may not be the case in realistic assimilation problems. Forcing all components of the state vector to use the same inflation factor could systematically overinflate the ensemble variances in sparsely observed areas, especially when the observations are unevenly distributed. In the presence of sparse observations, the part of the state that is not observed can be improved only through the physical mechanism of the forecast model, and this improvement is limited. Therefore, multiplicative inflation alone may not be sufficiently effective to enhance the assimilation accuracy. In this case, additive inflation and the localization technique can be applied to further improve the assimilation quality (Miyoshi and Kunii, 2011; Yang et al., 2015).
In this study, an approach that uses GCV as a metric to estimate the covariance inflation factor was proposed. In the case studies conducted in Sect. 3, the observations were relatively evenly distributed and the assimilation accuracy could indeed be improved by the forecast error inflation technique. These findings provide insight into the methodology and its validation with the Lorenz-96 model, and illustrate the feasibility of our approach. In the near future, we will investigate how to modify the adaptive procedure to suit systems with unevenly distributed observations and how to apply it to more sophisticated dynamic and observation systems.
No data sets were used in this article.
From Eq. (2), the normalized observation equation can be defined as follows:
Because the analysis state
The sensitivities of the analysis to the observation are defined as follows:
The authors declare that they have no conflict of interest.
This work is supported by the National Natural Science Foundation of China (grant no. 91647202), the National Basic Research Program of China (grant no. 2015CB953703), the National Natural Science Foundation of China (grant no. 41405098) and the Fundamental Research Funds for the Central Universities. The authors would like to gratefully acknowledge the two anonymous reviewers and the editor for their constructive comments, which helped significantly in improving the quality of this manuscript.
Edited by: Amit Apte
Reviewed by: two anonymous referees