Training a convolutional neural network to conserve mass in data assimilation

In previous work, it was shown that preservation of physical properties in the data assimilation framework can significantly reduce forecast errors. Proposed data assimilation methods, such as the quadratic programming ensemble (QPEns), that can impose such constraints on the calculation of the analysis are computationally more expensive, severely limiting their application to high-dimensional prediction systems as found in the earth sciences. We therefore propose to use a convolutional neural network (CNN) trained on the difference between the analysis produced by a standard ensemble Kalman filter (EnKF) and the QPEns to correct any violations of imposed constraints. In this paper, we focus on conservation of mass and show in an idealized setup that the hybrid of a CNN and the EnKF is capable of reducing analysis and background errors to the same level as the QPEns.


Introduction
The ensemble Kalman filter (EnKF; Evensen, 1994; Burgers et al., 1998; Evensen, 2009) and versions thereof are powerful data assimilation algorithms that can be applied to problems that need an estimate of a high-dimensional model state, as in weather forecasting. An important condition for a successful application of the EnKF to a large system is the use of localisation. Any localisation method aims to diminish sampling errors caused by the computational limitation of the ensemble size. By doing so, mass conservation as guaranteed by a numerical model is violated during the data assimilation (Janjić et al., 2014).
Fully replacing data assimilation by a NN has been attempted by Cintra and de Campos Velho (2014) in the context of a simplified atmospheric general circulation model. They trained on a cycling data set produced by the local ensemble transform Kalman filter (LETKF; Bishop et al., 2001; Hunt et al., 2007) and showed that the trained NN performs nearly as well as the LETKF with significantly reduced computational effort. Other applications of NNs in the context of data assimilation are observational bias correction (Jin et al., 2019) and tuning of covariance localisation (Moosavi et al., 2019). In this paper we take the approach that combining the NN with a data assimilation algorithm will allow extracting the most information from sparse and noisy observations, as argued in, for example, Brajard et al. (2020a).
We aim to produce better results than standard data assimilation algorithms at minimal additional computational cost, by training on data produced by the QPEns. We generate our training data by performing twin experiments with the one-dimensional modified shallow water model (Würsch and Craig, 2014), which was designed to mimic important properties of convection. These aspects include an acute regime switch when convection is triggered (conditional instability) and a significant time lag between the onset of convection and its observation. The model is briefly introduced in section 2.1, followed by the settings of the twin experiments in section 2.2. Section 2.3 provides a report on the generation of the training data. Since both our input and output are full model states, the obvious choice is to train a convolutional neural network (CNN), as the convolution with kernels naturally acts as a form of localisation. The CNN architecture we use for this application is described in section 2.4. The results are presented in section 3, followed by the conclusion in section 4.

The modified shallow water model

The modified shallow water model (Würsch and Craig, 2014) consists of the following equations for the velocity u, the rain r and the water height level of the fluid h, respectively. Above, h_c represents the level of free convection. When this threshold is reached, the geopotential φ takes on a lower, constant value φ_c. The parameters D_u, D_r, D_h are the diffusion constants corresponding to u, r, h, respectively. Coefficient γ = √(g h_0) is the gravity wave speed for the absolute fluid layer h_0 (h_0 < h_c). A small stochastic Gaussian-shaped forcing β_u is added at random locations to the velocity u at every model time step. This is done in order to trigger perturbations that lead to convection.
Parameters δ and α are the production and removal rates of rain, respectively. When h reaches the rain threshold h_r (h_r > h_c), rain is 'produced' by adding rain water mass to the potential, leading to a decrease of the water level and of buoyancy. The model conserves mass, so the spatial integral over h is constant in time.

Twin experiments
The nature run, which mimics the true state of the atmosphere, is a model simulation starting from an arbitrary initial state.
The ensemble is chosen to be of small size with N_ens = 10, and, like the nature run, each member starts from an arbitrary initial state. Observations are assimilated every dT model time steps and are obtained by adding a Gaussian error to the wind u and height h fields of the nature run at the corresponding time, with standard deviations of σ_u = 0.001 m/s and σ_h = 0.01 m, while a lognormal error is added to the rain field r with parameters of the underlying normal distribution µ = −8 and σ = 1.5.
For all variables the observation error is roughly 10% of the maximum deviation from the variable mean. To mimic radar data, observations for all variables are available only on grid points where rain above a threshold of 0.005 dBZ is measured.
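The observation generation described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name and array layout are ours, not taken from the released code, and the nature-run fields are assumed to be one-dimensional NumPy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_obs(u_true, h_true, r_true, rain_threshold=0.005):
    """Perturb the nature run to mimic radar observations (sketch).

    Gaussian errors are added to u and h, a lognormal error to r; the
    radar mask marks the grid points where rain exceeds the threshold.
    """
    u_obs = u_true + rng.normal(0.0, 0.001, u_true.shape)  # sigma_u = 0.001 m/s
    h_obs = h_true + rng.normal(0.0, 0.01, h_true.shape)   # sigma_h = 0.01 m
    # lognormal error: mu, sigma are parameters of the underlying normal
    r_obs = r_true + rng.lognormal(mean=-8.0, sigma=1.5, size=r_true.shape)
    mask = r_obs > rain_threshold  # radar observes only rainy grid points
    return u_obs, h_obs, r_obs, mask
```

With µ = −8 the median of the lognormal perturbation is e^−8 ≈ 3.4e-4, i.e. well below the 0.005 threshold, so only a small fraction of clear-sky points spuriously trigger the mask.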

A random selection of additional wind observations, amounting to 10% of the remaining grid points, is assimilated, which represents additional available data (for example obtained from aircraft).
To deal with undersampling, covariance localisation using a fifth-order piecewise rational function (Gaspari and Cohn, 1999) is applied with a localisation radius of four grid points. This corresponds to the localisation radius for which the EnKF yields minimum analysis RMSE values of the rain variable for an ensemble size of ten. An interior point method is used to solve the quadratic minimisation problems of the QPEns. The constraints that are applied are mass conservation, i.e. e^T (h^a − h^b) = e^T δh = 0, and positivity of precipitation, i.e. r^a = δr + r^b ≥ 0. Here, the superscript b denotes the background and a the analysis, and e is a vector of size n containing only values of one. For the EnKF, negative values for rain are set to zero if they occur.
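The two constraints can be evaluated directly on an analysis field. The following sketch (helper names are ours; fields are assumed to be NumPy arrays) computes the mass-conservation residual e^T δh and applies the clipping of negative rain used by the plain EnKF:

```python
import numpy as np

def mass_violation(h_analysis, h_background):
    """Residual e^T (h^a - h^b); exactly zero when mass is conserved."""
    return np.sum(h_analysis - h_background)

def clip_rain(r_analysis):
    """EnKF post-processing described above: negative rain set to zero."""
    return np.maximum(r_analysis, 0.0)
```

In the QPEns these conditions enter the quadratic minimisation as hard constraints; the helpers above only diagnose (or crudely enforce) them after the fact.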

The output of our training set Y_tr ∈ R^{N_ens T × n × 3} is simply a reshaped and normalized version of the data set {Q^a_t : t = 1, 2, ..., T}. For the input of our training set X_tr we choose to use an index vector indicating the position of the radar observations {I_t : t = 1, 2, ..., T} in addition to the unconstrained solutions {X^a_t : t = 1, 2, ..., T}, yielding X_tr ∈ R^{N_ens T × n × 4}, where the index vector I_t is copied N_ens times to obtain I*_t ∈ R^{N_ens × n × 3}. For u and h, the input and output data sets are normalized by subtracting the climatological mean before dividing by the climatological standard deviation. For r, we do not subtract the climatological mean in order to maintain positivity.
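A minimal sketch of this normalization step (the helper name and signature are ours): u and h are centred and scaled, while r is only scaled so that its values stay non-negative.

```python
import numpy as np

def normalize(field, clim_mean, clim_std, subtract_mean=True):
    """Normalize a training field by climatological statistics.

    For the rain variable, subtract_mean=False keeps the data
    non-negative, matching the relu output activation used for r.
    """
    if subtract_mean:
        return (field - clim_mean) / clim_std
    return field / clim_std
```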
A validation data set X_valid and Y_valid is created exactly as the training data set, but with a different random seed, to monitor the training process. For both the training and validation data sets we set T = 4800, which amounts to a total of N_ens T = 48000 training and validation samples, respectively. We choose to use a CNN with four convolutional hidden layers, each consisting of 32 filters with kernels of size 3 and the "selu" activation function.
The output layer therefore has 3 filters and the kernel size is again 3. Note that the "localisation radius", that is, the maximum influence radius of a variable as assumed by the CNN, is (3 − 1)/2 × 5 = 5, where 5 is the number of layers and 3 the kernel size. We use a linear activation function for u and h and the "relu" activation function for r to ensure non-negativity of rain. We set the batch size to 96 and run 100 epochs. Unless stated otherwise, the loss function is defined as the root mean squared error (RMSE) over the grid points, averaged over the variables:

J = (1/3) Σ_{v=1}^{3} [ (1/n) Σ_{i=1}^{n} (y^p_{j,i,v} − y_{j,i,v})² ]^{1/2},

where y^p_j denotes the CNN prediction for sample j and v indexes the variables u, h, r.
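Under the architecture stated above, a Keras sketch could look as follows. The mixed output activations are implemented here as two output heads that are concatenated, which is parameter-for-parameter equivalent to a single 3-filter output layer with per-channel activations; the grid size n = 250 comes from the experiments, while all other names are our assumptions, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

n = 250  # number of grid points in the experiments

inputs = layers.Input(shape=(n, 4))  # channels: u, h, r, radar index vector
x = inputs
for _ in range(4):                   # 4 hidden convolutional layers
    x = layers.Conv1D(32, 3, padding="same", activation="selu")(x)
uh = layers.Conv1D(2, 3, padding="same")(x)                    # linear: u, h
r = layers.Conv1D(1, 3, padding="same", activation="relu")(x)  # rain >= 0
model = tf.keras.Model(inputs, layers.Concatenate()([uh, r]))

def rmse_loss(y_true, y_pred):
    # RMSE over the grid points (axis 1), averaged over the three variables
    return tf.reduce_mean(
        tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred), axis=1)), axis=-1)

model.compile(optimizer="adam", loss=rmse_loss)
```

With five convolutional layers of kernel size 3, each grid point in the output depends on at most (3 − 1)/2 × 5 = 5 neighbours on either side, matching the influence radius quoted in the text.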

We assign the name dT5 to the experiment corresponding to a cycling period of 5 minutes, and dT10 to the experiment corresponding to a cycling period of 10 minutes. Figure 1 shows the evolution of the loss function averaged over the samples for the training and validation data sets for dT5 and dT10, respectively. Table 1 summarizes what the CNN has learned for each variable separately in the two cases. As the training data is normalized, we can conclude from the RMSE of the input data with respect to the output data (first row in the Table 1 panels) that the mass constraint on h and the positivity constraint on r impact the solution of the minimization problem for all variables with the same order of magnitude. Given our choice of loss function, it is not surprising that the relative reduction of the gap between the input and output by the CNN is proportional to the size of the gap. By aiming to minimize the mean RMSE of all variables, the CNN reduced the violation of the mass constraint by about 20% for both experiments. However, for dT5 the reduction in the bias of the height field is 100%, while for dT10 it is a mere 30%.

Table 1. The loss function J, the mean RMSE of the variables u, h, r, the absolute mass error divided by the number of grid points n for h and r, and the bias of h (columns), calculated for the input X_valid (top row) and the CNN prediction (middle row) with respect to the output Y_valid for the validation data sets. The last row shows the improvement in percent of the prediction towards the output compared to the input.

                  J        u        h        r        mass(h)  mass(r)  bias(h)
Input             9.6e-2   7.9e-2   9.0e-2   1.2e-1   3.6e-2   8.0e-3   3.3e-2
Prediction        6.5e-2   7.3e-2   6.9e-2   5.4e-2   2.9e-2   7.1e-3   2.3e-2
Improvement (%)   32       7.8      24       55       20       11       30
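The improvement row of Table 1 follows directly from the first two rows; recomputing it from the two-significant-figure entries reproduces the reported percentages to within rounding of the underlying, more precise values:

```python
import numpy as np

# Table 1 columns: J, u, h, r, mass(h), mass(r), bias(h)
inp  = np.array([9.6e-2, 7.9e-2, 9.0e-2, 1.2e-1, 3.6e-2, 8.0e-3, 3.3e-2])
pred = np.array([6.5e-2, 7.3e-2, 6.9e-2, 5.4e-2, 2.9e-2, 7.1e-3, 2.3e-2])

# relative improvement of the prediction over the input, in percent
improvement = 100.0 * (inp - pred) / inp  # close to [32, 7.8, 24, 55, 20, 11, 30]
```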

Next, we are interested in how the CNNs perform when applied within the data assimilation cycling. In Figure 2, we compare the performance of the EnKF, the QPEns and the hybrid of CNN and EnKF, where the CNN is applied as a correction to the initial conditions computed by the EnKF. To avoid having to train a CNN for the spin-up phase, where the increments are larger, we start the data assimilation for the EnKF and the CNN from the initial conditions produced by the QPEns at the 20th cycle.
With respect to RMSEs, for dT5 the CNN performs as well as the QPEns, despite having learned during training only 27% of the difference between the EnKF and QPEns analyses in terms of the loss function. For dT10 the CNN does perform significantly better than the EnKF, but clearly remains inferior to the QPEns. Given that in terms of the RMSE over the grid points the CNN for dT10 is slightly better than the one for dT5, we hypothesize that the key to good performance of the CNN applied within the data assimilation cycling lies in preventing the accumulation of mass in h. When mass accumulates in clear regions, that is, regions where h < h_c holds for the nature run, it has a snowball effect not only on h itself but also on r. To support this claim we trained an additional CNN with the training set corresponding to dT = 10, with a penalty term for the mass of h added to the loss function:

Ĵ = J + η | (1/n) Σ_{i=1}^{n} (y^p_{j,i,2} − y_{j,i,2}) |,

where the parameter η is tunable and index 2 denotes the variable h. The larger η, the better the mass of h is conserved, at the expense of the RMSE, see Figure 4. We found a good trade-off for η = 2. We refer to this experiment as dT10_η=2. The training process is illustrated in Figure 4. The mass conservation term comprises about 40% of the total loss function Ĵ. Both terms of the loss function are decreasing at approximately the same rate throughout the entire training process. Comparing Table 1 with Table 2, we conclude that by adding the penalty term for the mass violation in the loss function, 7% of improvement was lost in terms of the loss function J, but 29% was gained in the conservation of mass.
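A sketch of the penalised loss Ĵ in Keras, under the assumption (matching the table ordering u, h, r) that zero-based channel index 1 holds h; the function names are ours:

```python
import tensorflow as tf

eta = 2.0  # trade-off weight; eta = 2 was used in the dT10 experiment

def rmse_loss(y_true, y_pred):
    # J: RMSE over the grid points, averaged over the three variables
    return tf.reduce_mean(
        tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred), axis=1)), axis=-1)

def mass_penalty_loss(y_true, y_pred):
    """J-hat = J + eta * |mean_i(y_pred_h - y_true_h)| (sketch).

    The penalty compares the spatial means of h, i.e. the total mass
    divided by the number of grid points n; it is a scalar per sample.
    """
    mass_err = tf.abs(tf.reduce_mean(y_pred[:, :, 1] - y_true[:, :, 1], axis=1))
    return rmse_loss(y_true, y_pred) + eta * mass_err
```

Since the penalty acts on a single scalar per sample, the choice between an L1 and an L2 norm is immaterial here, as the authors note in the review response below.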
Figure 4. Left: relative improvement in % of the RMSE (orange) and mass error (green) towards the output with respect to the input as a function of η. Right: value of the loss functions Ĵ (solid) and J (dashed) averaged over samples for the training (red) and validation (blue) data sets as a function of epochs for dT10_η=2.

Geoscience phenomena have several aspects that are different from standard data science applications, for example governing physical laws, noisy observations that are non-uniform in space and time, observations from many different sources, and rare interesting events. This makes the use of NNs particularly challenging for convective-scale applications, although attempts have been made for predicting rain, hail or tornadoes (McGovern et al., 2019). The approach taken in this study combines noisy and sparse observations with a dynamical model using a data assimilation algorithm, and in addition uses a CNN to improve on conservation of physical laws. In previous work it was shown in idealised setups that conserving physical quantities like mass in the data assimilation framework using the QPEns can significantly improve the estimate of the nature run. Here we show that it is possible to obtain similar positive results by training a CNN to conserve mass in a weak sense. By training on the unconstrained (EnKF)/constrained (QPEns) input/output pair, the CNN is able to reduce the mass violation significantly. Moreover, we found that adding a penalty term for mass violation in the loss function is necessary in one of the two test cases to produce data assimilation results that are as good as those corresponding to the QPEns.
Code availability. https://github.com/wavestoweather/MSW_DA_ML

Author contributions. All of the authors contributed to the research behind this paper, as well as to the writing of the manuscript.

Review response
December 15, 2020

1 List of relevant changes of the manuscript

• Several additional references are included.
• The penalty term in the loss function has been redefined. The new penalty term equals the old penalty term divided by the number of grid points. This has led to re-training the CNN for this experiment, and so the tables and plots have been updated. Conclusions remain mostly unchanged.
• New experiments were conducted to investigate the trade-off between mass conservation and RMSE.

Answers to Marc Bocquet
Thank you for taking the time to thoroughly read our manuscript and for the positive feedback. We agree with your comments and have adjusted the manuscript accordingly, see below.

Possible improvements
This is a nicely written paper with a clear-cut organisation. The paper is convincing and well illustrated. Among possible improvements, I would list:

• The manuscript may be a bit short and could benefit from more in-depth or additional experiments if relevant.

We performed additional experiments to investigate the trade-off between mass conservation and RMSE, which are now summarized in Figure 4. Note that we have changed the definition of the penalty term by comparing the mean fields of h, not the sum, so the penalty term is divided by n = 250 now.
• A few relevant and more recent references could be added (recent is very short in this subject).

We added the following references.

• It would be much better to make the codes available for the sake of repeatability, as is customary in the machine learning community; maybe not all of them, since that may become tedious, but for instance the model and the machine learning code pieces.

We will provide the code.
• The line and equation numbering could/should be corrected/improved.

Fixed.

Please see below for the details about these suggestions. Overall, I believe the manuscript only requires minor revisions, but that they should be very carefully addressed.
2 Suggestions and typos

1. l.4-6: In order to produce from a less computationally expensive, unconstrained analysis, a solution that is closer to the constrained analysis, we propose to use a convolutional neural network (CNN) trained on analyses produced by the QPEns.: The sentence is difficult to understand because: (i) there should not be a comma between expensive, unconstrained; (ii) closer: what do you compare to? This is confusing because of the beginning of the sentence; close may work better here.

We rephrased to: We therefore propose to use a convolutional neural network (CNN) trained on the difference between the analysis produced by a standard ensemble Kalman Filter (EnKF) and the QPEns to correct any violations of imposed constraints.
2. l.8-9: To obtain these positive results, it was in one case necessary to add a penalty term to the loss function of the CNN training process.: This is too vague a statement for an abstract. In my opinion, you should make it more precise or remove it (since the abstract is not long, the former is better).

We removed it.

has actually been accepted as Brajard et al. (2020a). Can you please update the reference?

Fixed.

8. l.36: combining NN with a knowledge based model as a hybrid forecasting approach (Pathak et al., 2018b): I believe Brajard et al. (2020b), which recently appeared, is also a very relevant citation for your manuscript because, as opposed to Pathak et al. (2018), who rely on only one degree of freedom in model error and reservoir computing, Brajard et al. (2020b) have many degrees of model error freedom and rely on CNNs, like you do.

We added the reference.

9. l.75: Gaussian stochastic forcing β_u has a half width of 4 grid points: Is this remark about the correlation length of the covariance matrix?

No, a Gaussian-shaped term β_u is added to the wind field at each model time step at a random location (see line 72 of the new manuscript).
What are µ and σ? You know that it can be ambiguous for log-normal distributions (whether you consider the variable or the log-variable).

Good point. We rephrased: a lognormal error is added to the rain field with parameters of the underlying normal distribution µ = −8 and σ = 1.5.

11. l.87: using 5-th order polynomial function (Gaspari and Cohn, 1999): I believe that what you use is actually a 5th-order piecewise rational function, is it?

Thank you. We fixed it.
17. p.5: Equation defining the loss function (no number and line numbers skipped): Why do you take the square root and not the MSE, which is available in TensorFlow/Keras?

Because we also look at the RMSE when verifying the data assimilation results, so this was just an easier direct comparison. We checked that using the MSE of TensorFlow/Keras yields similar results. However, the tables look very different when expressed in terms of MSEs instead of RMSEs. For example, the improvement in terms of RMSE is 32%, whereas in terms of MSE it is 59%.
(ii) Please give the reference to Chollet's book instead, which is the Keras bible as well as an excellent introduction to TensorFlow/Keras and more generally deep learning (Chollet, 2017).

Fixed.

19. It would be better to provide your codes. Maybe not all pieces, but for instance the original ones like the convection model and the TensorFlow code.
We will provide the code.

20. l.135 and Figure 2: Did you average your RMSEs over several learning and/or test experiments? It is possible that the curves are significantly dependent on the initial random seed. If not, I do not expect any unpleasant surprises but more reliable (and less noisy) curves, potentially with error bars. Please clarify.

We averaged over 500 experiments. This information is now included in the figure captions.

21. p.9; Table 2

Yes, that is a good point, since the CNN has an influence radius of 5 grid points and the data assimilation a radius of 8 grid points. We therefore trained an additional CNN with the kernels of all layers of size 5, so that the influence radius is 10. This gave us, however, similar results as in Figures 5 and 6. We added this discussion in the text: "Since the CNN only has an influence radius of 5 grid points and the localisation cut-off radius of the data assimilation is 8 grid points, it is possible that the better results of the CNN stem from this shorter influence radius. However, a CNN trained on the same data but with kernel sizes of 5 instead of 3 (leading to an influence radius of 10 grid points) yields similar results as in Figures 5 and 6 (not shown)."

23. l.175: the sentences are a bit awkward, I suggest (2 corrections): the CNN was able to reduce the mass violation significantly. Moreover,

Fixed.

24. Acknowledgements: There seems to be a useless at the beginning.

Fixed.

Answers to Svetlana Dubinkina
This is a well-written manuscript with very interesting results. My major comment is that this manuscript is rather short and that it could be extended to give more insightful results.

Thank you for taking the time to thoroughly read our manuscript and for the positive feedback. We have taken your comments into account and have adjusted the manuscript accordingly, see below.
Major comments:

1. I would be in favour of seeing how the conclusions change depending on the grid size and the ensemble size.
The relative improvement of the QPEns over the EnKF for different DA settings (including ensemble size) has been covered in Ruckstuhl and Janjic (2018). We added a sentence at the end of section 2.2: "We refer to Ruckstuhl and Janjic (2018) for a comparison of the performance of the EnKF and the QPEns as a function of ensemble size for different localisation radii, assimilation windows and observation coverage." We are confident that, as long as the CNN can remove the bias in h, the CNN can match the performance of the QPEns for any DA setting. One could then try to compare the differences in the training process of the CNNs (does one setting require more data than the other?). However, a clean comparison among the different settings would require rigorous tuning of the architecture, the amount of training data needed, and the training process of the CNN. And this tuning is very tricky, because we are interested in the performance of the CNN in the DA context, not the value of the loss function.
Since we are working in a highly idealised setup, we want to be economical with time spent on fine-tuning the CNNs. We therefore feel that we have exploited the modified shallow water model on this specific topic. Any further experiments should be done on more complex models. That being said, we did additionally investigate the trade-off between mass and RMSE as you suggested in the next point, and performed additional experiments that test the relation of the kernel size of the CNN to localisation.
2. There is a trade-off between mass conservation and low RMSE for u and h. What happens if, in the experiments with the additional penalty term for mass conservation, instead of a linear activation function for u and h, the relu activation function is used for both u and h as well as for r? Is the trade-off smaller then?
The reason we use the relu function for r is that the rain cannot be negative. This does not hold for u and h, so using the relu function for these variables is not an option. We did perform some additional experiments to investigate the trade-off between mass conservation and RMSE, which are now summarized in Figure 4.
3. Authors remove the climatological mean from u and h. What happens if the climatological mean is not subtracted? Is the bias too high for the methods to handle?

We want to clarify that we only subtract the mean to make the problem better conditioned for the training process. Since the differences between input and output data for the 3 variables differ by at most one order of magnitude, we do not expect huge problems if the mean is not subtracted. However, as far as we know, there is no disadvantage to normalizing the training data, which is why we have not tried training the CNN on the raw data.
Minor comments: 1. l.8: The last sentence of the abstract is rather vague. Please elaborate.
We have removed this sentence.
2. l.146: Does the loss function J account for the mass twice: in J and in the penalty term?

J is the RMSE averaged over the 3 variables. This means that J accounts for the mass error indirectly (as the RMSE goes to zero, the mass error also goes to zero). The penalty term directly accounts for mass errors by first averaging the h field over the 250 grid points for y_p and y separately, and then taking the difference.
3. Please change γ to something else, since it is already reserved for the gravity wave speed. Yes, you are right. We changed it to η.
4. Why is the penalty term chosen in such a way, namely the L1 norm and not L2 as in J?

Note that J takes the norm of a vector of size 250, whereas the penalty term takes the norm of a scalar (namely the difference of the spatial means of h). Therefore the L1 and L2 norms are equivalent for the penalty term.
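This equivalence is easy to verify numerically (a trivial sketch of our own):

```python
import numpy as np

x = np.array([-0.7])            # the penalty term reduces to a single scalar
l1 = np.abs(x).sum()            # L1 norm of a length-one vector
l2 = np.sqrt((x ** 2).sum())    # L2 norm of the same vector
# both equal |x|, so the choice of norm is immaterial for the penalty
```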
5. If I look at Fig. 2(a), I see that the NN is performing slightly better than the QPEns. Is there an explanation for that?

The QPEns is also not perfect, so it is possible that the CNN performs better. It is indeed then interesting to speculate why that is. Most of the last paragraph before the conclusion is dedicated to this (from line 174 to 184 in the new manuscript).
6. l.92: For the EnKF negative values for rain are set to zero if they occur. This is the variable r, if I understand correctly. However, if I look at Figure 7, I see negative values of r for the EnKF. Could the authors please explain?
The fields are shown before negative values are set to zero. We have clarified this in the caption.

7. A table consisting of wall-clock times for the different methods would be insightful regarding the computational cost gain.

The costs of applying the NN are negligible with respect to the costs of the EnKF, as mentioned in line 50 of the new manuscript. So it is about the difference in computational costs between the EnKF and the QPEns. Since we are working with a cheap model, no effort has been made in the implementation of the algorithms to make them computationally efficient. Therefore wall-clock times may be misleading. However, we agree this is an important point, and we in fact have a paper under review that thoroughly discusses the computational costs of the QPEns. We therefore added a sentence in the introduction: "For a detailed discussion on the computational costs of the QPEns we refer to Janjic et al. (under review)".
8. I do not want to be self-promoting, but the authors could have a look at Dubinkina 2018 and decide if they would like to refer to it in their manuscript.
Thanks for mentioning this paper. We have added a reference to this manuscript.