Data-driven Reconstruction of Partially Observed Dynamical Systems

Abstract. The state of the atmosphere, or of the ocean, cannot be exhaustively observed. Crucial parts might remain out of reach of proper monitoring. Also, defining the exact set of equations driving the atmosphere and ocean is virtually impossible because of their complexity. Hence, the goal of this paper is to obtain predictions of a partially observed dynamical system without knowing the model equations. In this data-driven context, the article focuses on the Lorenz-63 system, where only the second and third components are observed and access to the equations is not allowed. To account for those strong constraints, a combination of machine learning and data assimilation techniques is proposed. The key aspects are the following: the introduction of latent variables, a linear approximation of the dynamics, and a database that is updated iteratively, maximizing the innovation likelihood. We find that the latent variables inferred by the procedure are related to the successive derivatives of the observed components of the dynamical system. The method is also able to accurately reconstruct the local dynamics of the partially observed system. Overall, the proposed methodology is simple, easy to code, and gives promising results, even in the case of small amounts of observations.

as the deep ocean (see e.g., Jayne et al., 2017), and data-driven methods fail to make good predictions. To deal with those strong constraints, i.e., when the model is unknown and the state is partially observed, one option is to use time-delay embedding of the available components of the system (Takens, 1981; Brunton et al., 2017), whereas another option is to find latent representations of the dynamical system (see e.g., Talmon et al., 2015; Ouala et al., 2020). In this study, we will show that there are strong relationships between those two approaches.
Here, we propose a simple algorithm using linear and Gaussian assumptions, based on a state-space formulation. This classic Bayesian framework, used in data assimilation, is able to deal with a dynamical model (model- or data-driven) and observations (partial and noisy). Three main ideas are used: (i) an augmented state formulation (Kitagawa, 1998), (ii) a global linear approximation of the dynamical system (Korda and Mezić, 2018), and (iii) estimation of the parameters using an iterative algorithm combined with Kalman recursions (Shumway and Stoffer, 1982). The proposed framework is probabilistic: the state of the system is approximated using a Gaussian distribution (with a mean vector and a covariance matrix). The algorithm is iterative: a catalog is updated at each iteration and used to learn a linear dynamical model. The final estimate of this catalog corresponds to a new system of variables.

The paper is organized as follows. Firstly, the methodology is explained in section 2. Secondly, section 3 describes the experiment using the Lorenz-63 system. Thirdly, the results are reported in section 4. The conclusions and perspectives are drawn in section 5.

Methods
The methodology proposed in this paper is borrowed from data assimilation, machine learning, and the theory of dynamical systems. It is summarized in Fig. 1 and explained below.
In data assimilation, the goal is to estimate, from partial observations y, the full state of a system x. When the dynamical model used to propagate x in time is available (i.e., when model equations are given), classic data assimilation techniques are used to retrieve unobserved components of the system. For instance, in the Lorenz-63 system (Lorenz, 1963), if only two variables (x_2 and x_3 in the example defined below) are observed, knowing the Lorenz equations (a system of three ordinary differential equations), it is possible to retrieve the unobserved one (x_1 in our example below).

Now, if the model equations are not known and observations of the system are available over a sufficient period of time, it is possible to use data-driven methods to mathematically approximate the dynamics of the system. In this paper, a linear approximation is used to model the relationship of the state vector x between two time steps. It is parameterized with the matrix M, whose dimension is equal to the square of the state-space dimension. Moreover, a linear observation operator is introduced to relate the partial observations y and the state x. It is written using a matrix H, with its dimension equal to the observation-space dimension times the state-space dimension.

Figure 1. Schematic of the proposed methodology, illustrated using the Lorenz-63 system. The algorithm is initialized with a Gaussian random noise for the hidden component (i.e., z_1) and with partial observations of the system (i.e., y_2 and y_3). Then, an iterative procedure is applied with a linear regression, a covariance computation, the Kalman recursions, and a random sampling. This algorithm iteratively maximizes the likelihood of the observations, noted L. After convergence of the algorithm, the hidden component z_1 is stabilized and represented by a Gaussian distribution with mean x^s_1 and variance P^s_1.
Mathematically, the matrices (M, H) and vectors (x, y) are linked using a Gaussian and linear state-space model such that

x_t = M x_{t-1} + η_t,    (1a)
y_t = H x_t + ϵ_t,    (1b)
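The linear and Gaussian state-space model of Eqs. (1) can be simulated directly. The sketch below is a minimal illustration with hypothetical dimensions and parameter values (they are not those of the paper): a 3-dimensional state propagated by M with model noise Q, observed through a truncated identity H with observation noise R.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 3 state components, 2 of them observed.
n_x, n_y, T = 3, 2, 100

M = 0.95 * np.eye(n_x)      # linear dynamical model
H = np.eye(n_x)[:n_y, :]    # truncated identity: observe the first 2 components
Q = 0.01 * np.eye(n_x)      # model error covariance
R = 0.04 * np.eye(n_y)      # observation error covariance

x = np.zeros((T, n_x))
y = np.zeros((T, n_y))
x[0] = rng.normal(size=n_x)
for t in range(1, T):
    x[t] = M @ x[t - 1] + rng.multivariate_normal(np.zeros(n_x), Q)  # Eq. (1a)
    y[t] = H @ x[t] + rng.multivariate_normal(np.zeros(n_y), R)      # Eq. (1b)
```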
where t is the time index and η_t and ϵ_t are unbiased Gaussian vectors, representing the model and observation errors, respectively. Their covariance matrices are noted Q and R, respectively. Those matrices indirectly control the respective weights given to the model and to the observations, and tuning them is an important part of state-space modeling (see Tandeo et al., 2020, for a more in-depth discussion).
In such a data-driven problem where only a part of the system is observed, a first natural step is to consider that the state x is directly related to the observations y. For instance, in the example of the Lorenz-63 previously introduced, observations correspond to the second and third components of the system (i.e., x 2 and x 3 , formally defined later).
In this paper, we propose to introduce a hidden vector, noted z, corresponding to one or more hidden components that are not observed. To this purpose, the state is augmented using this hidden component z, the observation vector y does not change, and the operator H is a truncated identity matrix. The use of an augmented state-space is classic in data assimilation and mostly refers to the estimation of unknown parameters of the dynamical model (see Ruiz et al., 2013, for further details).
The hidden vector z is now accounted for in the linear model M, given in Eq. (1a), whose dimension has increased. The hidden components are completely unknown and are thus randomly initialized using Gaussian white noises, parameterized by σ², their level of variance. The next step is to infer z using a statistical estimation method. Starting from the random initialization, an iterative procedure is proposed, based on the maximization of the likelihood.
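The state augmentation and white-noise initialization described above can be sketched as follows. This is a minimal illustration under assumed shapes (a stand-in observed series of two components, one hidden component z_1); the truncated-identity observation operator keeps only the observed columns.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 500
y = rng.normal(size=(T, 2))   # stand-in for the observed series (y_2, y_3)
sigma2 = 1.0                  # variance of the white-noise initialization

# Augmented catalog: observed components plus one randomly initialized
# hidden component z_1 (more columns can be appended for z_2, z_3, ...).
z1 = rng.normal(0.0, np.sqrt(sigma2), size=(T, 1))
x_aug = np.hstack([y, z1])    # state is now [y_2, y_3, z_1]

# Truncated identity: keeps the observed components, discards the hidden one.
H = np.hstack([np.eye(2), np.zeros((2, 1))])
assert np.allclose(H @ x_aug.T, y.T)
```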

The proposed approach is based on the linear and Gaussian state-space model given in Eqs. (1) and thus uses the classic Kalman filter and smoother equations. It is inspired by the Expectation-Maximization algorithm (noted EM, see Shumway and Stoffer, 1982) and is able to iteratively estimate the matrices M and Q. In this paper, R is assumed known and negligible. The criterion used to update those matrices is based on the innovations, defined as the difference between the observations y and the forecast of the model M, noted x^f. The log-likelihood of the innovations, noted L, is written as

L = -1/2 Σ_{t=1}^{T} [ log |Σ_t| + (y_t - H x^f_t)^⊤ Σ_t^{-1} (y_t - H x^f_t) ],    (2)

where Σ_t = H P^f_t H^⊤ + R, with P^f_t = M P^a_{t-1} M^⊤ + Q, and P^a_{t-1} corresponds to the state covariance estimated by the Kalman filter at time t-1. The innovation likelihood given in Eq. (2) is interesting because it corresponds to the squared distance between the observations and the forecast, normalized by their uncertainties, represented by the covariance Σ_t.
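A standard Kalman filter accumulating the innovation log-likelihood can be sketched as below. This is a generic textbook implementation, not the authors' code; the function name and argument layout are our own, and the constant term of the Gaussian log-density is dropped.

```python
import numpy as np

def kalman_filter_loglik(y, M, H, Q, R, x0, P0):
    """Kalman filter returning the innovation log-likelihood (up to a constant).

    y is the (T, n_y) observation sequence; x0 and P0 are the initial
    mean and covariance of the state.
    """
    x_a, P_a = x0.copy(), P0.copy()
    loglik = 0.0
    for t in range(len(y)):
        # Forecast step: propagate mean and covariance with the linear model
        x_f = M @ x_a
        P_f = M @ P_a @ M.T + Q
        # Innovation and its covariance Sigma_t = H P_f H^T + R
        d = y[t] - H @ x_f
        S = H @ P_f @ H.T + R
        loglik += -0.5 * (np.log(np.linalg.det(S)) + d @ np.linalg.solve(S, d))
        # Update step: Kalman gain and analysis
        K = P_f @ H.T @ np.linalg.inv(S)
        x_a = x_f + K @ d
        P_a = (np.eye(len(x_f)) - K @ H) @ P_f
    return loglik
```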
At each iteration i of the augmented Kalman procedure, the estimate of the matrix M is given by the least-squares estimator, using a linear regression such that

M^(i) = ( Σ_{t=2}^{T} x^(i-1)_t (x^(i-1)_{t-1})^⊤ ) ( Σ_{t=2}^{T} x^(i-1)_{t-1} (x^(i-1)_{t-1})^⊤ )^{-1},    (3)

where x^(i-1) corresponds to the output catalog of the previous iteration (the result of a Kalman smoothing and a Gaussian sampling, explained in more detail below). Following Eq. (1a), the covariance Q is estimated empirically using the estimate of M given in Eq. (3), such that

Q^(i) = 1/(T-1) Σ_{t=2}^{T} ( x^(i-1)_t - M^(i) x^(i-1)_{t-1} ) ( x^(i-1)_t - M^(i) x^(i-1)_{t-1} )^⊤.    (4)

Then, a Kalman smoother is applied using the M^(i) and Q^(i) matrices estimated in Eq. (3) and Eq. (4). At each time t, it yields a Gaussian mean vector x^s_t and a covariance matrix P^s_t. As input to the next iteration of the algorithm, the catalog x^(i) is updated using a Gaussian random sampling from x^s_t and P^s_t at each time t. This random sampling is used to exploit the correlations between the components of the state vector and also to avoid being trapped in a local maximum, as in stochastic EM procedures (Delyon et al., 1999).
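The two estimation steps (least squares for M, empirical covariance for Q) and the Gaussian resampling of the catalog can be sketched as below. The function names are ours, and the implementation is a minimal illustration of Eqs. (3)-(4) rather than the authors' exact code.

```python
import numpy as np

def update_model(catalog):
    """Estimate M (Eq. 3) and Q (Eq. 4) from a catalog of shape (T, n_x)."""
    X0, X1 = catalog[:-1], catalog[1:]
    # Least-squares solution of X1 ~ X0 @ M.T, i.e. x_t = M x_{t-1}
    M = np.linalg.lstsq(X0, X1, rcond=None)[0].T
    resid = X1 - X0 @ M.T
    Q = resid.T @ resid / (len(X0) - 1)   # empirical model-error covariance
    return M, Q

def resample_catalog(x_s, P_s, rng):
    """Draw the next catalog from the smoothed means x_s and covariances P_s."""
    return np.stack([rng.multivariate_normal(m, P) for m, P in zip(x_s, P_s)])
```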
The likelihood calculated at each iteration of the procedure increases until convergence. The algorithm is stopped when the likelihood difference between two iterations becomes small. The solution of the proposed method is given by the last Gaussian mean vectors x^s_t and covariance matrices P^s_t calculated at each time t. The component corresponding to the latent component z is finally retrieved, with information about its uncertainty.
Experiment

The methodology is tested on the Lorenz-63 system (Lorenz, 1963). This 3-dimensional dynamical system models the evolution of the convection (x_1) as a function of the horizontal (x_2) and vertical (x_3) temperature gradients. The evolution of the system is governed by three ordinary differential equations such that

dx_1/dt = σ (x_2 - x_1),    (5a)
dx_2/dt = x_1 (ρ - x_3) - x_2,    (5b)
dx_3/dt = x_1 x_2 - β x_3.    (5c)

In this paper, it is assumed that x_1 is never observed; only x_2 and x_3 are observed over a short period of time. First, a white noise z_1 is used to augment the state such that x = [x_2, x_3, z_1], and the iterative algorithm is applied until stabilization of z_1. After that, a new white noise z_2 is used to augment the state such that x = [x_2, x_3, z_1, z_2], the vector y = [y_2, y_3] remains the same, and the iterative algorithm is applied until stabilization of z_2. As long as the stabilized likelihood continues to increase with the addition of a hidden component, this augmented state procedure is repeated.
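A trajectory of the Lorenz-63 system with the classic parameter values (σ = 10, ρ = 28, β = 8/3) can be generated as below; the integration scheme (fourth-order Runge-Kutta), the time step, and the initial condition are our assumptions, not necessarily those of the paper. Only the last two columns play the role of the observations.

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 equations (Eqs. 5a-5c)."""
    x1, x2, x3 = state
    return np.array([sigma * (x2 - x1),
                     x1 * (rho - x3) - x2,
                     x1 * x2 - beta * x3])

def integrate_rk4(f, state, dt, n_steps):
    """Fourth-order Runge-Kutta time integration."""
    out = np.empty((n_steps, len(state)))
    for i in range(n_steps):
        k1 = f(state)
        k2 = f(state + 0.5 * dt * k1)
        k3 = f(state + 0.5 * dt * k2)
        k4 = f(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = state
    return out

traj = integrate_rk4(lorenz63, np.array([8.0, 0.0, 30.0]), dt=0.01, n_steps=1000)
y = traj[:, 1:]   # only x_2 and x_3 are observed
```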

Results
Using the experiment presented in section 3, three hidden components z_1, z_2, and z_3 were sequentially added. They are reported in Fig. 2, as well as the true Lorenz components x_1, x_2, and x_3. Although they do not fit the hidden variable x_1 of the Lorenz system, the first two hidden components z_1 and z_2 show time variations. On the contrary, z_3 is very flat, with a large confidence interval. This suggests that our method has identified that two hidden variables are enough to retrieve the dynamics of the two observed variables.

This is confirmed by the evaluation of the likelihood of the observations y_2 and y_3 with different linear models, obtained with or without the use of hidden components z (Fig. 3). As the proposed method is stochastic, 50 independent realizations of the likelihood are shown for each experiment. The 50 realizations differ because of the random values given to the added hidden variable at the beginning of the iterative procedure. In the naive case where the state of the system is [x_2, x_3] (black dashed line), the likelihood is small. Then, adding successively z_1 (green lines) and z_2 (red lines), after 30 iterations of the proposed algorithm, the likelihood significantly increases. Finally, the inclusion of z_3 reduces the likelihood (purple lines). Those results indicate that the best linear model to predict the variations of the observations y_2 and y_3 is the one using two hidden components. Thus, for the rest of the paper, the focus is on the model with the augmented state x = [x_2, x_3, z_1, z_2].
To compare more precisely the performance of the naive linear model M with [x_2, x_3] and the one with [x_2, x_3, z_1, z_2], their forecasts are evaluated. The distance between the forecasts and the truth (i.e., the error) is computed at each time t in the (x_2, x_3) space (using the observation operator H), such that

err_t = ∥ H x^true_t - H x^f_t ∥,    (6)

where ∥.∥ represents the Euclidean norm. The errors decrease significantly (by half on average, not shown) from the model using [x_2, x_3] to the model using [x_2, x_3, z_1, z_2]. However, this error reduction is not homogeneous over the attractor. Figure 4 indicates when the two models are similar (values close to 0) and when the model including z_1 and z_2 is better (values close to 1). The improvement is moderate on the outside of the wings of the attractor, important in the wing transitions, and almost absent inside the wings (e.g., for x_2 close to 10).

The question is now: what is the significance of those hidden components z_1 and z_2 estimated using the proposed methodology? Are they correlated with the unobserved component x_1 or with the observed ones x_2 and x_3? Are they somehow proxies of the unobserved component? It has been found that the hidden components z correspond to linear combinations of the derivatives of the observations such that

z_1 = a_2 ẋ_2 + a_3 ẋ_3,    (7a)
z_2 = b_1 ż_1 + b_2 ẋ_2 + b_3 ẋ_3.    (7b)

When developing Eq. (7b) using Eq. (7a), the second hidden component writes z_2 = b_2 ẋ_2 + b_3 ẋ_3 + b_1 a_2 ẍ_2 + b_1 a_3 ẍ_3. It shows that z_1 uses the first derivatives of x_2 and x_3, whereas z_2 also uses the second derivatives. This result makes the link with Taylor's and Takens' theorems, which show that an unobserved component (i.e., x_1) can be replaced by the observed components (i.e., x_2 and x_3) at different time lags.

Note that, due to the stochastic behaviour of the algorithm, the a and b coefficients are not fixed, and several combinations of them can reach the same performance in terms of likelihood. This is illustrated in Fig. 3, with 50 independent realizations of the proposed algorithm. When considering only z_1 (green lines), the algorithm converges to various solutions, mainly clustered around two solutions (corresponding to a minimum and a maximum of likelihood). Those minimum and maximum likelihoods correspond to a_3 ≈ 0 and to a_2 ≈ 0, respectively. This suggests that ẋ_3 is more important than ẋ_2 to explain the variations of the Lorenz system (this is consistent with the investigation of Sévellec and Fedorov, 2014, in a modified version of the Lorenz-63 model). Then, when considering z_1 and z_2 (red lines), the 50 independent realizations reach the same likelihood after 30 iterations. It means that if the algorithm focuses on the estimation of a_2 when considering only z_1, it will then focus on b_3 when introducing z_2; in terms of forecast performance, this is similar to focusing first on a_3 and then on b_2, because the final likelihood values are similar.
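The relation between a recovered hidden component and the derivatives of the observations can be checked numerically. The diagnostic below is our own sketch (the function name and the finite-difference approach are assumptions): it regresses a component z on the first and second finite-difference derivatives of the observed series and reports the quality of the fit.

```python
import numpy as np

def derivative_fit(z, y, dt):
    """Regress a hidden component z on the finite-difference first and
    second derivatives of the observed series y (shape (T, 2)).

    Returns the fitted coefficients and the R^2 of the fit, as a diagnostic
    of the relation z ~ a_2*dx_2/dt + a_3*dx_3/dt (+ second derivatives).
    """
    d1 = np.gradient(y, dt, axis=0)    # first derivatives of x_2, x_3
    d2 = np.gradient(d1, dt, axis=0)   # second derivatives of x_2, x_3
    A = np.hstack([d1, d2])            # design matrix
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    resid = z - A @ coef
    r2 = 1.0 - resid.var() / z.var()
    return coef, r2
```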
Conclusions and perspectives

In this article, the goal is to retrieve hidden components of a dynamical system that is partially observed. The proposed methodology is purely data-driven, not model-driven (i.e., it does not use any equations of the dynamical model). It is based on the combination of data assimilation and machine learning techniques. Three main ideas are used in the methodology: an augmented state strategy, a linear approximation of the dynamical system, and an iterative procedure. The methodology is easy to implement, using simple strategies and well-established algorithms: the Kalman filter and smoother, linear regression using least squares, an iterative procedure inspired by the EM recursions, and Gaussian random sampling for the stochastic aspect.
The methodology is tested on the Lorenz-63 system, where only two components of the system are observed over a short period of time. Several hidden components are introduced sequentially in the system. Although the hidden components are initialized randomly, only a few iterations of the proposed algorithm are necessary to retrieve relevant information. The recovered components are expressed with Gaussian distributions. The new components correspond to linear combinations of successive derivatives of the observed variables. This result is consistent with the theorems of Taylor and Takens, which show that time-delay embedding is useful to improve the forecasts of the system. In our case, this is evaluated using the likelihood: a metric to evaluate the innovation (i.e., the difference between Gaussian forecasts and Gaussian observations).
Using our methodology, we do not retrieve the true missing Lorenz component and need two hidden variables to represent a single missing one. The reason for this mismatch is twofold and is mainly due to the linear approximation of the dynamical system, which implies that: (1) the true missing component, which need not be a linear combination of the observed variables, is impossible to retrieve in our framework, and (2) two variables, using combinations of the time derivatives of the observed variables, are needed to accurately represent the complexity of the dynamics. However, it is important to note that, even if two variables are needed to replace a single one, the dynamical evolution of the system is retrieved with our methodology. This correct representation of the evolution might ultimately be the most important (e.g., for accurate and reliable forecasting).
The proposed methodology relies on a strong assumption: the linear approximation of the dynamical system is global (i.e., fixed for the whole observation period). A perspective is to use adaptive approximations of the model using local linear regressions. This strategy is computationally more expensive, because a linear regression is adjusted at each time step, but shows some improvements in chaotic systems (see Platzer et al., 2021). In this context of an adaptive linear dynamical model, the proposed methodology could be easily plugged into an ensemble Kalman procedure based on analog forecasts (Lguensat et al., 2017).
As stated in the introduction, in a lot of problems in geophysics, model equations are not available or are difficult to manipulate (e.g., primitive equations), but time series of partial observations exist. The proposed method is promising to reconstruct a consistent set of variables when remote, complex dependencies exist (e.g., mean-eddy flow interactions, as discussed in Chen et al., 2014) or when the impact of unobserved, small scales is unknown (e.g., turbulent closure, as discussed in Zanna and Bolton, 2020).
In this context, the dynamics of atmospheric and oceanographic systems will be investigated in the future. The next step will be to test the proposed methodology on concrete problems and to see if the retrieved hidden components correspond to realistic unmeasurable quantities that could drive the dynamics of those systems.
Author contributions. Pierre Tandeo wrote the article. Pierre Tandeo and Pierre Ailliot developed the algorithm. Florian Sévellec and Pierre