This work is distributed under the Creative Commons Attribution 4.0 License.
Representation learning with unconditional denoising diffusion models for dynamical systems
Tobias Sebastian Finn
Lucas Disson
Alban Farchi
Marc Bocquet
Charlotte Durand
We propose denoising diffusion models for data-driven representation learning of dynamical systems. In this type of generative deep learning, a neural network is trained to denoise and reverse a diffusion process, where Gaussian noise is added to states from the attractor of a dynamical system. Iteratively applied, the neural network can then map samples from isotropic Gaussian noise to the state distribution. We showcase the potential of such neural networks in proof-of-concept experiments with the Lorenz 1963 system. Trained for state generation, the neural network can produce samples that are almost indistinguishable from those on the attractor. The model has thereby learned an internal representation of the system, applicable to tasks other than state generation. As a first task, we fine-tune the pretrained neural network for surrogate modelling by retraining its last layer and keeping the remaining network as a fixed feature extractor. In these low-dimensional settings, such fine-tuned models perform similarly to deep neural networks trained from scratch. As a second task, we apply the pretrained model to generate an ensemble out of a deterministic run. Diffusing the run, and then iteratively applying the neural network, conditions the state generation, which allows us to sample from the attractor in the run's neighbouring region. To control the resulting ensemble spread and Gaussianity, we tune the diffusion time and, thus, the sampled portion of the attractor. While easier to tune, this proposed ensemble sampler can outperform tuned static covariances in ensemble optimal interpolation. Therefore, these two applications show that denoising diffusion models are a promising way towards representation learning for dynamical systems.
The ultimate goal of generative modelling is to generate samples from the distribution that has generated the given training samples. Given this goal, we can train deep neural networks (NNs) for the unconditional generation of states from the attractor of a dynamical system. Yet, their further use beyond generating states remains unclear. Here, we reason that they learn an internal representation of the attractor. Instantiating denoising diffusion models (DDMs) for the Lorenz 1963 system (Lorenz, 1963), we use the learned representation in downstream tasks, namely surrogate modelling and ensemble generation.
DDMs are trained to imitate the process of generating samples from the attractor of a dynamical system (Y. Song et al., 2021), as depicted in Fig. 1a. During training, the available state samples are diffused by a predefined Gaussian diffusion kernel, and the NN is trained to denoise the diffused samples (Sohl-Dickstein et al., 2015; Ho et al., 2020). After training, we can iteratively apply the so-trained NN to map samples from a normal distribution to samples as if drawn from the attractor. This generation process is expected to be successful if there is an invariant state distribution on the system's attractor, which exists for ergodic chaotic dynamics. Akin to integrating a stochastic differential equation in (pseudo) time, the NN defines the integrated dynamical system.
Using this correspondence between dynamical systems and DDMs, we can replace the drift in the diffusion process by the integration of a real dynamical system (Holzschuh et al., 2023). The denoising process can then invert and integrate the system backward in physical time. Furthermore, DDMs can emulate fluid flows as simulated by computational fluid dynamics (Yang and Sommer, 2023) and estimate a spatio-temporal prediction of such flows (Cachay et al., 2023). DDMs are additionally connected to the Schrödinger Bridge and can be designed to map between arbitrary probability distributions (Bortoli et al., 2021; Chen et al., 2023). Notably, the Schrödinger Bridge has already been exploited for discrete- and continuous-time data assimilation (Reich, 2019).
Generative modelling is a special case of self-supervised learning with the task of generating new state samples. Typically used for pretraining and representation learning, one of the promises of unsupervised and self-supervised learning is to learn deep NNs from large, heterogeneous datasets without the explicit need for supervision. Such methods allow the use of NNs with millions of parameters for specific geoscientific problems (Hoffmann and Lessig, 2023; Nguyen et al., 2023), where often not enough labelled data are available to train deep NNs from scratch. Since training deep generative models remains difficult, generative training is less often used for pretraining and representation learning of high-dimensional systems than other methods like contrastive learning (e.g. Chen et al., 2020). DDMs offer a method for stable generative training and can generate high-quality samples (Dhariwal and Nichol, 2021; Nichol and Dhariwal, 2021). Hence, they have the potential to pave the way towards representation learning with generative models for high-dimensional systems.
DDMs are directly linked to denoising autoencoders (Vincent et al., 2008, 2010). These autoencoders train a NN to predict cleaned state samples from noised ones; the NN must learn relevant features about the state distribution itself. These features are then usable in tasks different from denoising (Bengio et al., 2013; Alain and Bengio, 2014). The idea of reconstructing from corrupted data is furthermore the leading paradigm in pretraining large language models (Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019) and has recently also been used for high-dimensional image data with masked autoencoders (He et al., 2021). Based on these ideas, DDMs that are trained to generate images can extract useful features for downstream tasks (e.g. Baranchuk et al., 2022; Zhang et al., 2022; Xiang et al., 2023). Concurrently to this study, Mittal et al. (2023) and Yang and Wang (2023) propose to directly use DDMs for representation learning from images. However, to our knowledge, we are the first to introduce these models for representation learning from dynamical systems.
In our first downstream task, we follow along these lines and apply the denoising NN as a feature extractor for surrogate modelling, as schematically shown in Fig. 1b. Initially pretrained to generate states, the NN is fine-tuned by replacing its last layer with a linear regression or a shallow NN. This way, we achieve a performance similar to that of deep neural networks trained from scratch.
In our second downstream task, we apply DDMs to generate state ensembles. Ensemble forecasting is one of the cornerstones of the recent advances in numerical weather prediction and data assimilation (Bauer et al., 2015), yet it is much more expensive than running a deterministic forecast. Ensemble optimal interpolation approaches (Evensen, 2003; Oke et al., 2002) lower the computational costs by applying climatological ensembles to assimilate observations into a deterministic forecast. The ensemble can be either directly drawn from the climatology or constructed by analogue methods (Lguensat et al., 2017; Tandeo et al., 2023). Another method to generate an ensemble for data assimilation would be to make use of the knowledge about the system's error propagation, in the form of singular vectors (Molteni et al., 1996) or bred vectors (Toth and Kalnay, 1993), similarly to how ensemble weather forecasts (Buizza et al., 2005) or subseasonal forecasts (Demaeyer et al., 2022) are initialized. Alternatively, we can generate the ensemble members from the latent space of a variational autoencoder (VAE; Grooms, 2021; Yang and Grooms, 2021; Grooms et al., 2023).
In fact, DDMs are a type of (hierarchical) VAE with an analytically known encoding distribution (Kingma et al., 2021; Luo, 2022). Thus, the latent space of DDMs is similar to the latent space of VAEs, and image data can be interpolated in and reconstructed from this latent space (J. Song et al., 2021). Mapping through this latent space by diffusing and denoising, we can perform imagetoimage translation without the need for paired data (Meng et al., 2022).
Instead of image-to-image translation, we can also generate an ensemble out of a deterministic run with a DDM. We partially diffuse the run for a predefined amount of time. The mapping from state sample to diffused sample is inherently stochastic, and we can generate many diffused ensemble members. Afterwards, we iteratively apply the denoising NN to map the diffused ensemble back into state space. Varying the amount of time, we have control over the uncertainty in the generated ensemble. We apply this ensemble sampling method in ensemble optimal interpolation to update a deterministic forecast. We demonstrate that so-generated ensemble members can outperform members drawn from tuned climatological covariances for ensemble optimal interpolation.
In concurrent work from Rozet and Louppe (2023), the state generation with DDMs is guided towards observations. However, their prior distribution is defined by the climatology of their DDM, unconditional from any forecast run. To condition the denoising diffusion model on forecasts, Bao et al. (2023) retrain the network for each state update step. This retraining increases the computational costs of the data assimilation scheme and needs many ensemble samples. By contrast, we condition the ensemble generation on a deterministic run using partial diffusion without the need to retrain the model.
We elucidate the theory of denoising diffusion models in Sect. 2, where we additionally elaborate on different options for sampling and parameterizations of the NN output. In Sect. 3, we introduce our two methods to use the learned internal representation. There, we illustrate how the denoising NN is applied as a feature extractor and how DDMs can generate an ensemble out of a deterministic run. Our experiments with the Lorenz 1963 system are described and analysed in Sect. 4. We summarize this work and discuss its broader impacts in Sect. 5, and we briefly conclude in Sect. 6.
Our goal is to generate state samples x as drawn from the attractor of a dynamical system. The distribution of states on the attractor is described by p_{data}(x). This state distribution is unknown, and, instead, we rely on k existing samples x_{1:k}.
To generate state samples, we train deep neural networks (NNs) as denoising diffusion models (DDMs). Their general idea for training is to progressively add noise to the training samples in a Gaussian diffusion process. This introduces a pseudotime and results in noised samples z_{τ} at a given step τ. The NN f_{θ}(z_{τ},τ) with its parameters θ is trained to reverse the diffusion process and to denoise the samples for a single step. One denoising step can be described as follows:
here defined as a Gaussian distribution with mean μ_{θ}(z_{τ},τ) as a function of the NN output and Σ_{τ} as the pseudotimedependent covariance matrix. We will further discuss the parameterization of the NN output in Sect. 2.4.
After the NN is trained, we can start to sample from a known prior distribution p(z_{T}) and iteratively apply the NN for T steps to denoise these samples towards the state space. This iterative sampling scheme results in the trajectory z_{0:T} with its joint distribution,
In the following, we define the algorithm step by step by briefly explaining the diffusion process in Sect. 2.1 and the training of the denoising network in Sect. 2.2. We introduce two different sampling schemes in Sect. 2.3 and different parameterizations of the NN output in Sect. 2.4.
2.1 Gaussian diffusion process
The diffusion process is defined in terms of intermediate latent (noised) states z_{τ} with $\tau \in [0, T]$ as discrete pseudotime steps. As these latent states are noised state samples, they still lie in state space, and we define z_{0}=x.
The diffusion process progressively adds small Gaussian noise, $\epsilon_{\tau} \sim \mathcal{N}(0, \sigma_{\tau}^{2}\mathbf{I})$, to the states, where σ_{τ} describes the amplitude of the added noise at pseudotime τ. Since the noise accumulates, the variance of the states would increase with pseudotime. Instead, we use here a variance-preserving formulation, where the signal is progressively replaced by noise. The signal magnitude decreases in pseudotime, $1 \ge \alpha_{\tau-1} > \alpha_{\tau} \ge 0$ with $\sigma_{\tau} = \sqrt{1-\alpha_{\tau}}$, such that the variance remains the same for all pseudotime steps if the state samples are normalized. The function that defines the signal magnitude as a function of the pseudotime step is called the noise scheduler. To simplify the derivation in the following, we assume a given noise scheduling; the definitions of the two noise schedulers used are shown in Appendix C, and we refer to Sect. 3.2 of Nichol and Dhariwal (2021) for a more detailed discussion.
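As an illustration of such a noise scheduler, consider the cosine schedule of Nichol and Dhariwal (2021); this is a hypothetical stand-in for the two schedulers defined in Appendix C, with α_τ as the signal magnitude:

```python
import math

def cosine_signal_magnitude(tau: int, T: int = 1000, s: float = 0.008) -> float:
    # Cosine scheduler (Nichol and Dhariwal, 2021): the signal magnitude
    # alpha_tau decreases monotonically from 1 at tau = 0 towards 0 at
    # tau = T, while sigma_tau = sqrt(1 - alpha_tau) grows accordingly.
    def f(t: float) -> float:
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(tau) / f(0)
```

Any monotonically decreasing schedule with $\alpha_0 \approx 1$ and $\alpha_T \approx 0$ fits the variance-preserving formulation above.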
The transition of the latent state z_{τ} from pseudotime τ−1 to pseudotime τ with $\tau \ge 1$ is then given as
with the relative signal magnitude $\alpha_{\tau}^{\prime} = \frac{\alpha_{\tau}}{\alpha_{\tau-1}}$. Using the additive property of Gaussian distributions, the distribution of the latent state z_{τ} at step τ can be directly defined given a state sample x and a signal magnitude α_{τ},
Setting the signal magnitude at the last step near zero, α_{T}≈0, the signal vanishes, and the latent states converge towards a standard normal distribution, which then also defines our prior distribution for the diffusion process,
Given the transition distribution from Eq. (3), the joint distribution of the trajectory for the diffusion process forward in pseudotime reads
The NN is then trained to reverse a single step of this trajectory, such that we can start from the prior distribution, Eq. (5), and generate the trajectory without needing access to the state sample x, as we will see in the following section.
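The forward process can be sketched in a few lines by sampling from q(z_τ|x) in one shot, assuming normalized states; the helper name is ours:

```python
import numpy as np

def diffuse(x, alpha_tau, rng):
    # Draw z_tau ~ q(z_tau | x) = N(sqrt(alpha_tau) x, (1 - alpha_tau) I):
    # the signal is scaled by sqrt(alpha_tau), and noise of amplitude
    # sqrt(1 - alpha_tau) is added, so the total variance is preserved.
    eps = rng.standard_normal(np.shape(x))
    return np.sqrt(alpha_tau) * x + np.sqrt(1.0 - alpha_tau) * eps
```

At α_τ = 1 the sample is returned unchanged; at α_τ ≈ 0 the output is almost pure noise, matching the prior distribution.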
2.2 Training procedure
During training, we sample a latent state from the trajectory by drawing a state x∼p_{data}(x); noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$; and a pseudotime τ, which specifies the signal magnitude α_{τ}. Making use of the reparameterization property of Gaussian distributions, we can write the latent state drawn from its distribution q(z_{τ}∣x) as
To describe the analytical denoising step from τ to τ−1, we use Bayes' theorem given the definition of the diffusion process, as defined in Eq. (3) and Eq. (4), and a known state sample x,
Note that, during generation, the state x is unknown, and we have to approximate Eq. (8b) to generate data, as we discuss in the following.
The diffusion and the denoising process are defined over several signal magnitudes. We train one NN for all signal magnitudes and use the pseudotime as an additional input into the NN. Here, we parameterize the NN to predict the drawn noise, $\widehat{\epsilon}_{\theta}(z_{\tau}, \tau) = f_{\theta}(z_{\tau}, \tau)$, based on the current latent state and pseudotime. We introduce other parameterizations in Sect. 2.4.
To approximate the analytical denoising step from Eq. (8a) with the NN, we have to specify the unknown state x by the NN output. Using the predicted noise, the latent state can be directly projected onto the state using Tweedie's formula (Efron, 2011),
Replacing the state by this prediction in the mean function, Eq. (8b), we have completely specified the denoising distribution with predicted quantities,
The goal is to make the approximation, Eq. (10), as close as possible to the analytical denoising step from Eq. (8a) using the Kullback–Leibler divergence between the approximation and the analytical step,
By definition, the covariances of the approximated and analytical denoising steps match, and the Kullback–Leibler divergence, Eq. (11), reduces to a mean-squared error loss between the predicted noise and the randomly drawn noise,
with the weighting factor $w(\tau) = \frac{\alpha_{\tau}}{1-\alpha_{\tau}}$ as the signal-to-noise ratio. In practice, this weighting factor is neglected (Ho et al., 2020), which leads to a simplified loss function.
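The state prediction via Tweedie's formula, Eq. (9), simply inverts the forward diffusion given the noise prediction; a minimal sketch (the helper name is ours):

```python
import numpy as np

def predict_state(z_tau, eps_hat, alpha_tau):
    # Tweedie's formula: solve z = sqrt(a) x + sqrt(1 - a) eps for x,
    # substituting the network's noise prediction eps_hat for eps.
    return (z_tau - np.sqrt(1.0 - alpha_tau) * eps_hat) / np.sqrt(alpha_tau)
```

If the noise prediction were exact, the clean state would be recovered exactly; in practice, the projection error shrinks as the NN improves.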
The NN is trained for all pseudotime steps to achieve its optimal parameters θ^{⋆}. For a single training step, we minimize the expectation of the simplified loss function of Eq. (12),
where 𝒰(1,T) is a uniform distribution with 1 and T as bounds. Equation (12) can be derived from the so-called evidence lower bound (Kingma et al., 2021; Luo et al., 2023), and, with Eq. (13), we optimize a weighted lower bound to the unknown distribution p_{data}(x) of the states on the attractor. Consequently, we can expect that the better the prediction of the NN, the nearer the generated state samples are to the attractor of the dynamical system.
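A single stochastic estimate of this training objective can be sketched as follows; `net` is a placeholder for the denoising network f_θ, and `alphas` for a given noise schedule:

```python
import numpy as np

def training_loss(net, x_batch, alphas, rng):
    # One stochastic estimate of the simplified loss: draw tau ~ U(1, T),
    # diffuse the clean states via the reparameterization trick, and take
    # the mean-squared error between the drawn and the predicted noise
    # (the weighting factor w(tau) is set to 1, as in Ho et al., 2020).
    T = len(alphas) - 1
    tau = int(rng.integers(1, T + 1))
    eps = rng.standard_normal(x_batch.shape)
    z_tau = np.sqrt(alphas[tau]) * x_batch + np.sqrt(1.0 - alphas[tau]) * eps
    return np.mean((net(z_tau, tau) - eps) ** 2)
```

In practice, this loss would be minimized over many batches with a gradient-based optimizer.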
2.3 Sampling from the denoising process
After training the NN, we can use it to sample from the denoised state trajectory distribution p_{θ}(z_{0:T}), defined in Eq. (2). To do so, we start by sampling from the prior distribution $z_{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and then sample each subsequent denoising step from Eq. (10),
This sampling process is inherently stochastic and called the denoising diffusion probabilistic model (DDPM; Ho et al., 2020) in the following.
To reduce the magnitude of randomness during sampling, the sampling process can be made deterministic (J. Song et al., 2021; Y. Song et al., 2021). This deterministic sampling scheme is called the denoising diffusion implicit model (DDIM; J. Song et al., 2021), and its only source of randomness is the sampling from the prior distribution. The marginal distribution of the generated state samples remains the same, and the model can still be trained with the same loss function as defined in Eq. (13).
In DDIMs, the noise ϵ_{τ} drawn from a Gaussian distribution is replaced by the noise $\widehat{\epsilon}_{\theta}(z_{\tau}, \tau)$ predicted by the NN, which is also used to predict the state $\widehat{x}_{\theta}(z_{\tau}, \tau)$. We can introduce an additional factor $\eta \in [0, 1]$, which determines the randomness in the sampling process. Given this factor, we can sample from the denoising steps as follows:
The factor interpolates between purely deterministic sampling with DDIMs, η=0, and fully stochastic sampling, η=1. Sampling with Eq. (14) has an even larger randomness than η=1. Note that sampling with η=1 and sampling with Eq. (14), introduced earlier, are both DDPMs. For simplicity, we refer to Eq. (14) as DDPM, whereas we call sampling with η=1 a stochastic DDIM scheme. Throughout the paper, we mostly sample with Eq. (14), whereas we also perform experiments with DDIM schemes.
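One reverse step with the interpolation factor η can be sketched as follows, rewriting the standard DDIM update of J. Song et al. (2021) in this section's cumulative signal-magnitude notation; the variance term is the usual choice and may differ in detail from the scheme used in the paper:

```python
import numpy as np

def denoise_step(z_tau, eps_hat, a_tau, a_prev, eta, rng):
    # One reverse step from pseudotime tau to tau - 1, interpolating
    # between deterministic DDIM (eta = 0) and stochastic sampling
    # (eta = 1). a_tau and a_prev are the signal magnitudes at the two
    # pseudotime steps; eps_hat is the network's noise prediction.
    x_hat = (z_tau - np.sqrt(1.0 - a_tau) * eps_hat) / np.sqrt(a_tau)
    sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_tau)
                          * (1.0 - a_tau / a_prev))
    mean = np.sqrt(a_prev) * x_hat + np.sqrt(1.0 - a_prev - sigma**2) * eps_hat
    return mean + sigma * rng.standard_normal(np.shape(z_tau))
```

With η = 0 and an exact noise prediction, the step recovers the less-diffused latent state exactly, which is the deterministic limit described above.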
2.4 Output parameterizations
Usually, the output of the NN is parameterized as a prediction $\widehat{\epsilon}_{\theta}(z_{\tau}, \tau) = f_{\theta}(z_{\tau}, \tau)$ of the noise (Ho et al., 2020). Here, we introduce two additional parameterizations and discuss their advantages and disadvantages. In our implementation, a different parameterization also changes the loss function for the NN. The change in the loss function then modifies the implied weighting of the Kullback–Leibler divergence in Eq. (12), as shown in Fig. 2a.
In Eq. (9), we have defined the predicted state as a function of the predicted noise. Instead, we can also directly predict the state, $\widehat{x}_{\theta}(z_{\tau}, \tau) = f_{\theta}(z_{\tau}, \tau)$. With this parameterization, we minimize the mean-squared error of the predicted state to the true state during training, which gives a constant weighting of the Kullback–Leibler divergence, shown as a red curve in Fig. 2a.
The NN is trained to split the signal and noise from a given latent state. There, we could directly predict either the state (signal) or the noise. Nonetheless, we can alternatively define a combination of both as the target (Salimans and Ho, 2022),
Predicting $\widehat{v}(x, \epsilon, \tau) = f_{\theta}(z_{\tau}, \tau)$ and minimizing the mean-squared error between the prediction and the true v interpolates the weighting between noise (τ=0) and state (τ=1000) prediction, shown as a black curve in Fig. 2a.
Since the state is needed in the denoising step, Eq. (8a), predicting the state is a straightforward parameterization for training and applying the NN. However, the distribution of the state might be non-Gaussian and multimodal, as shown in Fig. 2b, such that the Gaussian approximation for the loss function could be inadequate. By contrast, the noise is drawn from a Gaussian distribution, and the mean-squared error is, statistically speaking, the correct loss function. Additionally, Ho et al. (2020) have shown that predicting the noise leads to better results than directly predicting the state.
However, predicting the noise can be highly unstable for low signal-to-noise ratios (Salimans and Ho, 2022); the v prediction weights the loss function differently and circumvents these instabilities. As the target is a combination of state and noise, the target distribution shifts from non-Gaussian state prediction for small signal magnitudes to Gaussian noise prediction for large signal magnitudes (Fig. 2b). Consequently, the use of the v prediction can be advantageous for the training stability and the loss function, which may improve the performance of the DDMs.
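Assuming the common convention $v = \sqrt{\alpha_\tau}\,\epsilon - \sqrt{1-\alpha_\tau}\,x$ from Salimans and Ho (2022) (the paper's exact sign convention may differ), the target and the state recovery can be sketched as:

```python
import numpy as np

def v_target(x, eps, alpha_tau):
    # v-parameterization target: a rotation between noise and state that
    # is noise-like at alpha ~ 1 and state-like at alpha ~ 0.
    return np.sqrt(alpha_tau) * eps - np.sqrt(1.0 - alpha_tau) * x

def state_from_v(z_tau, v, alpha_tau):
    # Recover the state from the latent state and v; no division by
    # sqrt(alpha) is needed, avoiding the instability of the noise
    # parameterization at low signal-to-noise ratios.
    return np.sqrt(alpha_tau) * z_tau - np.sqrt(1.0 - alpha_tau) * v
```

The identity $x = \sqrt{\alpha_\tau} z_\tau - \sqrt{1-\alpha_\tau} v$ follows by expanding $z_\tau = \sqrt{\alpha_\tau} x + \sqrt{1-\alpha_\tau}\epsilon$ in the second function.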
In this paper, we train DDMs to generate states that should lie on the attractor of the dynamical system. As training data, we integrate the equations that define the dynamical system to produce a long state trajectory. Using each state of the trajectory at discrete time as training sample, the DDM is trained to unconditionally generate state samples. We reason that the DDM must have learned an internal representation of the attractor. In Appendix F, we analyse the extracted features of the denoising NN and show that this representation is entangled; we need all extracted features to extract information about the attractor. In the following, we explain two different approaches on how the unconditional DDM can be used for downstream tasks other than pure state generation.
First, in Sect. 3.1, we demonstrate the use of the denoising NN for transfer learning; we finetune it for surrogate modelling. Secondly, in Sect. 3.2, we generate a state ensemble from a deterministic forecast run with the DDM.
3.1 Transfer learning from the denoising neural network
As schematically shown in Fig. 3, our general idea for transfer learning with the NN is to remove its last layer. This last layer combines the extracted features ϕ(z_{τ},τ) at a specific pseudotime step τ, through the weights W and the bias β, into the NN output $f_{\theta}(z_{\tau}, \tau) = \mathbf{W}^{\top}\phi(z_{\tau}, \tau) + \beta$.
Since the noised states of the DDM remain in state space, the network can be easily applied to cleaned states, instead of working with noised states. Keeping the pseudotime step fixed, we can extract features ϕ(x_{t}) from a given state x_{t} at a physical time step t with the NN by removing its last layer. For the task of surrogate modelling, we regress these extracted features to the next state x_{t+1}, one time step later. As the tuning parameter for the feature extractor, we can select the pseudotime step and concatenate features at multiple pseudotime steps.
The increasing noise magnitude with pseudotime forces the network to extract fine features for small pseudotime steps and coarse features for large pseudotime steps. We visualize this in Fig. 4a, where we project the activation of an arbitrary neuron onto the x–z plane of the Lorenz 1963 model. The model extracts different features at different pseudotime steps, even for smaller pseudotime steps like τ=0 and τ=200. Nevertheless, the NN extracts more complex features for such smaller pseudotime steps, whereas the extracted features are more linearly separable for larger pseudotime steps.
To test the linearity of the extracted features, we fit a linear regression from state space x_{t} to extracted feature space ϕ(x_{t},τ), such that the following relation should approximately hold:
The larger the pseudotime step, the smaller the error of the linear regression (Fig. 4b), and the better the features can be linearly predicted from the state space. The network extracts the most nonlinear features for an intermediate pseudotime step around τ=200. These results confirm the visual results in Fig. 4a.
The ordinary differential equations of the Lorenz 1963 system include secondorder polynomial terms, and, integrated in time, the influence of the nonlinearities in the system increases with lead time. To learn a surrogate model for multiple integration time steps, we need to extract nonlinear features. Consequently, we can expect that if the features are more nonlinear, they can be better used for surrogate modelling.
To test this hypothesis, we fit a linear regression from feature space to the dynamics of the model after 10 integration time steps, Δt=0.1 MTU (model time units), such that the following relation should approximately hold:
The smaller its error, the more useful the features for surrogate modelling are.
As the dynamics are nonlinear, we also need nonlinear features for surrogate modelling. Consequently, the larger the pseudotime step, the less the predictions can explain the dynamics of the system and the larger the regression error, as can be seen in Fig. 4c. The features most linearly linked to the dynamics are around an intermediate pseudotime step τ=400. As different features at different pseudotime steps are extracted, we propose to extract features at multiple pseudotime steps between τ=0 and τ=600. These features are concatenated as predictors in a linear regression or small NN for surrogate modelling.
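The proposed feature concatenation and last-layer replacement can be sketched as follows; `extractor` is a stand-in for the pretrained denoising network without its last layer, and the helper names are ours:

```python
import numpy as np

def concat_features(extractor, states, taus=(0, 200, 400, 600)):
    # Concatenate features phi(x, tau) extracted at several pseudotime
    # steps into one predictor vector per state.
    return np.concatenate([extractor(states, tau) for tau in taus], axis=-1)

def fit_linear_surrogate(features, targets):
    # Replace the removed last layer by a least-squares linear regression
    # (with an intercept column) onto the state one forecast step ahead.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef
```

A shallow NN could replace the linear regression in the same way, with the concatenated features as its input.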
3.2 Ensemble generation by diffusing and denoising a deterministic forecast run
Besides the feature space of the denoising NN, the latent space also encodes information useful for tasks other than the one the network was trained on (e.g. J. Song et al., 2021). In a second approach, we use the latent space to generate a state ensemble from a deterministic forecast run. This approach resembles SDEdit (Meng et al., 2022), which guides the editing of images with DDMs.
Our idea is to partially diffuse a deterministic forecast until a given signal magnitude α_{τ} is reached and to reconstruct an ensemble out of the latent space. The diffusion process from state to latent state is intrinsically stochastic and, thus, a one-to-many mapping. Taking samples in the latent space, we reconstruct an ensemble by iteratively applying the denoising network for the same number of pseudotime steps as used to diffuse the deterministic forecast, as schematically shown in Fig. 5a.
The denoising network is statedependent, which also makes the DDM for ensemble generation statedependent, as shown in Fig. 5b. Trained for state generation only, the DDM has never seen any timedependent relationships between samples. Consequently, the relationship between samples is purely induced by the climatology, and the state dependency hardly translates into a flow dependency.
The denoising process is trained to generate states on the attractor of the dynamical system. The chosen pseudotime step consequently controls the sampled portion of the attractor. Because of the state dependency, the resulting distribution is implicitly represented by the ensemble and could extend beyond a Gaussian assumption. We formalize the ensemble generation and the implicit distribution representation in Appendix B, showing its connection to a Bayesian framework.
The larger the pseudotime step, the smaller the signal magnitude, and the more diffused the deterministic run is, which controls the degree of uncertainty in the ensemble. For a very small pseudotime step with a signal magnitude near 1, α_{τ}≈1, almost no noise would be added, and we would end up with a very small ensemble spread. For a large pseudotime step with a signal magnitude near zero, α_{τ}≈0, almost all data would be replaced by noise in the latent state, and the generated ensemble would correspond to a climatological ensemble. In general, the choice of the pseudotime step plays a role similar to that of the covariance inflation factor in an ensemble data assimilation system.
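The diffuse-then-denoise ensemble generation can be sketched as follows, with `denoise_fn` standing in for one reverse step of the trained DDM and `tau_diff` as the tunable diffusion pseudotime:

```python
import numpy as np

def generate_ensemble(forecast, denoise_fn, alphas, tau_diff, n_ens, rng):
    # Partially diffuse a deterministic forecast up to pseudotime tau_diff,
    # drawing independent noise per member, then denoise every member back
    # for the same number of pseudotime steps. The choice of tau_diff
    # controls the ensemble spread.
    a = alphas[tau_diff]
    noise = rng.standard_normal((n_ens,) + np.shape(forecast))
    members = np.sqrt(a) * forecast + np.sqrt(1.0 - a) * noise
    for tau in range(tau_diff, 0, -1):
        members = denoise_fn(members, tau)
    return members
```

With `tau_diff = 0` the ensemble collapses onto the forecast; with `tau_diff = T` it approaches a climatological ensemble, mirroring the two limits described above.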
We test this ensemble generation approach in data assimilation with an ensemble Kalman filter. In fact, this methodology is an ensemble optimal interpolation approach (EnOI; Evensen, 2003; Oke et al., 2002). Instead of specifying an explicit covariance or providing states drawn from a climatology, the samples generated with the DDM implicitly represent the prior distribution for the data assimilation.
We showcase the potential of DDMs for representation learning in geoscientific systems with three different types of experiments. In the state generation experiments (Sect. 4.1), we establish the methodology of DDMs. We test different settings for the denoising network and compare the results to the best practices in computer vision for image generation. Afterwards, two downstream applications are built around the best-performing denoising network. In the transfer learning experiments (Sect. 4.2), we use the pretrained denoising network as a feature extractor for surrogate modelling of the Lorenz 1963 system; see also Sect. 3.1. In the ensemble generation experiments (Sect. 4.3), the DDM is combined with ensemble optimal interpolation to assimilate observations into a deterministic forecast. Using these data assimilation experiments, we can assess how well the DDM can generate an ensemble out of deterministic forecasts; see also Sect. 3.2.
We perform all experiments with the Lorenz 1963 model (Lorenz, 1963). Its dynamical system has three variables, x, y, and z, and is defined by the following set of ordinary differential equations, where we use the standard parameters $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$:
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \sigma (y - x), \qquad \frac{\mathrm{d}y}{\mathrm{d}t} = x(\rho - z) - y, \qquad \frac{\mathrm{d}z}{\mathrm{d}t} = xy - \beta z.$$
The chosen parameters induce a chaotic behaviour with an error doubling time of 0.78 MTU (model time units). We integrate the dynamical system with a fourthorder Runge–Kutta integrator and an integration time step of 0.01 MTU.
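The integration setup can be reproduced with the standard Lorenz 1963 tendencies and a fourth-order Runge–Kutta scheme:

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Lorenz (1963) tendencies with the standard parameters.
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(state, dt=0.01):
    # One fourth-order Runge-Kutta step with the 0.01 MTU time step used here.
    k1 = lorenz63(state)
    k2 = lorenz63(state + 0.5 * dt * k1)
    k3 = lorenz63(state + 0.5 * dt * k2)
    k4 = lorenz63(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

Repeatedly applying `rk4_step` from a perturbed initial state yields the trajectories from which the training, validation, and testing samples are drawn.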
We base our experiments on an ensemble of 33 trajectories (16 for training, 1 for validation, and 16 for testing), initialized with random states sampled from 𝒩(0,(0.001)^{2}I). The first 1×10^{5} integration steps are omitted as spin-up time. We integrate the system for an additional 1×10^{6} steps to generate the states needed for training, validation, and testing. This way, we generate 1.6×10^{7} samples for training, 1×10^{6} for validation, and 1.6×10^{7} for testing. This large number of samples allows us to test settings without being constrained by data. Before training, the data are normalized by the mean and standard deviation estimated from the training dataset. The code is developed in Python (Van Rossum, 1995), using PyTorch (Paszke et al., 2019) and PyTorch Lightning (Falcon et al., 2020), and is publicly available at https://github.com/cereadaml/ddmattractor (last access: 17 September 2024, Finn, 2023).
4.1 State generation
As the denoising network, we use a ResNet-like architecture (He et al., 2015) with fully connected layers; for more information, see Appendix D1. To condition the network on the pseudotime step, we encode the discrete pseudotime ($\tau\in [0,1000]$) by a sinusoidal encoding (Vaswani et al., 2017). The encoded pseudotime modulates, via a linear function, the scale and shift parameters of the layer normalizations in the residual layers. In total, the denoising network has 1.2×10^{6} parameters, a very large number of parameters for the Lorenz 1963 system. However, we are in a training data regime with a very large number of samples, allowing us to carry out the state generation experiments without worrying about overfitting of the network.
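The pseudotime conditioning can be illustrated with a small sketch; the encoding follows the standard sinusoidal form of Vaswani et al. (2017), while the modulation function, dimensions, and names are simplified assumptions rather than the actual PyTorch implementation:

```python
import numpy as np

def sinusoidal_encoding(tau, dim=64, max_period=10_000):
    """Encode a discrete pseudotime step tau into a dim-dimensional vector
    of sines and cosines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = tau * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def modulate(h, encoding, w_scale, w_shift):
    """Layer-normalized features h are scaled and shifted by linear
    functions of the pseudotime encoding (FiLM-style conditioning)."""
    scale = encoding @ w_scale
    shift = encoding @ w_shift
    return h * (1.0 + scale) + shift
```

In the actual network, `w_scale` and `w_shift` would be learned jointly with the residual layers.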
The networks are trained with the Adam optimizer (Kingma and Ba, 2017; $\gamma=3\times 10^{-4}$) for 100 epochs with a batch size of 16 384. To reduce the amount of randomness in the results, each experiment is performed 10 times with different seeds, which randomize the initial weights of the neural network, the order of the samples within one epoch, and the noise added to the samples during training.
Unless otherwise specified, we sample 1×10^{6} states with the DDPM scheme, defined in Eq. (14), using the T=1000 pseudotime steps for which the networks are trained. These generated states are compared to the testing samples using five different metrics; for exact definitions, see also Appendix E.
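As an illustration of the sampling procedure, the following sketches a standard DDPM ancestral sampling loop for a network that predicts the noise; `eps_model` and the cumulative schedule array `alpha_bar` are assumed names, and the exact form of Eq. (14) may differ:

```python
import numpy as np

def ddpm_sample(eps_model, alpha_bar, rng, shape):
    """Ancestral DDPM sampling: start from isotropic Gaussian noise and
    iteratively denoise. alpha_bar[tau] is the cumulative signal magnitude,
    with alpha_bar[0] close to 1 and alpha_bar[T] close to 0."""
    T = len(alpha_bar) - 1
    z = rng.standard_normal(shape)
    for tau in range(T, 0, -1):
        a_t, a_s = alpha_bar[tau], alpha_bar[tau - 1]
        beta = 1.0 - a_t / a_s  # per-step noise variance
        eps = eps_model(z, tau)
        # Posterior mean of the reverse transition for a noise-predicting net.
        mean = (z - beta / np.sqrt(1.0 - a_t) * eps) / np.sqrt(1.0 - beta)
        noise = rng.standard_normal(shape) if tau > 1 else 0.0
        z = mean + np.sqrt(beta) * noise
    return z
```

With a trained denoiser, the returned states would be samples from the learned state distribution; here the loop is only shown structurally.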
We compare how close the generated states are to the attractor using the Hellinger distance H between the generated state distribution and the testing state distribution (Arnold et al., 2013; Gagne II et al., 2020); to estimate the distance, we discretize the state into cubes (Scher and Messori, 2019). The Hellinger distance is bounded, $0\le H\le 1$, with H=0 if the two distributions coincide and H=1 if they do not overlap.
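A minimal estimator of this discretized Hellinger distance could look as follows; the number of cubes per dimension and the binning bounds are illustrative choices, not the values used in our experiments:

```python
import numpy as np

def hellinger_distance(samples_gen, samples_test, n_bins=16, bounds=None):
    """Hellinger distance between two empirical PDFs, estimated on a
    regular grid of cubes discretizing the three-dimensional state."""
    if bounds is None:
        # Common bounds covering both sample sets.
        lo = np.minimum(samples_gen.min(0), samples_test.min(0))
        hi = np.maximum(samples_gen.max(0), samples_test.max(0))
        bounds = list(zip(lo, hi))
    p, _ = np.histogramdd(samples_gen, bins=n_bins, range=bounds)
    q, _ = np.histogramdd(samples_test, bins=n_bins, range=bounds)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

Identical distributions give H=0, fully disjoint ones give H=1, matching the bounds stated above.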
To measure the perceptual quality of the generated states, we adapt the Fréchet inception distance to our Lorenz 1963 settings, in the following called the Fréchet surrogate distance (FSD). Replacing the inception network, we estimate the Fréchet distance in the feature space spanned by a dense neural network with two hidden layers, trained for surrogate modelling. The smaller the distance, the better the statistics of the generated state distribution match those of the testing state distribution in feature space. Heusel et al. (2017) have shown that the Fréchet inception distance is consistent with human judgement on disturbed image data.
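The Fréchet distance between two Gaussians fitted in feature space can be computed as follows; the feature sets would come from the pretrained surrogate NN, which is not included in this sketch:

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feat_gen, feat_test):
    """Frechet distance between Gaussians fitted to two feature sets,
    as in the FID: ||mu1-mu2||^2 + tr(S1 + S2 - 2 (S1^0.5 S2 S1^0.5)^0.5)."""
    mu1, mu2 = feat_gen.mean(0), feat_test.mean(0)
    s1 = np.cov(feat_gen, rowvar=False)
    s2 = np.cov(feat_test, rowvar=False)
    s1_half = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```

For the FSD, `feat_gen` and `feat_test` would be the hidden activations of the surrogate network for generated and testing states, respectively.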
We additionally compute the squared nearest-neighbour distance between the generated states and the testing samples, either as the expectation over the generated states $\overline{d}_{\mathrm{gen}}$ or as the expectation over the testing states $\overline{d}_{\mathrm{test}}$. To evaluate rare events, we use a peak-over-threshold (POT) metric. We use the 1st and 99th percentiles of the testing dataset such that, on average, 2 % of the generated samples should lie below the lower threshold and above the upper threshold, respectively.
4.1.1 Results
In our simplified formulation, changing the parameterization of the neural network output changes the loss function and the weighting of the Kullback–Leibler divergence, as explained in Sect. 2.4. The weighting is additionally influenced by the chosen noise scheduling, here either a linear scheduler or a cosine scheduler (Nichol and Dhariwal, 2021), both defined in Appendix C. In Table 1, we compare the output parameterizations and noise schedulers in terms of the resulting generative quality.
The noise ϵ and velocity v parameterizations result in the best scores. Additionally, a cosine noise scheduler improves almost all scores compared to a linear scheduler. During training (not shown), we have experienced that the velocity v parameterization is more stable and converges faster than the noise ϵ parameterization. These results confirm findings from image generation, where the velocity v parameterization has been introduced to stabilize the training of the neural network (Salimans and Ho, 2022). Hence, we recommend a velocity v parameterization and a cosine noise scheduler as the default combination for training DDMs. All following results are consequently derived based on this configuration (also marked in bold in Table 1).
Analysing the scores for the peak-over-threshold (POT) metric, the DDMs are slightly underdispersive. However, the better the model, the better the coverage. Furthermore, the generated states visually cover the testing samples in terms of two-dimensional projections, as shown in Fig. 6a–f. Comparing the one-dimensional marginal empirical probability density functions in Fig. 6g–i, the generated samples are almost indistinguishable from the true samples, even in their extreme values. Taking Table 1 and Fig. 6 into account, DDMs can generate state samples very similar to those drawn from the attractor of the dynamical system.
Since the denoising neural network must be iteratively applied, generating samples with DDMs can be slow, especially for highdimensional states. Trained with 1000 pseudotime steps, DDMs can generate samples by skipping steps to speed up the generation process. We evaluate the effect of fewer generation steps in Table 2, where the generation quality is measured in terms of Hellinger distance. As the DDIM sampling scheme has been introduced for data generation with fewer steps (J. Song et al., 2021), we additionally evaluate the impact of the additional noise during sampling.
For all sampling schemes, the quality of the generated samples improves with the number of pseudotime steps. However, the improvements between 100 and 1000 pseudotime steps are small compared to the differences between output parameterizations and noise schedulers. With a smaller computational budget, a DDM can thus still generate data of acceptable quality in fewer pseudotime steps.
Similar to the results found for image generation (J. Song et al., 2021), reducing the added noise during generation of the state samples can improve the quality of the generated samples when fewer pseudotime steps are used than trained for. The deterministic DDIM with η=0.0 is the best-performing sampling scheme for fewer than 50 pseudotime steps. However, when the full 1000 pseudotime steps are used for data generation, almost no differences remain between the sampling schemes.
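The DDIM update with step skipping can be sketched as follows; again, `eps_model` and `alpha_bar` are assumed names, and η (`eta`) interpolates between the deterministic (η=0) and stochastic samplers:

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, taus, rng, shape, eta=0.0):
    """DDIM sampling over a descending subsequence of pseudotime steps
    `taus`, ending at 0; eta=0 gives the deterministic sampler."""
    z = rng.standard_normal(shape)
    for t, s in zip(taus[:-1], taus[1:]):
        a_t, a_s = alpha_bar[t], alpha_bar[s]
        eps = eps_model(z, t)
        # Predicted clean state from the current latent and predicted noise.
        x0 = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Noise level of the transition; zero for the deterministic case.
        sigma = eta * np.sqrt((1.0 - a_s) / (1.0 - a_t) * (1.0 - a_t / a_s))
        dir_term = np.sqrt(np.clip(1.0 - a_s - sigma ** 2, 0.0, None)) * eps
        noise = sigma * rng.standard_normal(shape) if s > 0 else 0.0
        z = np.sqrt(a_s) * x0 + dir_term + noise
    return z
```

Skipping steps simply means choosing a short subsequence, e.g. `taus=[1000, 900, ..., 0]`, instead of all trained pseudotime steps.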
4.2 Surrogate modelling
In this next step, we fine-tune a DDM as a feature extractor for surrogate modelling. As a reminder, the model is pretrained with the velocity v parameterization and cosine noise scheduling. Based on the initial state x_{t} at time t, we want to predict the state x_{t+Δt} for a lead time Δt=0.1 MTU, a mildly nonlinear setting (Bocquet, 2011). We parameterize the surrogate modelling function $\tilde{\mathcal{M}}(x_{t})$ as an additive model, where the statistical model g_{θ}(x_{t}) with its parameters θ represents the residual,

$\tilde{\mathcal{M}}(x_{t}) = x_{t} + g_{\theta}(x_{t}).$
In our transfer learning experiments, the statistical model works on top of features extracted by the pretrained network ϕ(z_{τ},τ), as explained in Sect. 3.1. We fix the pseudotime in the feature extractor and concatenate features from different pseudotime steps. We have three different pseudotime step settings: a single step τ=[400], two pseudotime steps τ=[50,400], or six steps τ=[0,50,100,200,400,600]. On top of the feature extractor, we train either a linear regression,

$g_{\theta}(\phi) = \mathbf{W}^{\top}\phi + \beta,$

or a shallow neural network,

$g_{\theta}(\phi) = \mathbf{W}_{1}^{\top}\,\mathrm{relu}(\mathbf{W}^{\top}\phi + \beta) + \beta_{1},$

with the transposed weights W^{⊤} and ${\mathbf{W}}_{1}^{\top}$ and biases β and β_{1}. The shallow neural network always has 256 hidden features.
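The transfer learning setup, concatenated features plus a ridge-regressed residual, can be sketched as follows; the frozen feature extractor `phi` is only assumed here, and the NumPy closed form replaces the actual PyTorch implementation:

```python
import numpy as np

def extract_features(phi, x, taus):
    """Concatenate features from the frozen denoising network phi at
    fixed pseudotime steps; phi is assumed to map (state, tau) -> features."""
    return np.concatenate([phi(x, tau) for tau in taus], axis=-1)

def fit_ridge(features, increments, lam=1e-4):
    """Closed-form L2-regularized least squares for the residual g_theta."""
    f = np.concatenate([features, np.ones((len(features), 1))], axis=1)  # bias
    gram = f.T @ f + lam * np.eye(f.shape[1])
    return np.linalg.solve(gram, f.T @ increments)

def surrogate_step(x, phi, taus, weights):
    """Additive surrogate model: x_{t+dt} = x_t + g_theta(features(x_t))."""
    feats = extract_features(phi, x, taus)
    f = np.concatenate([feats, np.ones((len(x), 1))], axis=1)
    return x + f @ weights
```

The ridge term also illustrates how the L_2 regularization tames the collinearity between features extracted at multiple pseudotime steps.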
We compare the transfer-learned surrogate models to random Fourier features (RFFs; Rahimi and Recht, 2007) and neural networks trained from scratch. In the case of RFFs, we replace the pretrained feature extractor with either 256 or 1536 random Fourier features that approximate a Gaussian kernel, as specified in Eq. (1) of Sutherland and Schneider (2015). These features can be seen as a non-recurrent instantiation of a random feature extractor, often used for machine learning in dynamical systems (e.g. Vlachas et al., 2020; Arcomano et al., 2020). For the neural networks, we use two different architectures: a simple architecture, where we stack $m\in [1,2,3]$ fully connected layers with 256 neurons and rectified linear unit (relu) activation functions in between, or a ResNet-like architecture with three residual blocks, as similarly specified in Appendix D1.
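The RFF baseline can be sketched as follows; this is the standard construction of Rahimi and Recht (2007) for a Gaussian kernel, with an illustrative lengthscale rather than a tuned value:

```python
import numpy as np

def random_fourier_features(x, n_features, lengthscale=1.0, rng=None):
    """RFF approximation of a Gaussian kernel:
    z(x) = sqrt(2/D) * cos(W x + b), with W ~ N(0, 1/lengthscale^2)
    and b ~ U(0, 2*pi), such that z(x).z(y) ~ exp(-||x-y||^2 / (2 l^2))."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[-1]
    w = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)
```

The kernel approximation improves with the number of features D, which is why the RFF baselines need many features to perform well.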
For all experiments, we optimize the statistical model for the mean-squared error (MSE) between the true increment $\mathrm{\Delta}x_{t+\mathrm{\Delta}t}=x_{t+\mathrm{\Delta}t}-x_{t}$ and the output of the neural network. The linear regression models are analytically estimated as an L_{2}-regularized least-squares solution, also called ridge regression. The L_{2} regularization parameter is held constant ($\lambda=1\times 10^{-4}$) to simplify the training and the comparison between experiments. All neural network models are trained with the Adam optimizer ($\gamma=5\times 10^{-5}$) for 100 epochs with a batch size of 16 384. We refrain from learning rate scheduling or early stopping for ease of comparison.
To evaluate the surrogate models and their stability for lead times longer than Δt=0.1 MTU, we use the forecast as initial conditions for the next iteration, iterating k times to cover a lead time of k⋅Δt. These trajectories are compared to the trajectories in the testing dataset in terms of the root-mean-squared error, normalized by the state climatology of the training dataset. Additionally, to estimate the quality of the resulting climatology, we compare the Hellinger distance between the predictions from 5 to 10 MTU and the testing dataset. We estimate the Hellinger distance as in Sect. 4.1.
4.2.1 Results
In Table 3, we evaluate the transfer-learned surrogate models for lead times of Δt=0.1 MTU and Δt=1 MTU, which correspond to 1 and 10 iterations, respectively, as well as the Hellinger distance H. Additionally, we compare these models to other surrogate models learned from scratch.
All transfer-learned models have a predictive power that reaches beyond one model time unit. Untrained feature extractors cannot reach the performance of the pretrained models, showing the added value of pretraining as DDMs. Furthermore, the transfer-learned models can outperform random Fourier feature (RFF) networks and perform similarly to NNs trained from scratch.
Extracting features at multiple pseudotime steps can strengthen transfer learning, and such models can improve on transfer-learned models with only τ=400 as the pseudotime step for the feature extraction. The advantage of using multiple time steps is especially evident for models with a linear regression, while only small differences exist for a shallow NN with 256 hidden neurons after the feature extraction. On the one hand, the shallow NN can nonlinearly combine the extracted features, which seems to help in the case of a single time step. On the other hand, an increasing number of extracted features results in an increased collinearity between features and a more unstable training, as shown by the increased standard deviation in the Transfer (6×τ, NN) experiment. The L_{2} regularization of the linear regression mitigates the feature collinearity, such that the Transfer (6×τ, linear) model performs better than the Transfer (τ=400, linear) model. Consequently, features from multiple time steps increase the performance in the linear regression case, whereas fine-tuning with a neural network can lead to better results if the collinearity has been taken care of, for example, by restricting the number of pseudotime steps.
Feature extraction with random Fourier features (RFFs) needs a high number of features to perform well, even for the three-dimensional Lorenz 1963 system. Hence, the transfer-learned models with features from a single pseudotime step outperform RFFs for the same number of features, here 256. While a higher number of RFFs can initially outperform transfer-learned models, their performance and stability for longer lead times heavily depend on the drawn random weights. In contrast, the transfer-learned model is stable for all tested integration time steps, as also shown in Fig. 7. The transfer-learned models converge towards a climatological forecast, as can also be seen in the performance of the Transfer (untrained, 6×τ, linear) model, whereas RFF-based models diverge. Additionally, as the needed number of RFFs scales with the data dimensionality, transfer learning can be preferable for problems of higher dimensionality than the Lorenz 1963 system. Transfer learning can outperform RFFs, especially with regard to the long-term stability of the model.
The best transfer-learned model (2×τ, NN) performs on par with the best NN trained from scratch (ResNet), while the transfer-learned models with a shallow NN can outperform other NNs learned from scratch for shorter lead times. Note that the NNs are trained with a fixed learning rate, and the results for the neural networks trained from scratch may indicate convergence issues. Since the models are trained for lead times of 0.1 MTU without autoregressive steps, their performance for longer lead times is affected by randomness, as shown by the spread between different seeds in Fig. 7. Compared to this spread, the transfer-learned models with a shallow NN and the NNs trained from scratch perform similarly.
To see if a better generative score translates into better surrogate models, we compare the generative Hellinger distance to the RMSE of the surrogate model, transfer-learned with a linear regression, for different output parameterizations in Table 4. In general, the ordering of the output parameterizations remains roughly the same for surrogate modelling as for data generation: the state parameterization performs worst, whereas the noise and velocity parameterizations perform similarly to each other. For the noise and velocity parameterizations, pretraining with cosine noise scheduling performs better than with linear scheduling.
The differences caused by different random seeds are smaller than the differences between parameterizations and noise schedulers; experiments with a cosine noise scheduler generally result in lower standard deviations between seeds. Therefore, the better the generative model, the better the NN can be fine-tuned towards surrogate modelling. Pretraining with a velocity parameterization hereby results in the best surrogate modelling performance.
4.3 Data assimilation with a generated ensemble
We test how the latent states in the DDM can be used for ensemble generation in a data assimilation setup. We define the long trajectories from the testing dataset as our truth ${\mathit{x}}_{\mathrm{1}:{t}_{\mathrm{end}}}^{\mathrm{t}}$, going from time t=1 to time t=t_{end}. In our experiments, the observation operator is given as the identity matrix H=I; all three states are observed. The observations ${\mathit{y}}_{t}^{\mathrm{o}}$ at time t are perturbed with white noise drawn from a Gaussian distribution with a prespecified observation standard deviation σ^{o},

$y_{t}^{\mathrm{o}} = \mathbf{H}x_{t}^{\mathrm{t}} + \epsilon_{t}^{\mathrm{o}},\quad \epsilon_{t}^{\mathrm{o}}\sim \mathcal{N}(\mathbf{0},(\sigma^{\mathrm{o}})^{2}\mathbf{I}).$
Unless otherwise specified, we set the observation standard deviation to σ^{o}=2 and the time interval between observations to Δt=0.1 MTU.
As the data assimilation algorithm, we use an ensemble transform Kalman filter (ETKF; Bishop et al., 2001; Hunt et al., 2007). In cases where the ensemble is externally generated, we modify the ETKF as proposed by Schraff et al. (2016) to update the deterministic background forecast.
In the DDM experiment, based on a deterministic run, we draw 50 ensemble members with the DDM as proposed in Sect. 3.2. We use the pretrained neural network with a velocity output parameterization, a cosine noise schedule, and a single random seed (seed=0) to generate the ensemble. The denoising diffusion model is additionally defined by its sampling scheme, the maximum number of pseudotime steps T until the prior distribution is reached, and the signal magnitude α_{τ} of the partial diffusion. We sample in the denoising process with a DDPM scheme and set the maximum number of pseudotime steps to T=100, reducing the computational cost. The only free parameter of the ensemble generation is consequently the signal magnitude α_{τ}, which we tune for each experiment independently.
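The ensemble generation by partial diffusion and denoising can be sketched as follows; the DDPM-style denoising loop assumes a noise-predicting network (`eps_model`), whereas our actual model uses the velocity parameterization:

```python
import numpy as np

def generate_ensemble(x_det, eps_model, alpha_bar, tau_start, n_ens, rng):
    """Generate an ensemble around a deterministic state by partially
    diffusing it to pseudotime tau_start (signal magnitude
    alpha_bar[tau_start]) and then denoising with the DDPM scheme."""
    a = alpha_bar[tau_start]
    # Partial diffusion: z_tau = sqrt(alpha_bar) x + sqrt(1-alpha_bar) eps.
    z = np.sqrt(a) * x_det \
        + np.sqrt(1.0 - a) * rng.standard_normal((n_ens,) + x_det.shape)
    # Iterative denoising back to pseudotime 0.
    for tau in range(tau_start, 0, -1):
        a_t, a_s = alpha_bar[tau], alpha_bar[tau - 1]
        beta = 1.0 - a_t / a_s
        eps = eps_model(z, tau)
        mean = (z - beta / np.sqrt(1.0 - a_t) * eps) / np.sqrt(1.0 - beta)
        noise = rng.standard_normal(z.shape) if tau > 1 else 0.0
        z = mean + np.sqrt(beta) * noise
    return z
```

The single tuning parameter is `tau_start`, or equivalently the signal magnitude α_{τ}: the smaller α_{τ}, the larger the sampled portion of the attractor and the larger the ensemble spread.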
We compare the proposed ensemble generation methodology to a full ETKF, ensemble optimal interpolation (EnOI), and 3DVar. The full ETKF is the reference and includes flow-dependent covariances, a feature missing in the proposed ensemble generation method. The EnOI experiments define a baseline with static covariance matrices. As we have to generate an external ensemble in the EnOI experiments, we induce sampling errors in the data assimilation. The 3DVar experiments use the same covariances as the EnOI but analytically solve the Kalman filter equation without sampling.
We run the full ETKF with 3 or 12 ensemble members and an optimally tuned multiplicative prior covariance inflation. Here, the ensemble mean specifies the deterministic forecast. The ETKF estimates one first-guess covariance matrix ${\mathbf{P}}_{t}^{\mathrm{b}}$ per forecast. To define the background covariances in our EnOI experiments, we use the first-guess covariance matrices from the 12-member ETKF run with an update time delta of Δt=0.1 MTU, averaged over the full trajectory with S steps and inflated by a tuning factor α,

$\mathbf{B} = \frac{\alpha}{S}\sum_{t=1}^{S}\mathbf{P}_{t}^{\mathrm{b}}.$
The full covariance matrix is specified in Appendix D2. Besides a full covariance, we also specify a diagonal covariance, an often-used approximation in EnOI. With the background covariances defined this way, we draw 50 ensemble members per update step (for a fair comparison with the DDM experiments), which are centred around the deterministic run and then used to update the deterministic run.
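The EnOI-like update of the deterministic run with an externally generated ensemble can be sketched as follows for the identity observation operator used here; the explicit matrix inverse is only acceptable in this low-dimensional setting:

```python
import numpy as np

def enoi_update(x_det, ensemble, y_obs, obs_std):
    """EnOI-like analysis: background covariance from an externally
    generated ensemble, used to update the deterministic run (H = I)."""
    n_ens = len(ensemble)
    anomalies = (ensemble - ensemble.mean(0)) / np.sqrt(n_ens - 1)
    pb = anomalies.T @ anomalies                # background covariance
    r = obs_std ** 2 * np.eye(len(y_obs))       # observation covariance
    gain = pb @ np.linalg.inv(pb + r)           # Kalman gain for H = I
    return x_det + gain @ (y_obs - x_det)
```

With the DDM sampler, `ensemble` would be the members generated by partial diffusion and denoising; with static covariances, it would be drawn from the climatological matrix B.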
We initialize all experiments with states randomly drawn from the climatology of the testing dataset. Each experiment has 55 000 update cycles, where we omit the first 5000 updates as the burn-in phase. We repeat each experiment 16 times with different random seeds. This way, each batch of experiments has 8×10^{5} analyses. The parameters for each batch of experiments (the signal amplitude α_{τ} for the partial diffusion in the DDPM scheme, the prior covariance inflation ρ for the ETKF, and the covariance inflation α for the EnOI) are tuned to give the lowest time-averaged analysis RMSE, averaged over the 16 repetitions.
We compare the analyses to the true trajectories in terms of RMSE at analysis time, normalized with respect to the climatology. Additionally, we run forecasts based on the analyses and compare them to the true trajectories to see the impact on longer lead times.
4.3.1 Results
We compare the analysis and background normalized RMSE to a tuned ensemble transform Kalman filter (ETKF), ensemble optimal interpolation (EnOI), and 3DVar in Table 5. The time between updates is Δt=0.1 MTU, a mildly nonlinear case (Bocquet, 2011).
The ensemble members generated by a DDM can be used to assimilate observations into a deterministic forecast with an EnOI-like scheme. Data assimilation with the DDM results in a slight improvement compared to EnOI with full covariances and matches 3DVar; it hereby surpasses the performance of EnOI with a diagonal covariance, as often employed in EnOI. As the DDM has just one tuning parameter, the signal magnitude α_{τ}, this type of ensemble generation can provide a simplified way towards ensemble data assimilation, needing less tuning than EnOI.
The DDM is state-dependent, which can explain the small advantage compared to EnOI with the same number of ensemble members. However, as an ensemble Kalman filter additionally provides flow dependency, it performs much better than EnOI, 3DVar, and the DDM. Consequently, to match the performance of the ensemble Kalman filter with the DDM, we would need to incorporate flow information into the DDM. Nevertheless, pretrained for state generation, DDMs can be a cheap alternative for generating an ensemble for data assimilation purposes.
In Fig. 8, we show the influence of the signal magnitude on the background RMSE and the spread of the generated ensemble.
The larger the pseudotime step, the smaller the signal magnitude, and the more the deterministic run is diffused during the partial diffusion, which controls the degree of uncertainty in the ensemble. For a very small pseudotime step with a signal magnitude near 1, α_{τ}≈1, almost no noise is added, and we end up with a (too) small ensemble spread. For a large pseudotime step with a signal magnitude near 0, α_{τ}≈0, almost all information is replaced by noise in the latent state; the generated ensemble then corresponds to a climatological ensemble with a (too) large ensemble spread. Similar to an ensemble data assimilation system, the lowest RMSE is reached when the ensemble spread roughly matches the RMSE. Consequently, the choice of the pseudotime step plays a role similar to the covariance inflation factor in an ensemble data assimilation system.
Until now, we have shown results for a single time delta between two update times with Δt=0.1 MTU. In Fig. 9, we show results for varying this time delta for the ETKF, the EnOI with static covariances, and EnOI with the DDM sampler; all experiments are tuned for each time delta independently.
Increasing the time between data assimilation updates increases the nonlinearity of the system. For all tested time deltas, EnOI with a DDM performs slightly better in the analysis RMSE than EnOI with static covariances. Although the performance of the ETKF remains unreachable for any update time, the gain of DDMs compared to static covariances increases with increasing nonlinearity of the system.
This gain is a result of the non-Gaussian distribution of the generated ensemble members for heavily nonlinear state propagation, as can be seen in Fig. 10. The latent state obtained after the diffusion process is purely Gaussian-distributed by definition; see also Eq. (4). However, the iterative denoising process is state-dependent, which can result in non-Gaussian ensemble distributions. The larger the pseudotime step for the ensemble generation, the larger the sampled portion of the attractor and the more non-Gaussian the distribution in the denoising process can become. Therefore, tuning the pseudotime step of the DDM allows us to tune not only the ensemble spread but also the sampled portion of the attractor.
In this study, we investigate unconditional denoising diffusion models (DDMs) for representation learning in dynamical systems. We train such models on the task of state generation in the Lorenz 1963 model. Using a large dataset of states and deep residual neural networks (NNs), we test settings that are almost unconstrained by the dataset size or the NN capacity. In these settings, the DDM can generate states that are almost indistinguishable from states drawn from the model's attractor. Our results for state generation correspond to those found for image generation. Predicting the noise performs better than directly predicting the state, as similarly found by Ho et al. (2020). Across all tested settings, cosine noise scheduling is superior to linear noise scheduling (Nichol and Dhariwal, 2021). Furthermore, we obtain the most stable results when predicting a velocity v, as discussed by Salimans and Ho (2022). For few generation steps, deterministic sampling with denoising diffusion implicit models (DDIM) outperforms its stochastic counterparts (J. Song et al., 2021). In general, results from image generation and improvements therein seem transferable to state generation for dynamical systems.
We can approximate the state distribution of the Lorenz 1963 system by state quantization and estimation of a three-dimensional empirical probability density function (PDF). We can consequently evaluate generative models by comparing the empirical PDF of states generated with the model to that of states drawn from the attractor. However, the estimation of a multivariate empirical PDF is infeasible for higher-dimensional systems.
To circumvent such problems, we adapt the Fréchet inception distance (FID) to geoscientific settings. This distance compares the generated states to the real states with the Fréchet distance in the feature space of a pretrained NN. We replace the commonly used inception network (Szegedy et al., 2014) with a NN pretrained for surrogate modelling in the Lorenz 1963 system. The ordering of different generative parameterizations in this adapted metric is very similar to the ordering found by estimating the Hellinger distance between the empirical PDFs. An adapted FID can thus be a good metric for evaluating generative models in high-dimensional geoscientific settings. One of the remaining challenges with this method is the choice of an appropriate pretrained NN to estimate the Fréchet distance.
At first glance, unconditional DDMs, trained for state generation, have a smaller application range than their conditional counterparts. Here, we fine-tune the unconditional denoising NN for surrogate modelling and apply the full unconditional DDM for ensemble generation.
By removing its last layer, we can use the denoising NN as a feature extractor for surrogate modelling. Our results indicate that the NN learns general features about the dynamical system when pretrained for state generation. The extracted features depend on the pseudotime step, with more complex features for smaller steps. Consequently, by combining features from different pseudotime steps, we can use more information from the feature extractor.
Although the DDM has never seen information about the temporal dynamics of the dynamical system, we can fine-tune the denoising NN for surrogate modelling. By regressing features from a single pseudotime step, the fine-tuned network performs better than random Fourier features with the same number of extracted features. This suggests that, for higher-dimensional problems, pretrained denoising NNs may perform much better than random Fourier features as feature extractors.
By training a shallow NN on top of the extracted features, the fine-tuned network achieves scores similar to deep NNs. As Lorenz 1963 is a low-dimensional system, where NNs can almost perfectly predict the temporal dynamics, a NN trained from scratch can be difficult to beat. However, for high-dimensional systems, where we might only have a few training samples, training a deep NN from scratch might be infeasible. To pretrain a DDM, we can use large, heterogeneous datasets and then fine-tune the NN on small, problem-specific datasets. Our encouraging results in low-dimensional settings indicate this potential for transfer learning of DDMs and their use as pretrained feature extractors.
Besides surrogate modelling, we apply the DDM for ensemble generation in data assimilation. By diffusing and denoising, we generate an ensemble out of a deterministic run. The ensemble can define the prior distribution for an ensemble optimal interpolation (EnOI) scheme to assimilate observations into a deterministic forecast. Such data assimilation with a DDM as the sampler performs at least as well as EnOI with static but tuned covariances. The ensemble generated with the DDM inherits the state dependency of the denoising NN. As a result, the more nonlinear the system, the larger the gain of the DDM sampling can become compared to static covariances in EnOI.
Data assimilation with a propagated ensemble profits from its state and flow dependency. Since the DDM is only trained for state generation, its attractor is only defined in state space. The time dimension and, hence, one of the requirements for flow dependency are missing. Consequently, the performance of a tuned ensemble Kalman filter remains unreachable for EnOI with a DDM sampler. To make the sampler flow-dependent, we must incorporate the time dimension. Instead of generating states that are static in time, the DDM could learn to generate short trajectories. By partially diffusing the forecast trajectory, we could inform the sampler about the temporal development and, thus, generate a flow-dependent ensemble.
The proposed ensemble sampling could additionally augment an ensemble with additional ensemble members. This augmentation would be similar to the use of a climatologically augmented ensemble (Kretschmer et al., 2015), with the climatological ensemble members replaced by ensemble members generated with the DDM. This can additionally be seen as ensemble data assimilation with a hybrid covariance matrix (Hamill and Snyder, 2000; Lorenc, 2003), where the covariance is a weighted average between the original ensemble covariance and the covariance of the generated members. On the one hand, the original ensemble members would bring flow dependency into the ensemble. On the other hand, augmenting the ensemble with generated members could be a way to reduce the need for inflation and localization in an ensemble data assimilation system.
The application of pretrained unconditional DDMs for surrogate modelling and ensemble generation indicates their potential for geoscientific problems. Trained to sample from the attractor, the model learns an internal representation that is then applicable in downstream tasks. The combination of DDMs with data assimilation could additionally be a way to learn such deep generative models by combining observations with a geophysical model. Using such a combination, DDMs could possibly learn a representation of the true Earth system's attractor. This representation might then be helpful for large-scale applications like model error corrections (e.g. Bonavita and Laloyaux, 2020; Farchi et al., 2021; Chen et al., 2022; Finn et al., 2023) or digital twins (e.g. Bauer et al., 2021a, b; Latif, 2022; Li et al., 2023b).
In this study, we use the three-dimensional Lorenz 1963 system exclusively, and it remains unknown whether the proposed methods are applicable to higher-dimensional systems. However, DDMs have demonstrated scalability to global scales, such as in weather prediction (Price et al., 2024) and the generation of new ensemble members from existing ones (Li et al., 2023a). Furthermore, convolutional NNs (e.g. Dhariwal and Nichol, 2021; Rombach et al., 2022) and transformers (Peebles and Xie, 2023) are commonly applied for image generation with DDMs. This suggests that the proposed methods might also be effective in higher-dimensional contexts and for other NN architectures. Therefore, based on our results and the demonstrated scalability of DDMs, we see potential for using representation learning with DDMs in higher-dimensional geophysical systems.
In this paper, we investigate the capabilities of denoising diffusion models for representation learning in dynamical systems. Based on our results with the Lorenz 1963 model, we conclude the following:

Denoising diffusion models can be trained to generate states on the attractor of the dynamical system. Using a large training dataset and a residual neural network, the generated states are almost indistinguishable from states drawn from the true attractor. To achieve stable training for dynamical systems, we recommend denoising diffusion models with a velocity v output parameterization and a cosine noise scheduler. Similar to results for image generation, the deterministic DDIM sampling scheme works best for few pseudotime steps.

Denoising diffusion models can be fine-tuned for downstream tasks by applying the denoising neural network as a feature extractor and retraining its last layer. The features extracted by the denoising network depend on the pseudo-time step used, with more complex features for smaller steps. By combining features from different pseudo-time steps, we can strengthen the feature extractor for downstream tasks. A better-performing generative model can thereby also achieve better scores in downstream tasks.

Pre-trained as denoising diffusion models for state generation, neural networks can be transfer-learned for surrogate modelling. Their performance in these low-dimensional settings is similar to that of deep neural networks trained from scratch. Training neural networks as denoising diffusion models therefore has potential for the large-scale pre-training of deep neural networks for geoscientific problems.

The pre-trained denoising diffusion model can be applied to generate an ensemble out of a deterministic run. By partial diffusion and subsequent denoising with the neural network, we can sample from the attractor in the run's surrounding region. As a tuning parameter, we can choose the number of diffusion steps, which controls the sampled portion of the attractor and the resulting ensemble spread. Since the denoising network is trained for static state generation, the generated ensemble is state-dependent but lacks flow dependency. To introduce such a flow dependency, the denoising diffusion model would also have to be trained with time-dependent states, for example, by training it to generate trajectories.

The ensemble generated with a pre-trained denoising diffusion model can define the prior distribution for ensemble optimal interpolation to assimilate observations into a deterministic forecast. Data assimilation with this sampler can outperform ensemble optimal interpolation with tuned climatological covariances. The more nonlinear the dynamical system, the larger the gain of sampling from a denoising diffusion model compared to using static covariances.
The diffusion process progressively replaces the signal with noise, and the denoising NN is trained to invert this process; applied iteratively, it can generate a signal out of pure noise. In our case, we train the NN to generate states that should lie on the attractor of a dynamical system. The process of generating a signal out of pure noise resembles the spin-up procedure often used for dynamical systems.
To spin up dynamical systems, random fields are generated that lie in the basin of attraction of the dynamical system. If these randomly sampled states are integrated with the dynamical system for enough steps, all states from the basin of attraction converge to the attractor of the dynamical system.
In fact, the diffusion process corresponds to an Itô stochastic differential equation (SDE; Y. Song et al., 2021), integrated in pseudo-time,
$$\begin{array}{}\text{(A1)}& \mathrm{d}\mathit{z}=\mathrm{m}(\mathit{z},\mathit{\tau})\,\mathrm{d}\mathit{\tau}+\mathrm{g}(\mathit{\tau})\,\mathrm{d}\mathit{w},\end{array}$$
where m(z,τ) is the drift coefficient, g(τ) is the diffusion coefficient, dτ is an infinitesimally small pseudo-time step, and dw defines a Wiener (Brownian) process. We can use ancestral sampling to integrate this SDE, going from the state distribution p(x)=p(z_{0}) at τ=0 to the prior distribution p(z_{T}), Eq. (5), at τ=T. Defining $\mathrm{m}(\mathit{z},\mathit{\tau})=-\frac{\mathrm{1}}{\mathrm{2}}\mathit{\beta}\left(\mathit{\tau}\right)\mathit{z}$ and $\mathrm{g}\left(\mathit{\tau}\right)=\sqrt{\mathit{\beta}\left(\mathit{\tau}\right)}$ with noise scales β(τ), the variance-preserving diffusion process is recovered (Y. Song et al., 2021).
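As a minimal sketch of this forward process (not the implementation used in our experiments), the variance-preserving SDE can be integrated with a simple Euler–Maruyama scheme; the noise-scale function β(τ) rising from 0.1 to 20 is an illustrative choice borrowed from the variance-preserving setup of Y. Song et al. (2021):

```python
import numpy as np

def diffuse_forward(x0, beta, n_steps, rng):
    """Euler-Maruyama integration of the variance-preserving forward SDE,
    dz = -0.5 * beta(tau) * z dtau + sqrt(beta(tau)) dw, on pseudo-time [0, 1]."""
    dt = 1.0 / n_steps
    z = np.array(x0, dtype=float)
    for k in range(n_steps):
        b = beta(k * dt)
        # deterministic shrinkage of the signal plus one Wiener increment
        z = z - 0.5 * b * z * dt + np.sqrt(b * dt) * rng.standard_normal(z.shape)
    return z
```

Integrated long enough, any initial state is driven towards the isotropic Gaussian prior, mirroring how attractor states are diffused into noise.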
Inverting the process, we start at pseudo-time τ=T, with pure noise drawn from the prior distribution p(z_{T}), and move towards the target state distribution p(z_{0}). The inverse of a diffusion process is again a diffusion process, resulting in the reverse-time SDE,
$$\begin{array}{}\text{(A2)}& \mathrm{d}\mathit{z}=\left[\mathrm{m}(\mathit{z},\mathit{\tau})-\mathrm{g}(\mathit{\tau}{)}^{\mathrm{2}}\,{\mathrm{\nabla}}_{\mathit{z}}\mathrm{log}\,p\left(\mathit{z}\right)\right]\mathrm{d}\mathit{\tau}+\mathrm{g}(\mathit{\tau})\,\mathrm{d}\stackrel{\mathrm{\u0303}}{\mathit{w}},\end{array}$$
with $\mathrm{d}\stackrel{\mathrm{\u0303}}{\mathit{w}}$ as the reverse-time Wiener process. Remarkably, the reverse-time SDE, Eq. (A2), is defined by the known drift and diffusion coefficients and the so-called score function ∇_{z}log p(z), the gradient that points towards the data-generating distribution. Consequently, once the score function is known for all pseudo-time steps, we can use ancestral sampling to integrate the denoising process, Eq. (A2), and generate samples based on noise drawn from the predefined prior distribution.
Similarly to predicting the noise, the state, or an angular velocity, the score can be approximated by an NN, ${s}_{\mathit{\theta}}({\mathit{z}}_{\mathit{\tau}},\mathit{\tau})\approx {\mathrm{\nabla}}_{{\mathit{z}}_{\mathit{\tau}}}\mathrm{log}p\left({\mathit{z}}_{\mathit{\tau}}\right)$. The predicted noise is hereby proportional to the approximated score function by the relation
$$\begin{array}{}\text{(A3)}& {s}_{\mathit{\theta}}({\mathit{z}}_{\mathit{\tau}},\mathit{\tau})=-\frac{{\mathit{\epsilon}}_{\mathit{\theta}}({\mathit{z}}_{\mathit{\tau}},\mathit{\tau})}{{\mathit{\sigma}}_{\mathit{\tau}}},\end{array}$$
where the predicted noise points away from the data-generating distribution. Consequently, training the NN to predict the noise, Eq. (13), is equivalent to score matching (Hyvärinen, 2005; Vincent, 2011; Song et al., 2019).
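The noise-to-score conversion and a single reverse-time integration step can be sketched as follows; `score` stands in for the trained network s_θ, and the function names are illustrative, not taken from our code:

```python
import numpy as np

def score_from_noise(eps_pred, sigma_tau):
    """Convert a predicted noise into a score estimate: s = -eps / sigma_tau."""
    return -eps_pred / sigma_tau

def reverse_em_step(z, tau, dt, beta, score, rng):
    """One Euler-Maruyama step of the reverse-time variance-preserving SDE,
    dz = [-0.5*beta*z - beta*score(z, tau)] dtau + sqrt(beta) dw~,
    taken backwards in pseudo-time (step size -dt)."""
    b = beta(tau)
    drift = -0.5 * b * z - b * score(z, tau)
    return z + drift * (-dt) + np.sqrt(b * dt) * rng.standard_normal(np.shape(z))
```

With the analytic score of the stationary Gaussian prior, iterating this step leaves the prior invariant, which is a convenient sanity check for any sampler implementation.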
The simplest method to integrate the SDEs is an Euler–Maruyama integration with a fixed step size. This leads to procedures similar to those specified in Sect. 2. Consequently, sampling with the DDPM or DDIM scheme, as specified by Eqs. (14) and (15a), corresponds to special discretizations of the reverse-time SDE (J. Song et al., 2021; Y. Song et al., 2021), defined in Eq. (A2). However, the formulation of the diffusion process as an SDE allows us to use different integration methods with adaptive step sizes (Jolicoeur-Martineau et al., 2022; Dockhorn et al., 2022; Lu et al., 2022) such that fewer integration steps than with the DDPM scheme are needed to generate data.
The generation of states with the SDE is a sort of dynamical system, integrated in pseudo-time. The smaller the integration error, the smaller the approximation error, and the larger the number of training samples, the smaller the error between the distribution of generated states and the data-generating distribution (De Bortoli, 2022), with convergence in the limit. Therefore, if the NN is trained on samples that lie on the attractor of the dynamical system, the generated samples will also lie on this attractor in these limits.
In the Bayesian formalism of data assimilation, we want to find the posterior distribution $p({\mathit{x}}_{t}\mid {\mathit{y}}_{\mathrm{1}:t})$ of the current state x_{t} at time t based on all observations y_{1:t} up to the very same time. We can use Bayes' theorem to split the posterior distribution into a prior distribution $p({\mathit{x}}_{t}\mid {\mathit{y}}_{\mathrm{1}:t-\mathrm{1}})$ and the observation likelihood p(y_{t}∣x_{t}),
$$\begin{array}{}\text{(B1)}& p({\mathit{x}}_{t}\mid {\mathit{y}}_{\mathrm{1}:t})\propto p({\mathit{y}}_{t}\mid {\mathit{x}}_{t})\,p({\mathit{x}}_{t}\mid {\mathit{y}}_{\mathrm{1}:t-\mathrm{1}}).\end{array}$$
The influence of past observations ${\mathit{y}}_{\mathrm{1}:t-\mathrm{1}}$ on the current state is hence encoded into the prior distribution. In data assimilation, we often estimate a deterministic analysis ${\mathit{x}}_{t}^{\mathrm{a}}$ as the single best estimate of the posterior. Having access to a geophysical forecast model ℳ(⋅), we can generate a deterministic model forecast ${\mathit{x}}_{t}^{\mathrm{f}}=\mathcal{M}\left({\mathit{x}}_{t-\mathrm{1}}^{\mathrm{a}}\right)$ based on the analysis from the previous time. Using such a model forecast as a basis, we can generate an ensemble with a denoising diffusion model, as introduced in Sect. 3.2. This generated ensemble can be seen as specifying the prior distribution in Bayes' theorem (Eq. B1).
The tuning factors for the prior distribution are the number of partial diffusion steps τ and, consequently, the signal magnitude α_{τ}. We can sample from the conditional state distribution given the diffused states $p({\mathit{x}}_{t}\mid {\mathit{z}}_{\mathit{\tau},t})$ by denoising the diffused states. By partially diffusing the deterministic forecast for τ steps, sampling from the diffused distribution, and denoising with the neural network, we can thus specify the prior distribution for data assimilation.
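The partial diffusion that turns a deterministic forecast into an ensemble can be written in closed form, z_{τ} = α_{τ}x + √(1−α_{τ}²)ε, drawing one noise realization per member. A minimal sketch (the trained denoising network that would subsequently map the members back towards the attractor is omitted here):

```python
import numpy as np

def diffuse_ensemble(x_forecast, alpha_tau, n_members, rng):
    """Partially diffuse a deterministic forecast into an ensemble of noised
    states, z_tau = alpha_tau * x + sqrt(1 - alpha_tau**2) * eps per member.
    A larger tau (smaller alpha_tau) samples a larger portion of the attractor."""
    sigma_tau = np.sqrt(1.0 - alpha_tau ** 2)
    eps = rng.standard_normal((n_members,) + np.shape(x_forecast))
    return alpha_tau * np.asarray(x_forecast) + sigma_tau * eps
```

Each member would then be denoised for τ pseudo-time steps with the trained network, yielding samples in the forecast's surrounding region, with the spread growing with the number of diffusion steps.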
We test two different noise schedulers, a linear scheduler and a cosine scheduler. The resulting signal and noise amplitudes are shown in Fig. C1:

The linear scheduler linearly increases the relative noise magnitude β_{τ} (Ho et al., 2020). The relative noise magnitude then specifies the relative signal magnitude ${\mathit{\alpha}}_{\mathit{\tau}}^{\prime}=\mathrm{1}-{\mathit{\beta}}_{\mathit{\tau}}$, used in Eq. (3). We linearly increase β_{τ} from 0.0001 for τ=0 to 0.03 for $\mathit{\tau}=T=\mathrm{1000}$. The signal and noise amplitudes are shown as blue lines in Fig. C1.

The cosine scheduler defines the signal amplitude ${\mathit{\alpha}}_{\mathit{\tau}}=\mathrm{cos}\left(\frac{\mathit{\pi}}{\mathrm{2}}\,\frac{{\mathit{\tau}}^{\prime}+s}{\mathrm{1}+s}\right)$ as a shifted cosine function with shift s=0.008 and ${\mathit{\tau}}^{\prime}=\frac{\mathit{\tau}}{\mathrm{1000}}$ (Nichol and Dhariwal, 2021). This signal amplitude is directly used in Eq. (4) and shown alongside the noise amplitude as red lines in Fig. C1.
The signal amplitude of the cosine scheduler decays more slowly than that of the linear scheduler. Generation with a cosine scheduler is consequently more concentrated in a high signal-to-noise-ratio regime, which results in better state generation.
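The two schedulers can be compared numerically. This sketch assumes the common DDPM discretization in which the signal amplitude of the linear scheduler is the square root of the cumulative product of (1 − β_{s}); the exact convention may differ from our implementation:

```python
import numpy as np

def linear_alpha(tau, T=1000, beta_start=1e-4, beta_end=0.03):
    """Signal amplitude of the linear scheduler: sqrt(prod_s (1 - beta_s))."""
    betas = np.linspace(beta_start, beta_end, T)[:int(tau)]
    return float(np.sqrt(np.prod(1.0 - betas)))

def cosine_alpha(tau, T=1000, s=0.008):
    """Signal amplitude of the cosine scheduler (Nichol and Dhariwal, 2021)."""
    return float(np.cos(0.5 * np.pi * (tau / T + s) / (1.0 + s)))
```

Halfway through the diffusion (τ=500), the cosine amplitude (≈0.70) is far above the linear one (≈0.15), illustrating the slower decay of the signal.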
D1 Denoising network
As a denoising neural network, we use a fully connected residual neural network. A linear layer maps the three-dimensional state vector to 256 features. Concurrently, we apply a sinusoidal time embedding (Vaswani et al., 2017), mapping the pseudo-time information into 128 sinusoidal features with increasing wavelengths. After these initial mappings, we apply three residual blocks.
Each residual block has a branch. In the branch, the data are normalized by layer normalization. The affine transformations, applied after the normalization, are linearly conditioned on the time embedding (Perez et al., 2017). After normalizing and modulating, we apply a shallow neural network with 256 features and a rectified linear unit (ReLU) as the activation function. The output of the branch is added to the input of the residual block.
After the three residual blocks, we linearly combine the extracted features to get the output of the neural network. Depending on its parameterization (see also Sect. 2.4), the output has different meanings. In total, the denoising neural network has 1.2×10^{6} parameters.
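The time embedding and the conditioning inside a residual block can be sketched as follows; the dimensions follow the description above, while the exact frequency range of the embedding is an assumption borrowed from Vaswani et al. (2017):

```python
import numpy as np

def sinusoidal_embedding(tau, n_features=128, max_period=10000.0):
    """Map the scalar pseudo-time into sinusoidal features with increasing
    wavelengths: sin/cos pairs at geometrically spaced frequencies."""
    half = n_features // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    return np.concatenate([np.sin(tau * freqs), np.cos(tau * freqs)])

def film_layer_norm(features, scale, shift, eps=1e-6):
    """Layer normalization whose affine transformation is conditioned on the
    time embedding (FiLM; Perez et al., 2017); in the full network, scale and
    shift would be linear functions of the embedding."""
    normed = (features - features.mean()) / (features.std() + eps)
    return scale * normed + shift
```

In the full network, the modulated features would pass through the shallow 256-feature ReLU network before being added back to the block input.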
D2 Background covariance matrix
The ensemble optimal interpolation and 3D-Var experiments in Sect. 4.3 use a static background covariance matrix. The matrix is based on averaged ensemble covariances from a tuned ensemble transform Kalman filter with 12 ensemble members. Scaled by a tuning factor α, the matrix is tuned for each experiment independently. The unscaled background covariance matrix is specified in Table D1.
The scores are estimated based on two different distributions: q, the generated sample distribution, and p, the testing sample distribution. In total, we have five different metrics:

The Hellinger distance is estimated between the empirical probability density functions of q and p. For the estimation of the Hellinger distance, we discretize the three-dimensional states into k cubes (Scher and Messori, 2019) and count, for q and p, the probability that a sample lies in the ith cube. In slight abuse of notation, we denote these probabilities as q_{i} and p_{i}, respectively.
$$\begin{array}{}\text{(E1)}& H(q,p)={\displaystyle \frac{\mathrm{1}}{\sqrt{\mathrm{2}}}}\sqrt{\sum _{i=\mathrm{1}}^{k}{\left(\sqrt{{q}_{i}}-\sqrt{{p}_{i}}\right)}^{\mathrm{2}}}\end{array}$$The distance should be 0 if the distributions perfectly correspond.

The Fréchet surrogate distance measures the distance between the distributions in a feature space φ(x) of an independent neural network, trained here for surrogate modelling. For both distributions, the means, μ_{q} and μ_{p}, and covariances, Σ_{q} and Σ_{p}, in feature space are estimated. The Fréchet distance is then given as the 2-Wasserstein distance under a multivariate Gaussian assumption.
$$\begin{array}{}\text{(E2)}& \begin{array}{rl}& \mathrm{FSD}\left(\mathcal{N}\right({\mathit{\mu}}_{q},{\mathbf{\Sigma}}_{q}),\mathcal{N}({\mathit{\mu}}_{p},{\mathbf{\Sigma}}_{p}\left)\right)=\Vert {\mathit{\mu}}_{q}-{\mathit{\mu}}_{p}{\Vert}_{\mathrm{2}}^{\mathrm{2}}\\ & \phantom{\rule{1em}{0ex}}+\mathrm{Tr}\left({\mathbf{\Sigma}}_{q}+{\mathbf{\Sigma}}_{p}-\mathrm{2}({\mathbf{\Sigma}}_{q}^{{\scriptscriptstyle \frac{\mathrm{1}}{\mathrm{2}}}}{\mathbf{\Sigma}}_{p}{\mathbf{\Sigma}}_{q}^{{\scriptscriptstyle \frac{\mathrm{1}}{\mathrm{2}}}}{)}^{{\scriptscriptstyle \frac{\mathrm{1}}{\mathrm{2}}}}\right)\end{array}\end{array}$$The distance should be 0 if the distributions perfectly correspond.

The mean-squared distance from the generated samples to the nearest test sample measures the closeness of the test data to the distribution from the generated data. This metric is dependent on the number of training samples, as the implicit representation of the distribution depends on the number of samples.
$$\begin{array}{}\text{(E3)}& {\stackrel{\mathrm{\u203e}}{d}}_{\mathrm{gen}}(q,p)={\mathbb{E}}_{\mathit{x}\sim q}\left[\underset{{\mathit{z}}_{\mathrm{0}}\sim p}{min}\Vert \mathit{x}-{\mathit{z}}_{\mathrm{0}}{\Vert}_{\mathrm{2}}^{\mathrm{2}}\right]\end{array}$$The distance should be 0 if the distributions perfectly correspond.

The mean-squared distance from the test samples to the nearest generated sample measures the closeness of the generated data to the attractor of the dynamical system, represented by the test data. This metric is independent of the number of generated samples, as there is always the same number of test samples.
$$\begin{array}{}\text{(E4)}& {\stackrel{\mathrm{\u203e}}{d}}_{\mathrm{test}}(q,p)={\mathbb{E}}_{{\mathit{z}}_{\mathrm{0}}\sim p}\left[\underset{\mathit{x}\sim q}{min}\Vert \mathit{x}-{\mathit{z}}_{\mathrm{0}}{\Vert}_{\mathrm{2}}^{\mathrm{2}}\right]\end{array}$$The distance should be 0 if the distributions perfectly correspond.

As a measure for rare events, we use a peak-over-threshold and peak-under-threshold metric. We determine the 1st (p_{0.01}) and 99th (p_{0.99}) percentiles along the three dimensions of the testing data and then measure how often events below (1st) or above (99th) these percentiles occur in the generated samples.
$$\begin{array}{}\text{(E5)}& \mathrm{POT}(q,p)={\mathbb{E}}_{\mathit{x}\sim q}\left[P(\mathit{x}\le {p}_{\mathrm{0.01}})+P(\mathit{x}\ge {p}_{\mathrm{0.99}})\right]\end{array}$$If the extremes are correctly represented, the expected value should be 0.02.
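Two of these metrics can be sketched directly in code; the bin bounds and the brute-force nearest-neighbour search are illustrative choices, not our exact configuration:

```python
import numpy as np

def hellinger(q_samples, p_samples, bins=10, bounds=(-30.0, 60.0)):
    """Hellinger distance (Eq. E1) between empirical densities, obtained by
    discretizing three-dimensional states into bins**3 cubes."""
    edges = [np.linspace(bounds[0], bounds[1], bins + 1)] * 3
    hq, _ = np.histogramdd(q_samples, bins=edges)
    hp, _ = np.histogramdd(p_samples, bins=edges)
    hq, hp = hq / hq.sum(), hp / hp.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(hq) - np.sqrt(hp)) ** 2)))

def nearest_distance(q_samples, p_samples):
    """Mean-squared distance from each sample in q to its nearest sample in p
    (Eq. E3); swapping the arguments gives Eq. (E4)."""
    d2 = ((q_samples[:, None, :] - p_samples[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean())
```

For identical sample sets, both metrics vanish; for distributions concentrated in disjoint cubes, the Hellinger distance attains its maximum of 1.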
To determine if the denoising neural network can extract features about the dynamical system, we fit linear models from known polynomial features to the features of the neural network. First, we extract features from states in the training dataset with the denoising neural network. Similarly to the surrogate modelling experiments (Sect. 4.2), we use six different pseudo-time steps. Secondly, we linearly regress from known polynomial features (x, y, z, xy, xz) to the extracted features, giving us the input coefficients β_{in}. Thirdly, we linearly regress from the extracted features to the derivatives estimated with the Lorenz 1963 equations, Eqs. (19a)–(19c), which gives us the output coefficients β_{out}. Fourthly, we multiply the input coefficients β_{in} by the output coefficients β_{out}, resulting in a linear factor table β.
The polynomial features and their linear factors are well-defined by the Lorenz 1963 equations, Eqs. (19a)–(19c). Consequently, if the feature extractor has learned meaningful features about the dynamical system, the linear factor table β should recover the original factors of the model. We compare two feature extractors: an untrained extractor, where the weights are randomly initialized, and a feature extractor pre-trained as a denoising diffusion model (Table F1). The linear factors estimated with the pre-trained diffusion model fit those of the original Lorenz equations almost perfectly. By contrast, a random neural network is unable to extract such features. This indicates that pre-training denoising diffusion models for state generation can yield useful features about the dynamical system itself.
The combined linear coefficients β use all extracted features of the neural network. In Fig. F1, we show the feature with the highest correlation (r=0.78) to the x⋅y product. We can see a nonlinear dependency between the extracted feature and the product. Consequently, although the product is needed to estimate the time derivative of z, no single feature linearly represents it. The features that describe the dynamical system are therefore entangled within the representation learned by the feature extractor.
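The four-step probing procedure can be sketched with a synthetic feature extractor. Here, the "extracted features" are a random linear mixture of the true polynomial features, an idealized stand-in for the denoising network, so the chained regressions recover the Lorenz factors exactly:

```python
import numpy as np

def lorenz_tendency(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivatives of the Lorenz 1963 system for states of shape (n, 3)."""
    return np.stack([
        sigma * (x[:, 1] - x[:, 0]),
        x[:, 0] * (rho - x[:, 2]) - x[:, 1],
        x[:, 0] * x[:, 1] - beta * x[:, 2],
    ], axis=1)

def probe_factors(states, features):
    """Combined linear factor table beta = beta_in @ beta_out: regress
    polynomial features -> extracted features (beta_in), extracted features ->
    Lorenz tendencies (beta_out), then chain the two least-squares fits."""
    poly = np.stack([states[:, 0], states[:, 1], states[:, 2],
                     states[:, 0] * states[:, 1],
                     states[:, 0] * states[:, 2]], axis=1)
    beta_in, *_ = np.linalg.lstsq(poly, features, rcond=None)
    beta_out, *_ = np.linalg.lstsq(features, lorenz_tendency(states), rcond=None)
    return beta_in @ beta_out
```

With the trained denoising network in place of the linear stand-in, the recovered table only approximates the true factors, which is exactly what Table F1 quantifies.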
The source code for the experiments and the neural networks is publicly available at https://github.com/cereadaml/ddmattractor (last access: 17 September 2024) and on Zenodo (https://doi.org/10.5281/zenodo.8406184, Finn, 2023). With the source code, the Lorenz 1963 data needed for the experiments can be reproduced. Exemplary weights for one diffusion model are additionally contained in the repository. The authors will provide further access to the weights of the neural networks upon request.
TSF had the original research idea. TSF, CD, AF, and MB refined this idea. In discussions with LD, TSF found the idea for ensemble generation. TSF, LD, CD, AF, and MB analysed the results. TSF wrote the manuscript, and LD, CD, AF, and MB reviewed it.
The contact author has declared that none of the authors has any competing interests.
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
The authors would like to thank Sibo Cheng and an anonymous referee for helpful comments that improved the manuscript. CEREA is a member of the Institut Pierre Simon Laplace (IPSL).
The authors received financial support from the project GenD2M funded by the LEFEMANU programme of the INSU/CNRS and from the project SASIP (grant no. 353) supported by Schmidt Science – a philanthropic initiative that seeks to improve societal outcomes through the development of emerging science and technologies. This work was granted access to the HPC resources of IDRIS under the allocations 2022AD011013069R1 and 2023AD011013069R2 made by GENCI.
This paper was edited by Ioulia Tchiguirinskaia and reviewed by Sibo Cheng and one anonymous referee.
Alain, G. and Bengio, Y.: What Regularized AutoEncoders Learn from the DataGenerating Distribution, J. Mach. Learn. Res., 15, 3563–3593, 2014. a
Arcomano, T., Szunyogh, I., Pathak, J., Wikner, A., Hunt, B. R., and Ott, E.: A Machine LearningBased Global Atmospheric Forecast Model, Geophys. Res. Lett., 47, e2020GL087776, https://doi.org/10.1029/2020GL087776, 2020. a
Arnold, H. M., Moroz, I. M., and Palmer, T. N.: Stochastic Parametrizations and Model Uncertainty in the Lorenz '96 System, Philos. T. Roy. Soc. A, 371, 20110479, https://doi.org/10.1098/rsta.2011.0479, 2013. a
Bao, F., Zhang, Z., and Zhang, G.: A Scorebased Nonlinear Filter for Data Assimilation, arXiv [preprint], https://doi.org/10.48550/arXiv.2306.09282, 2023. a
Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., and Babenko, A.: LabelEfficient Semantic Segmentation with Diffusion Models, arXiv [preprint], https://doi.org/10.48550/arXiv.2112.03126, 2022. a
Bauer, P., Thorpe, A., and Brunet, G.: The Quiet Revolution of Numerical Weather Prediction, Nature, 525, 47–55, https://doi.org/10.1038/nature14956, 2015. a
Bauer, P., Dueben, P. D., Hoefler, T., Quintino, T., Schulthess, T. C., and Wedi, N. P.: The Digital Revolution of Earthsystem Science, Nat. Comput. Sci., 1, 104–113, https://doi.org/10.1038/s43588021000230, 2021a. a
Bauer, P., Stevens, B., and Hazeleger, W.: A Digital Twin of Earth for the Green Transition, Nat. Clim. Change, 11, 80–83, https://doi.org/10.1038/s4155802100986y, 2021b. a
Bengio, Y., Yao, L., Alain, G., and Vincent, P.: Generalized Denoising AutoEncoders as Generative Models, in: Advances in Neural Information Processing Systems, vol. 26, Curran Associates, Inc., ISBN 9781713845393, 2013. a
Bishop, C. H., Etherton, B. J., and Majumdar, S. J.: Adaptive Sampling with the Ensemble Transform Kalman Filter. Part I: Theoretical Aspects, Mon. Weather Rev., 129, 420–436, https://doi.org/10.1175/15200493(2001)129<0420:ASWTET>2.0.CO;2, 2001. a
Bocquet, M.: Ensemble Kalman filtering without the intrinsic need for inflation, Nonlin. Processes Geophys., 18, 735–750, https://doi.org/10.5194/npg187352011, 2011. a, b
Bonavita, M. and Laloyaux, P.: Machine Learning for Model Error Inference and Correction, J. Adv. Model. Earth Sy., 12, e2020MS002232, https://doi.org/10.1029/2020MS002232, 2020. a
Bortoli, V. D., Thornton, J., Heng, J., and Doucet, A.: Diffusion schrödinger bridge with applications to scorebased generative modeling, Advances in neural information processing systems, Curran Associates, vol. 34, 17695–17709, ISBN 9781713845393, 2021. a
Buizza, R., Houtekamer, P. L., Pellerin, G., Toth, Z., Zhu, Y., and Wei, M.: A Comparison of the ECMWF, MSC, and NCEP Global Ensemble Prediction Systems, Mon. Weather Rev., 133, 1076–1097, https://doi.org/10.1175/MWR2905.1, 2005. a
Cachay, S. R., Zhao, B., James, H., and Yu, R.: DYffusion: A Dynamicsinformed Diffusion Model for Spatiotemporal Forecasting, arXiv [preprint], https://doi.org/10.48550/arXiv.2306.01984, 2023. a
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations, in: Proceedings of the 37th International Conference on Machine Learning, 1597–1607, PMLR, 119, ISSN 26403498, 2020. a
Chen, T., Liu, G.H., and Theodorou, E. A.: Likelihood Training of Schrödinger Bridge Using ForwardBackward SDEs Theory, arXiv [preprint], https://doi.org/10.48550/arXiv.2110.11291, 2023. a
Chen, T.C., Penny, S. G., Whitaker, J. S., Frolov, S., Pincus, R., and Tulich, S.: Correcting Systematic and StateDependent Errors in the NOAA FV3GFS Using Neural Networks, J. Adv. Model. Earth Sy., 14, e2022MS003309, https://doi.org/10.1029/2022MS003309, 2022. a
De Bortoli, V.: Convergence of Denoising Diffusion Models under the Manifold Hypothesis, arXiv [preprint], https://doi.org/10.48550/arXiv.2208.05314, 2022. a
Demaeyer, J., Penny, S. G., and Vannitsem, S.: Identifying Efficient Ensemble Perturbations for Initializing SubseasonalToSeasonal Prediction, J. Adv. Model. Earth Sy., 14, e2021MS002828, https://doi.org/10.1029/2021MS002828, 2022. a
Devlin, J., Chang, M.W., Lee, K., and Toutanova, K.: BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186, https://doi.org/10.18653/v1/N191423, 2019. a
Dhariwal, P. and Nichol, A.: Diffusion Models Beat GANs on Image Synthesis, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 34, 8780–8794, ISBN 9781713845393, 2021. a, b
Dockhorn, T., Vahdat, A., and Kreis, K.: ScoreBased Generative Modeling with CriticallyDamped Langevin Diffusion, arXiv [preprint], https://doi.org/10.48550/arXiv.2112.07068, 2022. a
Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.W.: Unified Language Model Pretraining for Natural Language Understanding and Generation, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 32, ISBN 9781713807933, 2019. a
Efron, B.: Tweedie's Formula and Selection Bias, J. Am. Stat. Assoc., 106, 1602–1614, 2011. a
Evensen, G.: The Ensemble Kalman Filter: Theoretical Formulation and Practical Implementation, Ocean Dynam., 53, 343–367, https://doi.org/10.1007/s1023600300369, 2003. a, b
Falcon, W., Borovec, J., Wälchli, A., Eggert, N., Schock, J., Jordan, J., Skafte, N., Ir1dXD, Bereznyuk, V., Harris, E., Murrell, T., Yu, P., Præsius, S., Addair, T., Zhong, J., Lipin, D., Uchida, S., Bapat, S., Schröter, H., Dayma, B., Karnachev, A., Kulkarni, A., Komatsu, S., Martin.B, SCHIRATTI, J.B., Mary, H., Byrne, D., Eyzaguirre, C., cinjon, and Bakhtin, A.: PyTorchLightning: 0.7.6 Release, Zenodo [code], https://doi.org/10.5281/zenodo.3828935, 2020. a
Farchi, A., Laloyaux, P., Bonavita, M., and Bocquet, M.: Using Machine Learning to Correct Model Error in Data Assimilation and Forecast Applications, Q. J. Roy. Meteor. Soc., 147, 3067–3084, https://doi.org/10.1002/qj.4116, 2021. a
Finn, T.: Ddmattractor (Version initial_submission), Zenodo [code], https://doi.org/10.5281/zenodo.8406184, 2023. a, b
Finn, T. S., Durand, C., Farchi, A., Bocquet, M., Chen, Y., Carrassi, A., and Dansereau, V.: Deep learning subgridscale parametrisations for shortterm forecasting of seaice dynamics with a Maxwell elastobrittle rheology, The Cryosphere, 17, 2965–2991, https://doi.org/10.5194/tc1729652023, 2023. a
Gagne II, D. J., Christensen, H. M., Subramanian, A. C., and Monahan, A. H.: Machine Learning for Stochastic Parameterization: Generative Adversarial Networks in the Lorenz '96 Model, J. Adv. Model. Earth Sy., 12, e2019MS001896, https://doi.org/10.1029/2019MS001896, 2020. a
Grooms, I.: Analog Ensemble Data Assimilation and a Method for Constructing Analogs with Variational Autoencoders, Q. J. Roy. Meteor. Soc., 147, 139–149, https://doi.org/10.1002/qj.3910, 2021. a
Grooms, I., Renaud, C., Stanley, Z., and Yang, L. M.: Analog Ensemble Data Assimilation in a Quasigeostrophic Coupled Model, Q. J. Roy. Meteo. Soc., 149, 1018–1037, https://doi.org/10.1002/qj.4446, 2023. a
Hamill, T. M. and Snyder, C.: A Hybrid Ensemble Kalman Filter–3D Variational Analysis Scheme, Mon. Weather Rev., 128, 2905–2919, https://doi.org/10.1175/15200493(2000)128<2905:AHEKFV>2.0.CO;2, 2000. a
He, K., Zhang, X., Ren, S., and Sun, J.: Deep Residual Learning for Image Recognition, arXiv [preprint], https://doi.org/10.48550/arXiv.1512.03385, 2015. a
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R.: Masked Autoencoders Are Scalable Vision Learners, arXiv [preprint], https://doi.org/10.48550/arXiv.2111.06377, 2021. a
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S.: GANs Trained by a Two TimeScale Update Rule Converge to a Local Nash Equilibrium, arXiv [preprint], https://doi.org/10.48550/arXiv.1706.08500, 2017. a
Ho, J., Jain, A., and Abbeel, P.: Denoising diffusion probabilistic models, in: Advances in neural information processing systems, edited by: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., Curran Associates, Inc., ISBN 9781713829546, Vol. 33, 6840–6851, 2020. a, b, c, d, e, f, g
Hoffmann, S. and Lessig, C.: AtmoDist: Selfsupervised Representation Learning for Atmospheric Dynamics, Environmental Data Sci., 2, e6, https://doi.org/10.1017/eds.2023.1, 2023. a
Holzschuh, B., Vegetti, S., and Thuerey, N.: Solving inverse physics problems with score matching, in: Advances in neural information processing systems, edited by: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S., Curran Associates, Inc., Vol. 36, 61888–61922, ISBN 9781713899921, 2023. a
Hunt, B. R., Kostelich, E. J., and Szunyogh, I.: Efficient Data Assimilation for Spatiotemporal Chaos: A Local Ensemble Transform Kalman Filter, Physica D, 230, 112–126, https://doi.org/10.1016/j.physd.2006.11.008, 2007. a
Hyvärinen, A.: Estimation of NonNormalized Statistical Models by Score Matching, J. Mach. Learn. Res., 6, 695–709, 2005. a
Jolicoeur-Martineau, A., Li, K., Piché-Taillefer, R., Kachman, T., and Mitliagkas, I.: Gotta Go Fast When Generating Data with Score-Based Models, arXiv [preprint], https://doi.org/10.48550/arXiv.2105.14080, 2022. a
Kingma, D., Salimans, T., Poole, B., and Ho, J.: Variational Diffusion Models, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 34, 21696–21707, ISBN 9781713845393, 2021. a, b
Kingma, D. P. and Ba, J.: Adam: A Method for Stochastic Optimization, arXiv [preprint], https://doi.org/10.48550/arXiv.1412.6980, 2017. a
Kretschmer, M., Hunt, B. R., and Ott, E.: Data Assimilation Using a Climatologically Augmented Local Ensemble Transform Kalman Filter, Tellus A, 67, 26617, https://doi.org/10.3402/tellusa.v67.26617, 2015. a
Latif, M.: The Roadmap of Climate Models, Nat. Comput. Sci., 2, 536–538, https://doi.org/10.1038/s43588022003220, 2022. a
Lguensat, R., Tandeo, P., Ailliot, P., Pulido, M., and Fablet, R.: The Analog Data Assimilation, Mon. Weather Rev., 145, 4093–4107, https://doi.org/10.1175/MWRD160441.1, 2017. a
Li, L., Carver, R., LopezGomez, I., Sha, F., and Anderson, J.: SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models, arXiv [preprint], https://doi.org/10.48550/arXiv.2306.14066, 2023a. a
Li, X., Feng, M., Ran, Y., Su, Y., Liu, F., Huang, C., Shen, H., Xiao, Q., Su, J., Yuan, S., and Guo, H.: Big Data in Earth System Science and Progress towards a Digital Twin, Nature Rev. Earth Environ., 4, 319–332, https://doi.org/10.1038/s4301702300409w, 2023b. a
Lorenc, A. C.: The Potential of the Ensemble Kalman Filter for NWP – a Comparison with 4DVar, Q. J. Roy. Meteor. Soc., 129, 3183–3203, https://doi.org/10.1256/qj.02.132, 2003. a
Lorenz, E. N.: Deterministic Nonperiodic Flow, J. Atmos. Sci., 20, 130–141, https://doi.org/10.1175/15200469(1963)020<0130:DNF>2.0.CO;2, 1963. a, b
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J.: DPMSolver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps, arXiv [preprint], https://doi.org/10.48550/arXiv.2206.00927, 2022. a
Luo, C.: Understanding Diffusion Models: A Unified Perspective, arXiv [preprint], https://doi.org/10.48550/arXiv.2208.11970, 2022. a
Luo, G., Dunlap, L., Park, D. H., Holynski, A., and Darrell, T.: Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence, arXiv [preprint], https://doi.org/10.48550/arXiv.2305.14334, 2023. a
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., and Ermon, S.: SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, arXiv [preprint], https://doi.org/10.48550/arXiv.2108.01073, 2022. a, b
Mittal, S., Abstreiter, K., Bauer, S., Schölkopf, B., and Mehrjou, A.: Diffusion Based Representation Learning, in: Proceedings of the 40th International Conference on Machine Learning, PMLR, 202, 24963–24982, ISSN 26403498, 2023. a
Molteni, F., Buizza, R., Palmer, T. N., and Petroliagis, T.: The ECMWF Ensemble Prediction System: Methodology and Validation, Q. J. Roy. Meteor. Soc., 122, 73–119, https://doi.org/10.1002/qj.49712252905, 1996. a
Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K., and Grover, A.: ClimaX: A Foundation Model for Weather and Climate, arXiv [preprint], https://doi.org/10.48550/arXiv.2301.10343, 2023. a
Nichol, A. and Dhariwal, P.: Improved Denoising Diffusion Probabilistic Models, arXiv [preprint], https://doi.org/10.48550/arXiv.2102.09672, 2021. a, b, c, d, e
Oke, P. R., Allen, J. S., Miller, R. N., Egbert, G. D., and Kosro, P. M.: Assimilation of Surface Velocity Data into a Primitive Equation Coastal Ocean Model, J. Geophys. Res.Oceans, 107, 51–525, https://doi.org/10.1029/2000JC000511, 2002. a, b
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.: PyTorch: An Imperative Style, HighPerformance Deep Learning Library, in: Advances in Neural Information Processing Systems 32, edited by: Wallach, H., Larochelle, H., Beygelzimer, A., AlchéBuc, F., Fox, E., and Garnett, R., Curran Associates, Inc., 8024–8035, ISBN 9781713807933, 2019. a
Peebles, W. and Xie, S.: Scalable Diffusion Models with Transformers, arXiv [preprint], https://doi.org/10.48550/arXiv.2212.09748, 2023. a
Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A.: FiLM: Visual Reasoning with a General Conditioning Layer, arXiv [preprint], https://doi.org/10.48550/arXiv.1709.07871, 2017. a
Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M.: GenCast: Diffusion-based Ensemble Forecasting for Medium-Range Weather, arXiv [preprint], https://doi.org/10.48550/arXiv.2312.15796, 2024. a
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I.: Improving Language Understanding by Generative Pre-Training, https://openai.com/research/language-unsupervised (last access: 18 September 2024), 2018. a
Rahimi, A. and Recht, B.: Random Features for Large-Scale Kernel Machines, in: Advances in Neural Information Processing Systems, 1177–1184, ISBN 9781605603520, 2007. a
Reich, S.: Data Assimilation: The Schrödinger Perspective, Acta Numerica, 28, 635–711, https://doi.org/10.1017/S0962492919000011, 2019. a
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B.: High-Resolution Image Synthesis With Latent Diffusion Models, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, 10674–10685, https://doi.org/10.1109/CVPR52688.2022.01042, 2022. a
Rozet, F. and Louppe, G.: Score-Based Data Assimilation, Adv. Neur. Inf., 36, 40521–40541, 2023. a
Salimans, T. and Ho, J.: Progressive Distillation for Fast Sampling of Diffusion Models, arXiv [preprint], https://doi.org/10.48550/arXiv.2202.00512, 2022. a, b, c, d
Scher, S. and Messori, G.: Generalization properties of feedforward neural networks trained on Lorenz systems, Nonlin. Processes Geophys., 26, 381–399, https://doi.org/10.5194/npg-26-381-2019, 2019. a, b
Schraff, C., Reich, H., Rhodin, A., Schomburg, A., Stephan, K., Periáñez, A., and Potthast, R.: Kilometre-Scale Ensemble Data Assimilation for the COSMO Model (KENDA), Q. J. Roy. Meteor. Soc., 142, 1453–1472, https://doi.org/10.1002/qj.2748, 2016. a
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S.: Deep Unsupervised Learning Using Nonequilibrium Thermodynamics, arXiv [preprint], https://doi.org/10.48550/arXiv.1503.03585, 2015. a
Song, Y., Garg, S., Shi, J., and Ermon, S.: Sliced Score Matching: A Scalable Approach to Density and Score Estimation, in: Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR, 115, 574–584, ISSN 2640-3498, 2019. a
Song, J., Meng, C., and Ermon, S.: Denoising Diffusion Implicit Models, in: International Conference on Learning Representations, ISBN 9798331300081, 2021a. a, b, c, d, e, f, g, h
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B.: Score-Based Generative Modeling through Stochastic Differential Equations, arXiv [preprint], https://doi.org/10.48550/arXiv.2011.13456, 2021. a, b, c, d, e
Sutherland, D. J. and Schneider, J.: On the Error of Random Fourier Features, arXiv [preprint], https://doi.org/10.48550/arXiv.1506.02785, 2015. a
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.: Going Deeper with Convolutions, arXiv [preprint], https://doi.org/10.48550/arXiv.1409.4842, 2014. a
Tandeo, P., Ailliot, P., and Sévellec, F.: Data-driven reconstruction of partially observed dynamical systems, Nonlin. Processes Geophys., 30, 129–137, https://doi.org/10.5194/npg-30-129-2023, 2023. a
Toth, Z. and Kalnay, E.: Ensemble Forecasting at NMC: The Generation of Perturbations, B. Am. Meteorol. Soc., 74, 2317–2330, https://doi.org/10.1175/1520-0477(1993)074<2317:EFANTG>2.0.CO;2, 1993. a
Van Rossum, G.: Python Tutorial, Technical Report CS-R9526, Tech. rep., Centrum voor Wiskunde en Informatica (CWI), Amsterdam, ISSN 0169-118X, 1995. a
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I.: Attention Is All You Need, arXiv [preprint], arXiv:1706.03762, https://doi.org/10.48550/arXiv.1706.03762, 2017. a, b
Vincent, P.: A Connection Between Score Matching and Denoising Autoencoders, Neural Comput., 23, 1661–1674, https://doi.org/10.1162/NECO_a_00142, 2011. a
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A.: Extracting and Composing Robust Features with Denoising Autoencoders, Proceedings of the 25th international conference on Machine learning, 1096–1103, https://doi.org/10.1145/1390156.1390294, 2008. a
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., and Bottou, L.: Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion., J. Mach. Learn. Res., 11, 3371–3408, 2010. a
Vlachas, P. R., Pathak, J., Hunt, B. R., Sapsis, T. P., Girvan, M., Ott, E., and Koumoutsakos, P.: Backpropagation Algorithms and Reservoir Computing in Recurrent Neural Networks for the Forecasting of Complex Spatiotemporal Dynamics, Neural Networks, 126, 191–217, https://doi.org/10.1016/j.neunet.2020.02.016, 2020. a
Xiang, W., Yang, H., Huang, D., and Wang, Y.: Denoising Diffusion Autoencoders Are Unified Self-supervised Learners, arXiv [preprint], https://doi.org/10.48550/arXiv.2303.09769, 2023. a
Yang, G. and Sommer, S.: A Denoising Diffusion Model for Fluid Field Prediction, arXiv [preprint], https://doi.org/10.48550/arXiv.2301.11661, 2023. a
Yang, L. M. and Grooms, I.: Machine Learning Techniques to Construct Patched Analog Ensembles for Data Assimilation, J. Comput. Phys., 443, 110532, https://doi.org/10.1016/j.jcp.2021.110532, 2021. a
Yang, X. and Wang, X.: Diffusion Model as Representation Learner, 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 18892–18903, https://doi.org/10.1109/ICCV51070.2023.01736, 2023. a
Zhang, Z., Zhao, Z., and Lin, Z.: Unsupervised Representation Learning from Pretrained Diffusion Probabilistic Models, Adv. Neur. Inf., 35, 22117–22130, 2022. a
 Abstract
 Introduction
 Denoising diffusion models
 Downstream tasks for unconditional denoising diffusion models
 Experiments
 Summary and discussion
 Conclusions
 Appendix A: A dynamical system point of view on denoising diffusion models
 Appendix B: Data assimilation prior distributions from denoising diffusion models
 Appendix C: Noise scheduler
 Appendix D: Configurations
 Appendix E: Validation metrics
 Appendix F: Additional results for the learned representation
 Code and data availability
 Author contributions
 Competing interests
 Disclaimer
 Acknowledgements
 Financial support
 Review statement
 References