Research article 14 Aug 2019
Research article  14 Aug 2019
Joint stateparameter estimation of a nonlinear stochastic energy balance model from sparse noisy data
 ^{1}Department of Mathematics, Johns Hopkins University, Baltimore, Maryland, USA
 ^{2}Institut für Umweltphysik, RuprechtKarlsUniversität Heidelberg, Heidelberg, Germany
 ^{3}Institut für Geowissenschaften und Meteorologie, Rheinische FriedrichWilhelmsUniversität Bonn, Bonn, Germany
 ^{4}School of Earth and Ocean Sciences, University of Victoria, Victoria, British Columbia, Canada
 ^{1}Department of Mathematics, Johns Hopkins University, Baltimore, Maryland, USA
 ^{2}Institut für Umweltphysik, RuprechtKarlsUniversität Heidelberg, Heidelberg, Germany
 ^{3}Institut für Geowissenschaften und Meteorologie, Rheinische FriedrichWilhelmsUniversität Bonn, Bonn, Germany
 ^{4}School of Earth and Ocean Sciences, University of Victoria, Victoria, British Columbia, Canada
Correspondence: Fei Lu (feilu@math.jhu.edu)
Hide author detailsCorrespondence: Fei Lu (feilu@math.jhu.edu)
While nonlinear stochastic partial differential equations arise naturally in spatiotemporal modeling, inference for such systems often faces two major challenges: sparse noisy data and illposedness of the inverse problem of parameter estimation. To overcome the challenges, we introduce a strongly regularized posterior by normalizing the likelihood and by imposing physical constraints through priors of the parameters and states.
We investigate joint parameterstate estimation by the regularized posterior in a physically motivated nonlinear stochastic energy balance model (SEBM) for paleoclimate reconstruction. The highdimensional posterior is sampled by a particle Gibbs sampler that combines a Markov chain Monte Carlo (MCMC) method with an optimal particle filter exploiting the structure of the SEBM. In tests using either Gaussian or uniform priors based on the physical range of parameters, the regularized posteriors overcome the illposedness and lead to samples within physical ranges, quantifying the uncertainty in estimation. Due to the illposedness and the regularization, the posterior of parameters presents a relatively large uncertainty, and consequently, the maximum of the posterior, which is the minimizer in a variational approach, can have a large variation. In contrast, the posterior of states generally concentrates near the truth, substantially filtering out observation noise and reducing uncertainty in the unconstrained SEBM.
Physically motivated nonlinear stochastic (partial) differential equations (SDEs and SPDEs) are natural models of spatiotemporal processes with uncertainty in geoscience. In particular, such models arise in the problem of reconstructing geophysical fields from sparse and noisy data (see, e.g., Sigrist et al., 2015; Guillot et al., 2015; Tingley et al., 2012, and the references therein). The nonlinear differential equations, derived from physical principles, often come with unknown but physically constrained parameters also to be determined from data. This promotes the problem of joint stateparameter estimation from sparse and noisy data. When the parameters are interrelated, which is often the case in nonlinear models, their estimation can be an illposed inverse problem. Physical constraints on the parameters must then be taken into account. In variational approaches, physical constraints are imposed using a regularization term in a cost function, whose minimizer provides an estimator of the parameters and states. In a Bayesian approach, the physical constraints are encoded in prior distributions, extending the regularized cost function in the variational approach to a posterior and quantifying the estimation uncertainty. When the true parameters are known, the Bayesian approach has demonstrated great success in state estimation, thanks to the developments in Monte Carlo sampling and data assimilation techniques (see, e.g., Carrassi et al., 2018; Law et al., 2015; VetraCarvalho et al., 2018). However, the problem of joint stateparameter estimation, especially when the parameter estimation is illposed, has had relatively little success in nonlinear cases and remains a challenge (Kantas et al., 2009).
In this paper, we investigate a Bayesian approach for joint state and parameter estimation of a nonlinear twodimensional stochastic energy balance model (SEBM) in the context of spatial–temporal paleoclimate reconstructions of temperature fields from sparse and noisy data (Tingley and Huybers, 2010; Steiger et al., 2014; Fang and Li, 2016; Goosse et al., 2010). In particular, we consider a model of the energy balance of the atmosphere similar to those often used in idealized climate models (e.g., Fanning and Weaver, 1996; Weaver et al., 2001; Rypdal et al., 2015) to study climate variability and climate sensitivity. The use of such a model in paleoclimate reconstruction aims at improving the physical consistency of temperature reconstructions during, e.g., the last deglaciation and the Holocene by combining indirect observations, socalled proxy data, with physically motivated stochastic models.
The SEBM models surface air temperature, explicitly taking into account sinks, sources, and horizontal transport of energy in the atmosphere, with an additive stochastic forcing incorporated to account for unresolved processes and scales. The model takes the form of a nonlinear SPDE with unknown parameters to be inferred from data. These unknown parameters are associated with processes in the energy budget (e.g., radiative transfer, air–sea energy exchange) that are represented in a simplified manner in the SEBM, and may change with a changing climate. The parameters must fall in a prescribed range such that the SEBM is physically meaningful. Specifically, they must be in sufficiently close balance for the stationary temperature of the SEBM to be within a physically realistic range. As we will show, the parametric terms arising from this physically based model are strongly correlated, leading to a Fisher information matrix that is illconditioned. Therefore, the parameter estimation is an illposed inverse problem, and the maximum likelihood estimators of individual parameters have large variations and often fall out of the physical range.
To overcome the illposedness in parameter estimation, we introduce a new strongly regularized posterior by normalizing the likelihood and by imposing the physical constraints through priors on the parameters and the states, based on physical constraints and the climatological distribution. In the regularized posterior, the prior has the same weight as the normalized likelihood to enforce the support of the posterior to be in the physical range. Such a regularized posterior is a natural extension of the regularized cost function in a variational approach: the maximum of the posterior (MAP) is the same as the minimizer of the regularized cost function, but the posterior quantifies the uncertainty in the estimator.
The regularized posterior of the states and parameters is highdimensional and nonGaussian. It is represented by its samples, which provide an empirical approximation of the distribution and allow efficient computation of quantities of interest such as posterior means. The samples are drawn using a particle Gibbs sampler with ancestor sampling (PGAS, Lindsten et al., 2014), a special sampler in the family of particle Markov chain Monte Carlo (MCMC) methods (Andrieu et al., 2010) that combines the strengths of both MCMC and sequential Monte Carlo methods (see, e.g., Doucet and Johansen, 2011) to ensure the convergence of the empirical approximation to the highdimensional posterior. In the PGAS, we use an optimal particle filter that exploits the forward structure of the SEBM.
We consider two priors for the parameters, each based on their physical ranges: a uniform prior and a Gaussian prior with 3 standard deviations inside the range. We impose a prior for the states based on their overall climatological distribution. Tests show that the regularized posteriors overcome the illposedness and lead to samples of parameters and states within the physical ranges, quantifying the uncertainty in their estimation. Due to the regularization, the posterior of the parameters is supported on a relatively large range. Consequently, the MAP of the parameters has a large variation, and it is important to use the posterior to assess the uncertainty. In contrast, the posterior of the states generally concentrates near the truth, substantially filtering out the observational noise and reducing the uncertainty in state reconstruction.
Tests also show that the regularized posterior is robust to spatial sparsity of observations, with sparser observations leading to larger uncertainties. However, due to the need for regularization to overcome illposedness, the uncertainty in the posterior of the parameters can not be eliminated by increasing the number of observations in time. Therefore, we suggest alternative approaches, such as reparametrization of the nonlinear function according to the climatological distribution or nonparametric Bayesian inference (see, e.g., Müller and Mitra, 2013; Ghosal and Van der Vaart, 2017), to avoid illposedness.
The rest of the paper is organized as follows. Section 2 introduces the SEBM and its discretization, and formulates a statespace model. We also outline in this section the Bayesian approach to the joint parameterstate estimation and the particle MCMC samplers. Section 3 analyzes the illposedness of the parameter estimation problem and introduces the regularized posterior. The regularized posterior is sampled by PGAS and numerical results are presented in Sect. 4. Discussions and conclusions are presented in Sects. 5 and 6. Technical details of the estimation procedure are described in Appendix A.
After providing a brief physical introduction to the SEBM, we present its discretization and the observation model by representing them as a statespace model suitable for application of sequential Monte Carlo methods in Bayesian inference.
2.1 The stochastic energy balance model
The SEBM describes the evolution in space (both latitude and longitude) and time of the surface air temperature u(t,ξ):
where $\mathit{\xi}\in [\mathit{\pi},\mathit{\pi}]\times [\mathit{\pi}/\mathrm{2},\mathit{\pi}/\mathrm{2}]$ is the twodimensional coordinate on the sphere and the solution u(t,ξ) is periodic in longitude. Horizontal energy transport is represented as diffusion with diffusivity ν, while sources and sinks of atmospheric internal energy are represented by the nonlinear function g_{θ}(u):
with the unknown parameters θ. Upper and lower bounds of these three parameters, shown in Table 1, are derived from the energy balance model in Fanning and Weaver (1996), adjusted to current estimates of the Earth's global energy budget from Trenberth et al. (2009) using appropriate simplifications. The equilibrium solution of the SEBM for the average values of the parameters approximates the current global mean temperature closely, and the magnitude of sinks and sources approximates the respective magnitudes in Trenberth et al. (2009) well. The physical ranges of the parameters are very conservative and cover current estimates of the global mean temperature during the Quaternary (Snyder, 2016). The state variable and the parameters in the model have been nondimensionalized so that the equilibrium solution of Eq. (1) with f=0 is approximately equal to 1 and 1 time unit represents a year.
The nonlinear function g_{θ}(u) aggregates parametrizations from Fanning and Weaver (1996) for incoming shortwave radiation, outgoing longwave radiation, radiative air–surface flux, sensible air–surface heat flux, and the latent heat flux into the atmosphere according to their polynomial order. The quartic nonlinearity of the function g_{θ}(u) arises from the Stefan–Boltzmann dependence of longwave radiative fluxes on atmospheric temperature, while a linear feedback is included to represent state dependence of, e.g., surface energy fluxes and albedo. Inclusion of quadratic and cubic nonlinearities in g_{θ}(u) (to account for nonlinearities in the feedbacks just noted) was found to exacerbate the illposedness of the model without qualitatively changing the character of the model dynamics within the parameter range appropriate for the study of Quaternary climate variability (e.g., without admitting multiple deterministic equilibria associated with the ice–albedo feedback). In reality, the diffusivity ν and the parameters θ_{j}, $j=(\mathrm{0},\mathrm{1},\mathrm{4})$ will depend on latitude, longitude, and time. We will neglect this complexity in our idealized analysis.
The stochastic term f(t,ξ), which models the net effect of unresolved or oversimplified processes in the energy budget, is a centered Gaussian field that is white in time and colored in space, specified by an isotropic Matérn covariance function with order α=1 and scale ρ>0. That is,
with the covariance kernel C(r) being the Matérn covariance kernel given by
where Γ is the gamma function, ρ is a scaling factor, and K_{α} is the modified Bessel function of the second kind. We focus on the estimation of the parameters θ and assume that ν and the parameters of f are known. Estimating ν in energy balance models with data assimilation methods is studied in Annan et al. (2005), whereas estimation of parameters of f in the context of linear SPDEs is covered for example in Lindgren et al. (2011).
In a paleoclimate context, temperature observations are sparse (in space and time) and derived from climatic proxies, such as pollen assemblages, isotopic compositions, and tree rings, which are indirect measures of the climate state. To simplify our analysis, we neglect the potentially nonlinear transformations associated with the proxies and focus on the effect of observational sparseness. This is a common strategy in the testing of climate field reconstruction methods (e.g., Werner et al., 2013). As such, we take the data to be noisy observations of the solution at d_{o} locations:
for $i=\mathrm{1},\mathrm{\dots},{d}_{\mathrm{o}}$, where each ${\mathit{\xi}}_{i}\in [\mathit{\pi},\mathit{\pi}]\times [\mathit{\pi}/\mathrm{2},\mathit{\pi}/\mathrm{2}]$ is a location of observation, H is the observation operator, and ${\mathit{\u03f5}}_{i}\left(t\right)\sim \mathcal{N}(\mathrm{0},{\mathit{\sigma}}_{\mathit{\u03f5}}^{\mathrm{2}})$ are independent identically distributed (iid) Gaussian noise. The data are sparse in the sense that only a small number of the spatial locations are observed.
2.2 Statespace model representation
In practice, the differential equations are represented by their discretized systems and the observations are discrete in time; therefore, we consider only the statespace model based on a discretization of the SEBM. We refer the reader to Prakasa Rao (2001), Apte et al. (2007), Hairer et al. (2007), Maslowski and Tudor (2013), and Llopis et al. (2018) for studies about inference of SPDEs in a continuoustime setting.
2.2.1 The state model
We discretize the SPDE Eq. (1) using linear finite elements in space and a semibackward Euler method in time, using the computationally efficient Gaussian Markov random field approximation of the Gaussian field by Lindgren et al. (2011) (see details in Sect. A1). We write the discretized equation as a standard statespace model:
where ${\mathit{\mu}}_{\mathit{\theta}}:{\mathbb{R}}^{{d}_{\mathrm{b}}}\to {\mathbb{R}}^{{d}_{\mathrm{b}}}$ is the deterministic function and {W_{n}} is a sequence of iid Gaussian noise with mean zero and covariance R described in more detail in Eq. (A19). Here the subscript n is a time index. Therefore, the transition probability density ${p}_{\mathit{\theta}}\left({u}_{n+\mathrm{1}}\right{u}_{n})$, the probability density of U_{n+1} conditional on U_{n} and θ, is
2.2.2 The observation model
In discrete form, we assume that the locations of observation are the nodes of the finite elements. Then the observation function in Eq. (5) is simply ${H}_{i}\left({U}_{\mathrm{n}}\right)={U}_{\mathrm{n},{k}_{i}}$, with ${k}_{i}\in \mathit{\{}\mathrm{1},\mathrm{\dots},d\mathit{\}}$ denoting the index of the node under observation, for $i=\mathrm{1},\mathrm{\dots},{d}_{\mathrm{0}}$, and we can write the observation model as
where $\mathbf{H}\in {\mathbb{R}}^{{d}_{\mathrm{o}}\times {d}_{\mathrm{b}}}$ is called the observation matrix and {ϵ_{n}} is a sequence of iid Gaussian noise with distribution 𝒩(0,Q), where $\mathbf{Q}=\mathrm{Diag}\mathit{\left\{}{\mathit{\sigma}}_{i}^{\mathrm{2}}\mathit{\right\}}$. Equivalently, the probability of observing y_{n} given state U_{n} is
2.3 Bayesian inference for SSM
Given observations ${y}_{\mathrm{1}:N}:=({y}_{\mathrm{1}},\mathrm{\dots},{y}_{N})$, our goal is to jointly estimate the state ${U}_{\mathrm{1}:N}:=({U}_{\mathrm{1}},\mathrm{\dots},{U}_{N})$ and the parameter vector $\mathit{\theta}:=({\mathit{\theta}}_{\mathrm{0}},{\mathit{\theta}}_{\mathrm{1}},{\mathit{\theta}}_{\mathrm{4}})$ in the statespace model Eqs. (6)–(9). The Bayesian approach estimates the joint distribution of $({U}_{\mathrm{1}:N},\mathit{\theta})$ conditional on the observations by drawing samples to form an empirical approximation of the highdimensional posterior. The empirical posterior efficiently quantifies the uncertainty in the estimation. Therefore, the Bayesian approach has been widely used (see the review of Kantas et al., 2009, and the references therein).
Following Bayes' rule, the joint posterior distribution of $({U}_{\mathrm{1}:N},\mathit{\theta})$ can be written as
where p(θ) is the prior of the parameters and ${p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right)=\int {p}_{\mathit{\theta}}\left({u}_{\mathrm{1}:N}\right){p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right{u}_{\mathrm{1}:N})\mathrm{d}{u}_{\mathrm{1}:N}$ is the unknown marginal probability density function of the observations. In the importance sampling approximation to the posterior, we do not need to know the value of p_{θ}(y_{1:N}), because as a normalizing constant it will be cancelled out in the importance weights of samples. The quantity ${p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right{u}_{\mathrm{1}:N})$ is the likelihood of the observations y_{1:N} conditional on the state U_{1:N} and the parameter θ, which can be explicitly derived from the observation model Eq. (8):
with p(y_{n}u_{n}) given in Eq. (9). Finally, the probability density function of the state U_{1:N} given parameter θ can be derived from the state model Eq. (6):
with ${p}_{\mathit{\theta}}\left({u}_{n+\mathrm{1}}\right{u}_{n})$ specified by Eq. (7).
2.4 Sampling the posterior by particle MCMC methods
In practice, we are interested in the expectation of quantities of interest or the probability of certain events. These computations involve integrations of the posterior that can neither be computed analytically nor by numerical quadrature methods due to the curse of dimensionality: the posterior is a highdimensional nonGaussian distribution involving variables with a dimension at the scale of thousands to millions. Monte Carlo methods generate samples to approximate the posterior by the empirical distribution, so that quantities of interest can be computed efficiently.
MCMC methods are popular Monte Carlo methods (see, e.g., Liu, 2001) that generate samples along a Markov chain with the posterior as the invariant measure. For joint distributions of parameters and states, a standard MCMC method is Gibbs sampling which consists of alternatively updating the state variable U_{1:N} conditional on θ and y_{1:N} by sampling
and then updating the parameter θ conditional on ${U}_{\mathrm{1}:N}={u}_{\mathrm{1}:N}$ by sampling the marginal posterior of θ:
Due to the high dimensionality of U_{1:N}, a major difficulty in sampling $p\left({u}_{\mathrm{1}:N}\right\mathit{\theta},{y}_{\mathrm{1}:N})$ is the design of efficient proposal densities that can effectively explore the support of $p\left({u}_{\mathrm{1}:N}\right\mathit{\theta},{y}_{\mathrm{1}:N})$.
Another group of rapidly developing MC methods are sequential Monte Carlo (SMC) methods (Cappé et al., 2005; Doucet and Johansen, 2011) that exploit the sequential structure of statespace models to approximate the posterior densities $p\left({u}_{\mathrm{1}:n}\right\mathit{\theta},{y}_{\mathrm{1}:N})$ sequentially. SMC methods are efficient but suffer from the wellknown problem of depletion (or degeneracy), in which the marginal distribution $p\left({u}_{n}\right\mathit{\theta},{y}_{\mathrm{1}:N})$ becomes concentrated on a single sample as N−n increases (see Sect. A2 for more details).
The particle MCMC methods introduced in Andrieu et al. (2010) provide a framework for systematically combining SMC methods with MCMC methods, exploiting the strengths of both techniques. In the particle MCMC samplers, SMC algorithms provide highdimensional proposal distributions, and Markov transitions guide the SMC ensemble to sufficiently explore the target distribution. The transition is realized by a conditional SMC technique, in which a reference trajectory from the previous step is kept throughout the current step of SMC sampling.
In this study, we sample the posterior by PGAS (Lindsten et al., 2014), a particle MCMC method that enhances the mixing of the Markov chain by sampling the ancestor of the reference trajectory. For the SMC, we use an optimal particle filter, which takes advantage of the linear Gaussian observation model and the Gaussian transition density of the state variables in our current SEBM. More generally, when the observation model is nonlinear and the transition density is nonGaussian, the optimal particle filter can be replaced by implicit particle filters (Chorin and Tu, 2009; Morzfeld et al., 2012) or local particle filters (Penny and Miyoshi, 2016; Poterjoy, 2016; Farchi and Bocquet, 2018); we refer the reader to Carrassi et al. (2018), Law et al. (2015), and VetraCarvalho et al. (2018) for other data assimilation techniques. The details of the algorithm are provided in Sect. A3.
In this section, we first demonstrate and then analyze the failure of standard Bayesian inference of the parameters with the posteriors in Eq. (10). The standard Bayesian inference of the parameters fails in the sense that the posterior Eq. (10) tends to have a large probability mass at nonphysical parameter values. In the process of approximating the posterior by samples, the values of these samples often either hit the (upper or lower) bounds in Table 1 when we use a uniform prior or exceed these bounds when we use a Gaussian prior. As we shall show next, the standard Bayesian inverse problem is numerically illposed because the Fisher information matrix is illconditioned, which makes the inference numerically unreliable. Following the idea of regularization in variational approaches, we propose using regularized posteriors in the Bayesian inference. This approach unifies the Bayesian and variational approaches: the MAP is the minimizer of the regularized cost function in the variational approach, but the Bayesian approach quantifies the uncertainty of the estimator by the posterior.
3.1 Model settings and tests
Based on the physical upper and lower bounds in Table 1, we consider two priors for the parameters: a uniform distribution on these intervals and a Gaussian distribution centered at the median and with 3 standard deviations in the interval, as listed in Table 2.
Throughout this study, we shall consider a relatively small numerical mesh for the SPDE with only 12 nodes for the finite elements. Such a small mesh provides a toy model that can neatly represent the spatial structure on the sphere while allowing for systematic assessments of statistical properties of the Bayesian inference with moderate computational costs. Numerical tests show that the above FEM semibackward Euler scheme is stable for a time step size Δt=0.01 and a stochastic forcing with scale σ_{f}=0.1 (see Sect. A1 for more details about the discretization). A typical realization of the solution is shown in Fig. 1 (panels a and b), where we present the solution on the sphere at a fixed time with the 12node finiteelement mesh, as well as the trajectories of all 12 nodes.
The standard deviation of the observation noise is set to σ_{ϵ}=0.01, i.e., 1 order of magnitude smaller than the stochastic forcing and 2 orders of magnitude smaller than the climatological mean.
We first assume that 6 out of the 12 nodes are observed; we discuss results obtained using sparser or denser observations in the discussion section. Figure 1 also shows the climatological probability histogram of the true state variables and the partial noisy observations. The climatological distribution of the observations is close to that of the true state variables (with a slightly larger variance due to the noise). The histograms show that the state variables are centered around 1 and vary mostly in the interval [0.92,1.05]. We shall use a Gaussian approximation based on the climatological distribution of the partial noisy observations as a prior to constrain the state variables.
We summarize the settings of numerical tests in Table 3.
3.2 Illposedness of the standard Bayesian inference of parameters
By the Bernstein–von Mises theorem (see, e.g., Van der Vaart, 2000, chap. 10), the posterior distribution of the parameters conditional on the true state data approaches the likelihood distribution as the data size increases. That is, $p\left(\mathit{\theta}\right{u}_{\mathrm{1}:N})$ in Eq. (14) becomes close to the likelihood distribution $p\left({u}_{\mathrm{1}:N}\right\mathit{\theta})$ (which can be viewed as a distribution of θ) as the data size increases. Therefore, if the likelihood distribution is numerically degenerate (in the sense that some components are undetermined), then the Bayesian posterior will also become close to degenerate, so that the Bayesian inference for parameter estimation will be illposed. In the following, we show that for this model the likelihood is degenerate even if the full states are observed with zero observation noise and that the maximum likelihood estimators have large nonphysical fluctuations (particularly when the states are noisy). As a consequence, the standard Bayesian parameter inference fails by yielding nonphysical samples.
We show first that the likelihood distribution is numerically degenerate because the Fisher information matrix is illconditioned. Following the transition density Eq. (7), the log likelihood of the state {u_{1:N}} is
where c is a constant independent of $(\mathit{\theta},{u}_{\mathrm{1}:N})$. Since μ_{θ}(⋅) is linear in θ (cf. Eq. A19), the likelihood function is quadratic in θ and the corresponding scaled Fisher information matrix is
where the vectors ${G}_{\mathit{\theta},k}\left({u}_{n}\right)\in {\mathbb{R}}^{{d}_{\mathrm{b}}}$ are defined in Eq. (A20). As N→∞, the Fisher information matrix converges, by ergodicity of the system, to its expectation ${\left(\mathrm{\Delta}t{\mathit{\sigma}}_{\mathrm{f}}^{\mathrm{2}}\mathbb{E}\left[\right(\mathbf{A}{u}_{n}{)}^{\circ k}{A}_{\mathcal{T}}^{T}{\mathbf{C}}^{T}\mathbf{C}{A}_{\mathcal{T}}\left(\mathbf{A}{u}_{n}{)}^{\circ l}\right]\right)}_{k,l=\mathrm{0},\mathrm{1},\mathrm{4}}$, where the matrices A, A_{𝒯} and C, arising in the spatial–temporal discretization, are defined in Sect. A1. Intuitively, neglecting these matrices and viewing the vector u_{n} as a scalar, this expectation matrix could be reduced to $\left(\mathrm{\Delta}t{\mathit{\sigma}}_{\mathrm{f}}^{\mathrm{2}}\mathbb{E}\right[{u}_{n}^{k}{u}_{n}^{l}]{)}_{k,l=\mathrm{0},\mathrm{1},\mathrm{4}}$, which is illconditioned because u_{n} has a distribution concentrated near 1 with a standard deviation at the scale of 10^{−2} (see Fig. 1).
Figure 2 shows the means and standard deviations of the condition numbers (the ratio between the maximum and the minimum singular values) of the Fisher information matrices from 100 independent simulations. Each of these simulations generates a long trajectory of length 10^{5} using a parameter drawn randomly from the prior and computes the Fisher information matrices using the true trajectory of all 12 nodes, for subsamples of lengths N ranging from 10^{2} to 10^{5}. For both the Gaussian and uniform priors, the condition numbers are on the scale of 10^{8}–10^{11} and therefore the Fisher information matrix is illconditioned. In particular, the condition number increases as the data size increased, due to the illposedness of the inverse problem of parameter estimation.
The illconditioned Fisher information matrix leads to highly variable maximum likelihood estimators (MLEs), computed from F_{N}θ=b_{N} with ${b}_{N}=\frac{\mathrm{1}}{N}{\left({\sum}_{n=\mathrm{1}}^{N}{G}_{\mathit{\theta},k}\left({u}_{n}{)}^{T}{\mathbf{R}}^{\mathrm{1}}\right({u}_{n+\mathrm{1}}{\mathbf{M}}_{\mathrm{\Delta}t}^{\mathrm{1}}{\mathbf{M}}_{\mathrm{0}}{u}_{n})\right)}_{k=\mathrm{0},\mathrm{1},\mathrm{4}}$, which follows from Eq. (A20).
The illposedness is particularly problematic when {u_{1:N}} is observed with noise, as the illconditioned Fisher information matrix amplifies the noise in observations and leads to nonphysical estimators. Figure 3 shows the means and standard deviations of errors of MLEs computed from true and noisy trajectories in 100 independent simulations. In each of these simulations, the “noisy” trajectory is obtained by adding a white noise with standard deviation σ_{ϵ}=0.01 to a “true” trajectory generated from the system with a true parameter randomly drawn from the prior. For both Gaussian and uniform priors, the standard deviations and means of the errors of the MLE from the noisy trajectories are 1 order of magnitude larger than those from true trajectories. In particular, the variations are large when the data size is small. For example, when N=100, the standard deviation of the MLE for θ_{0} from noisy observations is on the order of 10^{3}, 2 orders of magnitude larger than its physical range in Table 2.
The standard deviations decrease as the data size increases, at the expected rate of $\mathrm{1}/\sqrt{N}$. However, the errors are too large to be practically reduced by increasing the size of the data: for example, a data size N=10^{10} is needed to reduce the standard deviation of θ_{4} to less than 0.1 (which is about 10 % the size of the physical range $[\mathrm{6.00},\mathrm{4.80}]$ as specified in Table 2). In summary, the illposedness leads to parameter estimators with large variations that are far outside the physical ranges of the parameters.
3.3 Regularized posteriors
To overcome the illposedness of the parameter estimation problem, we introduce strongly regularized posteriors by normalizing the likelihood function. In addition, to prevent unphysical values of the states, we further regularize the state variables in the likelihood by an uninformative climatological prior. That is, consider the regularized posterior:
where $Z:=\int p\left(\mathit{\theta}\right){\left[\frac{{p}^{c}\left({u}_{\mathrm{1}:N}\right){p}_{\mathit{\theta}}\left({u}_{\mathrm{1}:N}\right){p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right{u}_{\mathrm{1}:N})}{{p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right)}\right]}^{\mathrm{1}/N}\mathrm{d}\mathit{\theta}\mathrm{d}{u}_{\mathrm{1}:N}$ is a normalizing constant and p^{c}(u_{1:N}) is the prior of the states estimated from a Gaussian fit to climatological statistics of the observations, neglecting correlations. That is, we set p^{c}(u_{1:N}) as
with ${\mathit{\sigma}}_{c}=\mathrm{2}\sqrt{{\mathit{\sigma}}_{o}^{\mathrm{2}}{\mathit{\sigma}}_{\mathit{\u03f5}}^{\mathrm{2}}}$, where ${\stackrel{\mathrm{\u203e}}{u}}_{c}$ and σ_{o} are the mean and standard deviation of the observations over all states. Here the multiplicative factor 2 aims for a larger band to avoid an overly narrow prior for the states.
This prior can be viewed as a joint distribution of the state variables assuming all components are independent identically Gaussian distributed with mean ${\stackrel{\mathrm{\u203e}}{u}}_{c}$ and variance ${\mathit{\sigma}}_{c}^{\mathrm{2}}$. Clearly, it uses the minimum amount of information about the state variables, and we expect it can be improved by taking into consideration spatial correlations or additional field knowledge in practice.
The regularized posterior can be viewed as an extension of the regularized cost function in the variational approach. In fact, the negative logarithm of the regularized posterior is the same (up to a multiplicative factor $\frac{\mathrm{1}}{N}$ and an additive constant $\mathrm{log}Z\frac{\mathrm{1}}{N}\mathrm{log}{p}_{\mathit{\theta}}\left({y}_{\mathrm{1}:N}\right)$) as the cost function in variational approaches with regularization. More precisely, we have
where ${C}_{{y}_{\mathrm{1}:N}}(\mathit{\theta},{u}_{\mathrm{1}:N})$ is the cost function with regularization:
When the prior is Gaussian, the regularization corresponds to Tikhonov regularization. Therefore, the regularized posterior extends the regularized cost function to a probability distribution, with the MAP being the minimizer of the regularized cost function.
The regularized posterior normalizes the likelihood by an exponent 1∕N. This normalization allows for a larger weight (more trust) on the prior, which can then sufficiently regularize the singularity in the likelihood and therefore reduces the probability of nonphysical samples. Intuitively, it avoids the shrinking of the likelihood as the data size increases. When the system is ergodic, the sum $\frac{\mathrm{1}}{N}{\sum}_{n=\mathrm{1}}^{N}\mathrm{log}\left[{p}_{\mathit{\theta}}\right({u}_{n}\left{u}_{n\mathrm{1}}\right)p\left({y}_{n}\right{u}_{n}\left)\right]$ converges to the spatial average $\mathbb{E}\left[\mathrm{log}\right[{p}_{\mathit{\theta}}\left({U}_{n}\right{U}_{n\mathrm{1}}\left)p\right({y}_{n}\left{U}_{n}\right)]$ with respect to the invariant measure as N increases. While being effective, this factor may not be optimal (O'Leary, 2001), and we leave the exploration of optimal regularization factors to future work.
In the sampling of the regularized posterior, we update the state variable U_{1:N} conditional on θ and y_{1:N} by sampling ${p}^{c}\left({u}_{\mathrm{1}:N}\right){p}_{\mathit{\theta}}\left({u}_{\mathrm{1}:N}\right\mathit{\theta},{y}_{\mathrm{1}:N})$ (with ${p}_{\mathit{\theta}}\left({u}_{\mathrm{1}:N}\right\mathit{\theta},{y}_{\mathrm{1}:N})$ specified in Eq. 13) using SMC methods. Compared to the standard PMCMC algorithm outlined in Sect. 2.4, the only difference occurs when we update the parameter θ conditional on the estimated states u_{1:N}. Instead of Eq. (14), we draw a sample of θ from the regularized posterior
The regularized posteriors are approximated by the empirical distribution of samples drawn using particle MCMC methods, specifically PGAS (see Sect. A3) in combination with SMC using optimal importance sampling (see Sect. A2). In the following section, we first diagnose the Markov chain and choose a reasonable chain length for subsequent analyses. We then present the results of parameter estimation and state estimation.
In all the tests presented in this study, we use only M=5 particles for the SMC, as we can be confident of the Markov chain produced by the particle MCMC methods converging to the target distribution based on theoretical results (see Andrieu et al., 2010; Lindsten et al., 2014). In general, the more particles are used, the better the SMC algorithm (and hence the particle MCMC methods) will perform, at the price of increased computational cost.
4.1 Diagnosis of the Markov chain Monte Carlo algorithm
To ensure that the Markov chain generated by PGAS is wellmixed and to find a length for the chain such that the posterior is acceptably approximated, we shall assess the Markov chain by three criteria: the update rate of states; the correlation length of the Markov chain; and the convergence of the marginal posteriors of the parameters. These empirical criteria are convenient and, as we discuss below, have found to be effective in our study. We refer to Cowles and Carlin (1996) for a detailed review of various criteria for diagnosing MCMC.
The update rate of states is computed at each time of the state trajectory u_{1:N} along the Markov chain. That is, at each time, we say the state is updated from the previous step of the Markov chain if any entry of the state vector changes. The update rate measures the mixing of the Markov chain. In general, an update rate above 0.5 is preferred, but a high rate close to 1 is not necessarily the best. Figure 4 shows the update rates of typical simulations for both the Gaussian prior and the uniform prior. For both priors, the update rates are above 0.5, indicating a fast mixing of the chain. The rates tend to increase with time (except for the first time step) to a value close to 1 at the end of the trajectory. This phenomenon agrees with the particle depletion nature of the SMC filter: when tracing back in time to sample the ancestors, there are fewer particles and therefore the update rate is lower. The high update rate at the time t=1 step is due to our initialization of the particles near the equilibrium, which increases the possibility of ancestor updates in PGAS. We also note that the uniform prior has update rates close to 1 at all times, much higher than the rates of the Gaussian prior. Higher update rates occur for the uniform prior because the deviations of parameter samples from the previous values are larger, resulting in an increased probability of updating the reference trajectory in the conditional SMC.
We test the correlation length of the Markov chain by finding the smallest lag at which the empirical autocorrelation functions (ACFs) of the states and the parameters are close to zero.
Figure 5 shows the empirical ACFs of the parameters and states at representative nodes, computed using a Markov chain with length 10 000. The ACFs approach zero within a time lag of around 40 (based on a threshold value of 0.1) for the Gaussian prior, and within a time lag of around 5 for the uniform prior. The smaller correlation length in the uniform prior case is again due to the larger parameter variation in the uniform prior case than the Gaussian prior case.
The relatively small decorrelation length of the Markov chain indicates that we can accurately approximate the posterior by a chain of a relatively short length. This result is demonstrated in Fig. 6, where we plot the empirical marginal posteriors of the parameters, using Markov chains of three different lengths: $L=\mathrm{1000},\mathrm{5000}$, and 10 000. The marginal posteriors with L=1000 are reasonably close to those with L=10^{4}, and those with L=5000 are almost identical to those with L=10^{4}. In particular, the marginal posteriors with L=10^{3} capture the shape and spread of the distributions for L=10^{4}. Therefore, a Markov chain with length L=10^{4} provides a reasonably accurate approximation of the posterior. Hence, we use Markov chains with length L=10^{4} in all simulations from here on. This choice of chain length may be longer than necessary, but allows for confidence that the results are robust.
In summary, based on the above diagnosis of the Markov chain generated by PMCMC, to run many simulations for statistical analysis of the algorithm within a limited computation cost, we use chains with length L=10^{4} to approximate the posterior. For the SMC algorithm, we use only five particles. The number of observations in time is N=100.
4.2 Parameter estimation
One of the main goals in Bayesian inference is to quantify the uncertainty in the parameterstate estimation by the posterior. We access the parameter estimation by examining the samples of the posterior in a typical simulation, for which we consider the scatter plots and marginal distributions, the MAP, and the posterior mean. We also examine the statistics of the MAP and the posterior mean in 100 independent simulations. In each simulation, the parameters are drawn from the prior distribution of θ. Then, a realization of the SEBM is simulated. Finally, observations are created by applying the observation model to the SEBM realization.
The empirical marginal posteriors of the parameters $\mathit{\theta}=({\mathit{\theta}}_{\mathrm{0}},{\mathit{\theta}}_{\mathrm{1}},{\mathit{\theta}}_{\mathrm{4}})$ in two typical simulations, for the Gaussian and uniform priors, are shown in Fig. 7. The top row presents scatter plots of samples along with the true values of the parameters (asterisks) and the bottom row presents the marginal posteriors for each parameter in comparison with the priors.
In the case of the Gaussian prior, the scatter plots show a posterior that is far from Gaussian, with clear nonlinear dependence between θ_{0} and the other parameters. The marginal posteriors of θ_{0} and θ_{1} are close to their priors, with larger tails (to the left for θ_{0} and to the right for θ_{1}). The marginal distribution of θ_{4} concentrates near the center of the prior with a larger tail to the right. The posterior has the most probability mass near the true values of θ_{0} and θ_{1}, which are in the highprobability region of the prior. However, it has no probability mass near the true value of θ_{4} – which is of a low probability in the prior.
In the case of the uniform prior, the scatter plots show a concentration of probability near the boundaries of the physical range. The marginal posteriors of θ_{0} and θ_{1} clearly deviate from the priors, concentrating near the parameter bounds (the upper bound for θ_{0} and the lower bound for θ_{1} in this realization); the marginal posterior of θ_{4} is close to the prior, with slightly more probability mass for large values.
Further tests show that the posterior is not sensitive to changes in the true values of the parameters. This fact is demonstrated in Fig. 8, which presents the marginal distributions for another set of true values of the parameters (but without changing the priors). Though the data change when the true parameters change, the posteriors, in comparison with those in Fig. 7, change little for both cases of Gaussian and uniform prior.
The nonGaussianity of the posterior (including the concentration near the boundaries), its insensitivity to changes in the true parameter, and its limited reduction of uncertainty from the prior (Figs. 7–8) are due to the degeneracy of the likelihood distribution and to the strong regularization. Recall that the degenerate likelihood leads to MLEs with large variations and biases, with the standard deviation of the estimators of θ_{0} and θ_{1} being about 10 times larger than those of θ_{4} (see Fig. 3). As a result, when regularized by the Gaussian prior, the components θ_{0} and θ_{1}, which are more underdetermined by the likelihood, are constrained mainly by the Gaussian prior, and therefore their marginal posteriors are close to their marginal priors. In contrast, the component θ_{4} is forced to concentrate around the center of the prior but with a large tail. While dramatically reducing the large uncertainty of θ_{0} and θ_{1} in the illconditioned likelihood, the regularized posterior still exhibits a slightly larger uncertainty than the prior for these two components.
In the case of the uniform prior, it is particularly noteworthy that the marginal posteriors of θ_{0} and θ_{1} differ more from their priors than the parameter θ_{4}. These results are the opposite of what was found for the Gaussian prior. Such differences are due to the different mechanism of “regularization” by the two priors. The Gaussian prior eliminates the illposedness by regularizing the illconditioned Fisher information matrix with the covariance of the prior. So, the information in the likelihood, e.g., the bias and the correlations between (θ_{0},θ_{1}) and θ_{4}, is preserved in the regularized posterior. The uniform prior, on the other hand, cuts the support of the degenerate likelihood and rejects outofrange samples. As a result, the correlation between θ_{0} and θ_{1} is preserved in the regularized posterior because they feature similar variations, but the correlations between θ_{4} and θ_{0} as well as θ_{1} are weakened (Fig. 7).
In practice, one is often interested in a point estimate of parameters. Commonly used point estimators are the MAP and the posterior mean. Figures 7–8 show that both the MAP and the posterior mean can be far away from the truth for Gaussian as well as uniform priors. In particular, in the case of the uniform prior, the MAP values are further away from the truth than the posterior mean. In the case of the Gaussian prior, the MAP values do not present a clear advantage or disadvantage over the posterior mean.
Table 5a shows the means and standard deviations of the errors of the posterior mean and MAP from 100 independent simulations. In each simulation and for each prior, we drew a parameter sample from the prior and generated a trajectory of observations, and then estimated jointly the parameters and states. The table shows that both posterior mean and MAP estimates are generally biased, consistent with the biases in Figs. 7 and 8. More specifically, in the case of the Gaussian prior, the MAP has slightly smaller biases than the posterior mean, but the two have almost the same variances. Both are negatively biased for θ_{0} and slightly positively biased for θ_{1} and θ_{4}. In the case of the uniform prior, the MAP features biases and standard deviations which are about 50 % larger than those of the posterior mean. Both estimators exhibit large positive biases in θ_{0}, large negative biases in θ_{1}, and small positive biases in θ_{4}.
4.3 State estimates
The state estimation aims both to filter out the noise from the observed nodes and to estimate the states of unobserved nodes. We access the state estimation by examining the ensemble of the posterior trajectories in a typical simulation, for which we consider the marginal distributions and the coverage probability of 90 % credible intervals. We also examine the statistics of these quantities in 100 independent simulations.
We present the ensemble of posterior trajectories at an observed node in Fig. 9 and at an unobserved node in Fig. 10. In each of these figures, we present the ensemble mean with a 1standarddeviation band, in comparison with the true trajectories, superimposed on the ensembles of all sample trajectories at these nodes. We also present histograms of samples at three instants of time: t=20, t=60, and t=100.
Figure 9 shows that the trajectory of the observed node is well estimated by the ensemble mean, with a relative error of 0.7 %. Recall that the observation noise leads to a relative error of about 1 %, so the posterior filters out 30 % of the noise. Also note that the ensemble quantifies the uncertainty of the estimation, with the true trajectory being mostly enclosed within a 1standarddeviation band around the ensemble mean. Further, the histograms of samples at the three time instants show that the ensemble generally concentrates near the truth. In the Gaussian prior case, the peak of the histogram decreases as time increases, partially due to the degeneracy of SMC when we trace back the particles in time. In the uniform prior case, the ensembles are less concentrated than those in the Gaussian case, due to the wide spread of the parameter samples (Fig. 7).
Figure 10 shows sample trajectories of an unobserved node. Despite the fact that the node is unobserved, the posterior means have relative errors of 0.8 % and 3.3 % in the cases of Gaussian and uniform priors, respectively, with a 1standarddeviation band covering the true trajectory at most times. While the sparse observations do cause large uncertainties for both posteriors, the histograms of samples show that the ensembles concentrate near the truth. Particularly, in the case of Gaussian priors, the peaks of the histogram are close to the true states, even when the histograms form a multimodal distribution due to the degeneracy of SMC.
We find that the posterior is able to filter out the noise in the observed nodes and reduce the uncertainty in the unobserved nodes from the climatological distribution. In particular, in the case of the Gaussian prior, the ensemble of posterior samples concentrates near the true state at both observed and unobserved nodes and substantially reduces the uncertainty. In the case of the uniform prior, the ensemble of posterior samples spreads more widely and only slightly reduces the uncertainty.
The coverage probability (CP), the proportion of the states whose 90 % credible intervals contain the true values, is 95 % in the Gaussian prior case and 92 % for the uniform prior in the above simulation. The target probability is 90 %, as in this case 90 % of the true values would be covered by 90 % credible intervals. The values indicate statistically meaningful uncertainty estimates, for example larger uncertainty ranges at nodes with higher mean errors. The slight overdispersiveness, i.e., higher CPs than the target probabilities, might be a result of the large uncertainty in the parameter estimates.
Table 6 shows the means and standard deviations of the relative errors and CPs in state estimation by the posterior mean in 100 independent simulations, averaging over observed and unobserved notes. The relative errors at each time t are computed by averaging the error of the ensemble mean (relative to the true value) over all the nodes. The relative error of the trajectory is the average over all times along the trajectory. The relative errors are 1.14 % and 2.39 %, respectively, for the cases of Gaussian and uniform priors. These numbers are a result of averaging over the observed and unobserved nodes. Note that the relative errors are similar at different times $t=(\mathrm{20},\mathrm{60},\mathrm{100})$, indicating that the MCMC is able to ameliorate the degeneracy of the SMC to faithfully sample the posterior of the states.
In the Gaussian prior case, the CPs are above the target probability in the 100 independent simulations, with a mean of 96 %. This supports the finding from above that the posteriors are slightly overdispersive due to the large uncertainty in the parameter estimates. The standard deviation is very small, with 2 %, which indicates the robustness of the Gaussian prior model. In the uniform prior case, the CPs are much lower, with a mean of 73 %. This might be a result of larger biases compared to the Gaussian prior case which are not compensated by larger uncertainty estimates. In addition, the standard deviation is much higher in the uniform prior case, with 31 %. This shows that this case is less robust than the Gaussian prior case.
5.1 Observing fewer nodes
We tested the consequences of having sparser observations in space, e.g., observing only 2 out of the 12 nodes. In the Gaussian prior case, in a typical simulation with the same true parameters and observation data as in Sect. 4.2, the relative error in state estimation increases slightly, from 0.7 % to 0.8 % for the observed node and from 0.8 % to 1.1 % for the unobserved node. As a result, the overall error increases. The parameter estimates show small but noticeable changes (see Fig. 11): the posteriors of the parameters have slightly wider support and the posterior means and MAPs exhibit slightly larger errors than those in Sect. 4.2.
We also ran 100 independent simulations to investigate sampling variability in the state and parameter estimates. Table 6b reports the means and standard deviations of the relative errors of the posterior mean trajectory, and CPs for state estimation in these simulations. The Gaussian prior case shows small increases in both the means and the standard deviations of errors, as well as slightly lower and less robust CPs. This confirms the results quoted above for a typical simulation. The uniform prior case shows almost negligible error and CP increases. Table 5b reports the means and standard deviations of the posterior means and MAP for parameter estimation in these simulations. Small changes in comparison to the results in Table 5a are found. These small changes are due to the strong regularization that has been introduced to overcome the degeneracy of the likelihood.
5.2 Observing a longer trajectory
When the length N of the trajectory of observation increases, the exponent of the regularized posterior Eq. (19), viewed as a function of θ only, tends to its expectation with respect to the ergodic measure of the system, i.e., $\frac{\mathrm{1}}{N}{C}_{{y}_{\mathrm{1}:N}}(\mathit{\theta},{u}_{\mathrm{1}:N})\stackrel{N\to \mathrm{\infty}}{\mathit{\u27f6}}\mathbb{E}\left[{C}_{{y}_{\mathrm{1}:N}}\right(\mathit{\theta},{u}_{\mathrm{1}:N}\left)\right]$ almost surely. As a result, the marginal posterior tends to be stable as N increases. This result indicates that an increase in data size has a limited effect on the regularized posterior of parameters. This fact is verified by numerical tests with N=1000, in which the marginal posteriors only have a slightly wider support than those in Fig. 7 with N=100.
In general, the number of observations needed for the posterior to reach a steady state depends on the dimension of the parameters and the speed of convergence to the ergodic measure of the system. Here we have only three parameters and the SEBM converges to its stationary measure exponentially (in fewer than 10 time steps); therefore, N=100 is large enough to make the posterior be close to the steady state.
When the trajectory is long, a major issue is the computational cost from sampling the posterior of the states. Note that as N increases, the dimension of the states in the posterior increases, demanding a longer Markov chain to explore the target distribution. In numerical tests with N=1000, the correlation length of the Markov chain is at least 100, about 4 times the correlation length found for N=100. Therefore, to obtain the same number of effective samples as before, we would need a Markov chain with length at least 4 times the previous length, say, $L=\mathrm{4}\times {\mathrm{10}}^{\mathrm{4}}$. The computational cost increases linearly in NL, with each step requiring an integration of the SPDE. The high computational cost, an instance of the wellknown “curse of dimensionality”, renders the direct sampling of the posterior unfeasible. Two groups of methods could reduce the computational cost and make the Bayesian inference feasible. The first group of methods, dynamical model reduction, exploits the lowdimensional structure of the stochastic process to develop lowdimensional dynamical models which efficiently reproduce the statistical–dynamical properties needed in the SMC (see, e.g., Chorin and Lu, 2015; Lu et al., 2017; Chekroun and Kondrashov, 2017; Khouider et al., 2003, and the references therein). The other group of methods approximates the marginal posterior of the parameter by reducedorder models for the response of the data to parameters (see, e.g., Marzouk and Najm, 2009; Branicki and Majda, 2013; Cui et al., 2015; Chorin et al., 2016; Lu et al., 2015; Jiang and Harlim, 2018). In a paleoclimate reconstruction context, the number of observations will generally be determined by available observations and the length of the reconstruction period rather than by computational considerations. We leave these further developments of efficient sampling methods for long trajectories as a direction of future research.
5.3 Estimates of the nonlinear function
One goal of parameter estimation is to identify the nonlinear function g_{θ} (specified in Eq. 2) in the SEBM. The posterior of the parameters also quantifies the uncertainty in the identification of g_{θ}. Figure 12 shows the nonlinear function g_{θ} associated with the true parameters and with the MAPs and posterior means presented in Fig. 7, superposed on an ensemble of the nonlinear function evaluated with all the samples. Note that in the Gaussian prior case, the true and estimated functions g_{θ} are close even though θ_{4} is estimated with large biases by either the posterior mean or by the MAP. In the uniform prior case, the posterior mean has a smaller error than the MAP and leads to a better estimate of the nonlinear function. In either case, the large band of the ensemble represents a large uncertainty in the estimates.
For the Gaussian prior, neither the posterior distribution of the equilibrium state u_{e} (for which g_{θ}(u_{e})=0) nor of the feedback strength dg_{θ}∕du(u_{e}) is substantially changed from the corresponding priors. Both experience only a small reduction of uncertainty. In contrast, the posterior distributions are narrower than the priors for the uniform prior case – although the posterior means and MAPs are both biased.
5.4 Implications for paleoclimate reconstructions
Our analysis shows that assessing the wellposedness of the inverse problem of parameter estimation is a necessary first step for paleoclimate reconstructions making use of physically motivated parametric models. When the problem is illposed, a straightforward Bayesian inference will lead to biased and unphysical parameter estimates. We overcome this issue by using regularized posteriors, resulting in parameter estimates in the physically reasonable range with quantified uncertainty. However, it should be kept in mind that this approach relies strongly on highquality prior distributions.
The illposedness of the parameter estimation problem for the model we have considered is of particular interest because the form of the nonlinear function g_{θ}(u) is not arbitrary but is motivated by the physics of the energy budget of the atmosphere. The fact that wide ranges of the parameters θ_{i} are consistent with the “observations” even in this highly idealized setting indicates that surface temperature observations themselves may not be sufficient to constrain physically important parameters such as albedo, graybody thermal emissivity, or air–sea exchange coefficients separately. While statespace modeling approaches allow reconstruction of past surface climate states, it may be the case that the associated climate forcing may not contain sufficient information to extract the relative contributions of the individual physical processes that produced it. Further research will be necessary to understand whether the contribution of, e.g., a single process like graybody thermal emissivity can be reliably estimated from the observations if regularized posteriors are used to constrain the other parameters of g_{θ}(u).
If the purpose of using the SEBM is to introduce physical structure into the state reconstructions without specific concern regarding the parametric form of g, reparametrization or nonparametric Bayesian inference can be used to estimate the form of the nonlinear function g but avoid the illposedness of the parameter estimation problem. This is an option if the interest is in the posterior of the climate state and not in the individual contributions of energy sink and source processes.
Stateoftheart observation operators in paleoclimatology are often nonlinear and contain nonGaussian elements (Haslett et al., 2006; TolwinskiWard et al., 2011). A locally linearized observation model with data coming from the interpolation of proxy data can be used in the modeling framework we have considered, along with the assumption of Gaussian observation noise. Alternatively, it is also possible to first compute offline pointwise reconstructions by inverting the full observation operator, potentially interpolating the results in time, and using a Gaussian approximation of the pointwise posterior distributions as observations in the SEBM (e.g., Parnell et al., 2016). We anticipate that such simplified observation operators will limit the accuracy of the parameter estimation but that the regularized posterior would still be able to distinguish the most likely states and quantify the uncertainty in the estimation. Directly using nonlinear, nonGaussian observation operators requires a more sophisticated particle filter as optimal filtering is no longer possible. Such approaches will increase the computational cost and face difficulties in avoiding filter degeneracy.
We have investigated the joint stateparameter estimation of a nonlinear stochastic energy balance model (SEBM) motivated by the problem of spatial–temporal paleoclimate reconstruction from sparse and noisy data, for which parameter estimation is an illposed inverse problem. We introduced strongly regularized posteriors to overcome the illposedness by restricting the parameters and states to physical ranges and by normalizing the likelihood function. We considered both a uniform prior and a more informative Gaussian prior based on the physical ranges of the parameters. We sampled the regularized highdimensional posteriors by a particle Gibbs with ancestor sampling (PGAS) sampler that combines Markov chain Monte Carlo (MCMC) with an optimal particle filter to exploit the forward structure of the SEBM.
Results show that the regularization overcomes the illposedness in parameter estimation and leads to physical posteriors quantifying the uncertainty in parameterstate estimation. Due to the illposedness, the posterior of the parameters features a relatively large uncertainty. This result implies that there can be a large uncertainty in point estimators such as the posterior mean or the maximum a posteriori (MAP), the latter of which corresponds to the minimizer in a variational approach with regularization. Despite the large uncertainty in parameter estimation, the marginal posteriors of the states generally concentrate near the truth, reducing the uncertainty in state reconstruction. In particular, the more informative Gaussian prior leads to much better estimations than the uniform prior: the uncertainty in the posterior is smaller, the MAP and posterior mean have smaller errors in both state and parameter estimates, and the coverage probabilities are higher and more robust.
Results also show that the regularized posterior is robust to spatial sparsity of observations, with sparser observations leading to slightly larger uncertainties due to less information. However, due to the need for regularization to overcome illposedness, the uncertainty in the posterior of the parameters cannot be eliminated by increasing the number of observations in time. Therefore, we suggest alternative approaches, such as reparametrization of the nonlinear function according to the climatological distribution or nonparametric Bayesian inference (see, e.g., Müller and Mitra, 2013; Ghosal and Van der Vaart, 2017), to avoid illposedness.
This work shows that it is necessary to assess the wellposedness of the inverse problem of parameter estimation when reconstructing paleoclimate fields with physically motivated parametric stochastic models. In our case, the natural physical formulation of the SEBM is illposed. While climate states can be reconstructed, values of individual parameters are not strongly constrained by the observations. Regularized posteriors are a way to overcome the illposedness but retain a specific parametric form of the nonlinear function representing the climate forcings.
The paper runs synthetic simulations, so there are no data to be accessed. The MATLAB codes for the numerical simulations are available at GitHub: https://github.com/feilumath/InferSEBM.git (last access: 12 August 2019).
A1 Discretization of the SEBM
A1.1 Finiteelement representation in space
We discretize the SEBM in space by finiteelement methods (see, e.g., Alberty et al., 1999). Denote by $\mathit{\left\{}{\mathit{\varphi}}_{i}\right(\mathit{\xi}){\mathit{\}}}_{i=\mathrm{1}}^{{d}_{\mathrm{b}}}$ the finiteelement basis functions, and approximate the solution u(t,ξ) by
The coefficients ${\widehat{u}}_{i}$ are determined by the following weak Galerkin projection of the SEBM Eq. (1):
where ϕ is a continuously differentiable compactly supported test function and the integral ${\int}_{\mathrm{0}}^{t}\langle f(s,\cdot ),\mathit{\varphi}\rangle $ is an Itô integral.
For convenience, we write this Galerkin approximate system in vector notation. Denote
Taking ϕ=ϕ_{j}, $j=\mathrm{1},\mathrm{\dots},{d}_{\mathrm{b}}$ in Eq. (A2) and using the symmetry of the inner product, we obtain a stochastic integral equation for the coefficient $U\left(t\right)\in {\mathbb{R}}^{{d}_{\mathrm{b}}}$:
To simplify notation, we denote the mass and stiffness matrices by
which are symmetric, tridiagonal, positive definite matrices in ${\mathbb{R}}^{{d}_{\mathrm{b}}\times {d}_{\mathrm{b}}}$, and we denote the nonlinear term as
The above stochastic integral equation can then be written as
The mesh on the sphere and the matrices M_{0} and M_{1} are computed with R package INLA (Lindgren and Rue, 2015; Bakka et al., 2018).
A1.2 Representation of the nonlinear term
The parametric nonlinear functional G_{θ}(U(t)) is approximated using the finite elements. We approximate each spatial integration over an element triangle in $\langle {g}_{\mathit{\theta}}\left({U}_{n}^{T}\mathrm{\Phi}\right),\mathrm{\Phi}\rangle $ by the volume of the triangular pyramid whose height is the value of the nonlinear function at the center of the element triangle 𝒯_{k}, i.e.,
where ${\mathit{\xi}}_{k}^{c}$ is the center of the triangle 𝒯_{k}. In the discretized system, we assume that this approximation has a negligible error and take it as our nonlinear functional. In vector notation, it reads
with ${A}_{\mathcal{T}}=\left(\frac{\mathrm{Area}\left({\mathcal{T}}_{k}\right)}{\mathrm{3}}\right)\in {\mathbb{R}}^{{d}_{\mathrm{b}}\times {d}_{\mathrm{e}}}$ with d_{e} denoting the number of triangle elements and the matrix $\mathbf{A}=\left({\mathit{\varphi}}_{i}\left({\mathit{\xi}}_{k}^{c}\right)\right)\in {\mathbb{R}}^{{d}_{\mathrm{e}}\times {d}_{\mathrm{b}}}$, such that the function g_{θ}(AU(t)) is interpreted as an elementwise evaluation. For the nonlinear function g_{θ} in Eq. (2), we can write the above nonlinear term as
where ∘k denotes the entrywise product of the array.
A1.3 Representation of the stochastic forcing
Following Lindgren et al. (2011), the stochastic forcing f(t,ξ) is approximated by its linear finiteelement truncation,
with the stochastic processes $\mathit{\left\{}{f}_{i}\right(t),i=\mathrm{1},\mathrm{\dots},{d}_{\mathrm{b}}\mathit{\}}$ being spatially correlated and white in time. Note that for ν=0.1 and ρ>0 in the Matérn covariance Eq. (4), the process f(t,ξ) is the stationary solution of the stochastic Laplace equation
where W is a spatiotemporal white noise (Whittle, 1954, 1963). Computationally efficient approximations of the forcing process are obtained using the GMRF approximation of Lindgren et al. (2011) which generates $F\left(t\right)\equiv \left({f}_{\mathrm{1}}\left(t\right),{f}_{\mathrm{2}}\left(t\right),\mathrm{\dots},{f}_{{d}_{\mathrm{b}}}\left(t\right)\right)$ by solving Eq. (A14). That is, using the above finiteelement notation, we solve for each time t the linear system
where the random vector $\langle \mathrm{\Phi},W(t,\cdot )\rangle :=\left(\langle {\mathit{\varphi}}_{\mathrm{1}},W(t,\cdot )\rangle ,\mathrm{\dots},\langle {\mathit{\varphi}}_{{d}_{\mathrm{b}}},W(t,\cdot )\rangle \right)$ is Gaussian with mean 0 and covariance M_{0}. Solving Eq. (A15) yields
where ${\mathbf{M}}_{\mathit{\rho}}:=({\mathit{\rho}}^{\mathrm{2}}{\mathbf{M}}_{\mathrm{0}}+{\mathbf{M}}_{\mathrm{1}})$.
A1.4 Semibackward Euler time integration
Equation (A9) is integrated in time by a semibackward Euler scheme:
where U_{n} is the approximation of U(t_{n}) with t_{n}=nΔt, and {F_{n}} is a sequence of iid random vectors with distribution $\mathcal{N}\left(\mathrm{0},{\mathit{\sigma}}_{\mathrm{f}}^{\mathrm{2}}{\mathbf{M}}_{\mathit{\rho}}^{\mathrm{1}}{\mathbf{M}}_{\mathrm{0}}{\mathbf{M}}_{\mathit{\rho}}^{\mathrm{1}}\right)$, with the matrix M_{Δt} denoting
A1.5 Efficient generation of the Gaussian field
It follows from Eq. (A15) that M_{0}F_{n} is Gaussian with mean zero and covariance ${\mathbf{M}}_{\mathrm{0}}{\mathbf{M}}_{\mathit{\rho}}^{\mathrm{1}}{\mathbf{M}}_{\mathrm{0}}{\mathbf{M}}_{\mathit{\rho}}^{\mathrm{1}}{\mathbf{M}}_{\mathrm{0}}$. Note that while M_{ρ} is a sparse matrix, its inverse matrix ${\mathbf{M}}_{\mathit{\rho}}^{\mathrm{1}}$ is not. To efficiently use the sparseness of M_{ρ}, following Lindgren et al. (2011), we approximate M_{0} by ${\widehat{\mathbf{M}}}_{\mathrm{0}}:=\mathrm{diag}(\langle {\mathit{\varphi}}_{i},\mathrm{1}\rangle )$ and compute the noise M_{0}F_{n} by ${\mathbf{C}}^{\mathrm{1}}\mathcal{N}(\mathrm{0},{I}_{d})$, where C is the Cholesky factorization of the inverse of the covariance matrix (called the precision matrix) ${\widehat{\mathbf{M}}}_{\mathrm{0}}^{\mathrm{1}}{\mathbf{M}}_{\mathit{\kappa}}{\widehat{\mathbf{M}}}_{\mathrm{0}}^{\mathrm{1}}{\mathbf{M}}_{\mathit{\kappa}}{\widehat{\mathbf{M}}}_{\mathrm{0}}^{\mathrm{1}}$.The precision matrix is a sparse representation of the inverse of the covariance. Therefore, the matrix C is also sparse and the noise sequence can be efficiently generated.
In summary, we can write the discretized SEBM in the form
where the deterministic function μ_{θ}(⋅) is given by
with ${G}_{\mathit{\theta},k}\left({U}_{n}\right):=\mathrm{\Delta}t{\mathbf{M}}_{\mathrm{\Delta}t}^{\mathrm{1}}{A}_{\mathcal{T}}(\mathbf{A}{U}_{n}{)}^{\circ k}$, and {W_{n}} is a sequence of iid Gaussian noise with mean 0 and covariance R:
A2 SMC with optimal importance sampling
SMC methods approximate the target density ${p}_{\mathit{\theta}}\left({u}_{\mathrm{1}:N}\right{y}_{\mathrm{1}:N})$ sequentially by weighted random samples called particles (hereafter we drop the subindex θ to simplify notation):
with ${\sum}_{m=\mathrm{1}}^{M}{w}_{n}^{m}=\mathrm{1}$. These weighted samples are drawn sequentially by importance sampling based on the recurrent formation
More precisely, suppose that at time n, we have weighted samples $\mathit{\{}{U}_{\mathrm{1}:n\mathrm{1}}^{m},{w}_{n\mathrm{1}}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$. One first draws a sample ${U}_{n}^{m}$ from an easytosample importance density $q\left({u}_{n}\right{y}_{n},{U}_{n\mathrm{1}}^{m})$ that approximates the “incremental density” which is proportional to $p\left({y}_{\mathrm{n}}{u}_{\mathrm{n}}\right)p\left({u}_{\mathrm{n}}\right{U}_{n\mathrm{1}}^{m})$ for each $m=\mathrm{1},\mathrm{\dots},M$ and computes incremental weights
which account for the discrepancy between the two densities. One then assigns normalized weights $\mathit{\{}{w}_{n}^{m}\phantom{\rule{0.125em}{0ex}}\propto \phantom{\rule{0.125em}{0ex}}{w}_{n\mathrm{1}}^{m}{\mathit{\alpha}}_{n}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$ to the concatenated sample trajectories $\mathit{\{}{U}_{\mathrm{1}:n}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$.
A clear drawback of the above procedure is that all but one of the weights $\mathit{\left\{}{w}_{n}^{m}\mathit{\right\}}$ will become close to zero as the number of iterations increases, due to the multiplication and normalization operations. To avoid this, one replaces the unevenly weighted samples $\mathit{\left\{}\right({U}_{n\mathrm{1}}^{m},{w}_{n\mathrm{1}}^{m}\left)\mathit{\right\}}$ with uniformly weighted samples from the approximate density ${\widehat{p}}_{\mathit{\theta}}\left({u}_{n\mathrm{1}}\right{y}_{\mathrm{1}:N\mathrm{1}})$. This is the wellknown resampling technique. In summary, the above operations are carried out as follows:
 i.
draw random indices $\mathit{\{}{A}_{n\mathrm{1}}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$ according to the discrete probability distribution $\mathbb{F}(\cdot {w}_{n\mathrm{1}}^{\mathrm{1}:M})$ on the set $\mathit{\{}\mathrm{1},\mathrm{\dots},M\mathit{\}}$, which is defined as
$$\begin{array}{}\text{(A25)}& \mathbb{F}({A}_{n\mathrm{1}}=k{w}_{n\mathrm{1}}^{\mathrm{1}:M})={w}_{n\mathrm{1}}^{k},\mathrm{for}k=\mathrm{1},\mathrm{\dots},M;\end{array}$$  ii.
for each m, draw a sample ${U}_{n}^{m}$ from $q\left({u}_{n}\right{y}_{n},{U}_{n\mathrm{1}}^{{A}_{n\mathrm{1}}^{m}})$ and set ${U}_{\mathrm{1}:n}^{m}:=({U}_{n\mathrm{1}}^{{A}_{n\mathrm{1}}^{m}},{U}_{n}^{m})$;
 iii.
compute and normalize the weights
$$\begin{array}{}\text{(A26)}& \begin{array}{rl}{\mathit{\alpha}}_{n}^{m}:={\mathit{\alpha}}_{n}\left({U}_{\mathrm{1}:n}^{m}\right)& ={\displaystyle \frac{p\left({U}_{n}^{m}\right{U}_{n\mathrm{1}}^{{A}_{n\mathrm{1}}^{m}}\left)p\right({y}_{n}\left{U}_{n}^{m}\right)}{q\left({U}_{n}^{m}\right{y}_{n},{U}_{n\mathrm{1}}^{{A}_{n\mathrm{1}}^{m}})}},\\ & {w}_{n}^{m}:={\displaystyle \frac{{\mathit{\alpha}}_{n}^{m}}{{\sum}_{k=\mathrm{1}}^{M}{\mathit{\alpha}}_{\mathrm{n}}^{k}}}.\end{array}\end{array}$$
The above SMC sampling procedure is called sequential importance sampling with resampling (SIR) (see, e.g., Doucet and Johansen, 2011) and is summarized in Algorithm 1.
A2.1 Optimal importance sampling
Note that the conditional transition density of the states ${p}_{\mathit{\theta}}\left({u}_{n+\mathrm{1}}\right{u}_{n})$ in Eq. (7) is Gaussian and the observation model in Eq. (8) is linear and Gaussian. These facts allow for a Gaussian optimal importance density $q\left({u}_{n}\right{y}_{n},{U}_{n\mathrm{1}}^{m})$ that is proportional to $p\left({y}_{\mathrm{n}}{u}_{\mathrm{n}}\right)p\left({u}_{\mathrm{n}}\right{U}_{n\mathrm{1}}^{m})$ for each $m=\mathrm{1},\mathrm{\dots},M$:
with the mean ${\mathit{\mu}}_{n}^{m}$ and the covariance Σ given by
A2.2 Drawbacks of SMC
While the resampling technique prevents ${w}_{n}^{m}$ from being degenerate at each current time n, SMC algorithms suffer from the degeneracy (or particle depletion) problem: the marginal distribution $\widehat{p}\left({u}_{n}\right\left({y}_{\mathrm{1}:N}\right))$ becomes concentrated on a single particle as N−n increases because each resampling step reduces the number of distinct particles of u_{n}. As a result, the estimate of the joint density $p\left({u}_{\mathrm{1}:N}\right{y}_{\mathrm{1}:N})$ of the trajectory deteriorates as time N increases.
A3 Particle Gibbs and PGAS
The framework of particle MCMC introduced in Andrieu et al. (2010) is a systematic combination of SMC and MCMC methods, exploiting the strengths of both techniques. Among the various particle MCMC methods, we focus on the particle Gibbs sampler (PG) that uses a novel conditional SMC update (Andrieu et al., 2010), as well as its variant, the particle Gibbs with ancestor sampling (PGAS) sampler (Lindsten et al., 2014), because they are best fit for sampling our joint parameter and state posterior.
The PG and PGAS samplers use a conditional SMC update step to realize the transition between two steps of the Markov chain while ensuring that the target distribution will be the stationary distribution of the Markov chain. The basic procedure of a PG sampler is as follows.

Initialization: draw θ(1) from the prior distribution p(θ). Run an SMC algorithm to generate weighted samples $\mathit{\{}{U}_{\mathrm{1}:N}^{m},{w}_{N}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$ for ${p}_{\mathit{\theta}\left(\mathrm{1}\right)}\left({u}_{\mathrm{1}:N}\right{y}_{\mathrm{1}:N})$ and draw U_{1:N}(1) from these weighted samples.

Markov chain iteration: for $l=\mathrm{1},\mathrm{\cdots},L\mathrm{1}$,

sample θ(l+1) from the marginal posterior $p\left(\mathit{\theta}\right{y}_{\mathrm{1}:N},{U}_{\mathrm{1}:N}\left(l\right))$ given by Eq. (14);

run a conditional SMC algorithm, conditioned on U_{1:N}(l), which is called the reference trajectory. That is, in the SMC algorithm, the Mth particle is required to move along the reference trajectory by setting ${U}_{\mathrm{n}}^{M}={U}_{\mathrm{n}}\left(l\right)$. Draw other samples from the importance density, and normalize the weights and resample all the particles as usual. This leads to weighted samples $\mathit{\{}{U}_{\mathrm{1}:N}^{m},{w}_{N}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$ with ${U}_{\mathrm{1}:N}^{M}={U}_{\mathrm{1}:N}\left(l\right)$; and

draw ${U}_{\mathrm{1}:N}(l+\mathrm{1})$ from the above weighted samples.


Return the Markov chain $\mathit{\left\{}\mathit{\theta}\right(l),{U}_{\mathrm{1}:N}(l){\mathit{\}}}_{l=\mathrm{1}}^{L}$.
The conditional SMC algorithm is the core of PG samplers. It retains the reference path throughout the resampling steps by deterministically setting ${U}_{\mathrm{1}:N}^{M}={U}_{\mathrm{1}:N}\left(l\right)$ and ${A}_{n}^{M}=M$ for all n while sampling the remaining M−1 particles according to a standard SMC algorithm. The reference path interacts with the other paths by contributing a weight ${w}_{n}^{M}$. This is the key to ensuring that the PG Markov chain converges to the target distribution. A potential risk of the PG sampler is that it yields a poorly mixed Markov chain, because the reference trajectory tends to dominate the SMC ensemble trajectories.
The PGAS sampler increases the mixing of the chain by connecting the reference path to the history of other particles by assigning an ancestor to the reference particle at each time. This is accomplished by drawing a sample for the ancestor index ${A}_{n\mathrm{1}}^{M}$ of the reference particle, which is referred to as ancestor sampling. The distribution of the index ${A}_{n\mathrm{1}}^{M}$ is determined by the likelihood of connecting U_{n}(l) to the particles $\mathit{\{}{U}_{n\mathrm{1}}^{m}{\mathit{\}}}_{m=\mathrm{1}}^{M}$, in other words, according to weights
The above weight ${\stackrel{\mathrm{\u0303}}{\mathit{\alpha}}}_{n\mathrm{1}n}^{m}$ can be seen as a posterior probability, where the importance weight ${w}_{n\mathrm{1}}^{m}$ is the prior probability of the particle ${U}_{n\mathrm{1}}^{m}$ and the product ${p}_{\mathit{\theta}(l+\mathrm{1})}\left({U}_{\mathrm{n}}\right(l\left)\right{U}_{n\mathrm{1}}^{m}\left)p\right({y}_{\mathrm{n}}\left{U}_{\mathrm{n}}\right(l\left)\right)$ is the likelihood that U_{n}(l) originates from ${U}_{n\mathrm{1}}^{m}$ conditional on observation y_{n}. In short, the PGAS sampler assigns the reference particle U_{n}(l) an ancestor ${A}_{n\mathrm{1}}^{M}$ that is drawn from the distribution $\mathbb{F}({A}_{n\mathrm{1}}^{M}=k{\stackrel{\mathrm{\u0303}}{w}}_{n\mathrm{1}n}^{\mathrm{1}:M})={\stackrel{\mathrm{\u0303}}{w}}_{n\mathrm{1}n}^{k}.$
The above conditional SMC with ancestor sampling within PGAS is summarized in Algorithm 2.
FL, NW, and AM formulated the project and designed the experiments. NW and AM derived the SEBM, and FL and NW carried out the experiments. FL developed the model code and performed the simulations with contributions from NW. FL prepared the manuscript with contributions from all the coauthors.
The authors declare that they have no conflict of interest.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The authors thank Colin Grudzien and the other reviewer for helpful comments. This research started in a working group supported by the Statistical and Applied Mathematical Sciences Institute (SAMSI). Fei Lu thanks Peter Jan van Leeuwen, Kayo Ide, Mauro Maggioni, Xuemin Tu, and Wenjun Ma for helpful discussions. Fei Lu is supported by the National Science Foundation under grant DMS1821211. Nils Weitzel thanks Andreas Hense and Douglas Nychka for inspiring discussions. Nils Weitzel was supported by the German Federal Ministry of Education and Research (BMBF) through the Palmod project (FKZ: 01LP1509D). Nils Weitzel thanks the German Research Foundation (code RE39942/1) for funding. Adam H. Monahan acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC), and thanks SAMSI for hosting him in the autumn of 2017.
This research has been supported by the National Science Foundation, USA (grant no. DMS1821211), the German Federal Ministry of Education and Research (BMBF) (grant no. FKZ: 01LP1509D), the German Research Foundation (grant no. RE39942/1), and the Natural Sciences and Engineering Research Council of Canada (NSERC).
This paper was edited by Stefano Pierini and reviewed by Colin Grudzien and one anonymous referee.
Alberty, J., Carstensen, C., and Funken, S. A.: Remarks around 50 lines of Matlab: short finite element implementation, Numer. Algorithms, 20, 117–137, 1999. a
Andrieu, C., Doucet, A., and Holenstein, R.: Particle Markov chain Monte Carlo methods, J. R. Stat. Soc. B, 72, 269–342, 2010. a, b, c, d, e
Annan, J., Hargreaves, J., Edwards, N., and Marsh, R.: Parameter estimation in an intermediate complexity Earth System Model using an ensemble Kalman filter, Ocean Model., 8, 135–154, 2005. a
Apte, A., Hairer, M., Stuart, A., and Voss, J.: Sampling the Posterior: An Approach to NonGaussian Data Assimilation, Physica D, 230, 50–64, 2007. a
Bakka, H., Rue, H., Fuglstad, G. A., Riebler, A., Bolin, D., Illian, J., . and Lindgren, F.: Spatial modeling with R‐INLA: A review, Wiley Interdisciplinary Reviews: Computational Statistics, 10, e1443, 2018. a
Branicki, M. and Majda, A. J.: Fundamental limitations of polynomial chaos for uncertainty quantification in systems with intermittent instabilities, Commun. Math. Sci., 11, 55–103, 2013. a
Cappé, O., Moulines, E., and Ryden, T.: Inference in Hidden Markov Models (Springer Series in Statistics), SpringerVerlag, New York, NY, USA, 2005. a
Carrassi, A., Bocquet, M., Bertino, L., and Evensen, G.: Data assimilation in the geosciences: An overview of methods, issues, and perspectives, WIRES Clim. Change, 9, e535, https://doi.org/10.1002/wcc.535, 2018. a, b
Chekroun, M. D. and Kondrashov, D.: Dataadaptive harmonic spectra and multilayer StuartLandau models, Chaos, 27, 093110, https://doi.org/10.1063/1.4989400, 2017. a
Chorin, A. J. and Tu, X.: Implicit sampling for particle filters, P. Natl. Acad. Sci. USA, 106, 17249–17254, 2009. a
Chorin, A. J. and Lu, F.: Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics, P. Natl. Acad. Sci. USA, 112, 9804–9809, 2015. a
Chorin, A. J., Lu, F., Miller, R. M., Morzfeld, M., and Tu, X.: Sampling, feasibility, and priors in data assimilation, Discrete Contin. Dyn. Syst. A, 36, 4227–4246, 2016. a
Cowles, M. K. and Carlin, B. P.: Markov chain Monte Carlo convergence diagnostics: a comparative review, J. Am. Stat. Assoc., 91, 883–904, 1996. a
Cui, T., Marzouk, Y. M., and Willcox, K. E.: Datadriven model reduction for the Bayesian solution of inverse problems, Int. J. Numer. Methods Fluids, 102, 966–990, 2015. a
Doucet, A. and Johansen, A. M.: A tutorial on particle filtering and smoothing: fifteen years later, in: Oxford Handbook of Nonlinear Filtering, 656–704, 2011. a, b, c
Fang, M. and Li, X.: Paleoclimate Data Assimilation: Its Motivation, Progress and Prospects, Sci. China Earth Sci., 59, 1817–1826, https://doi.org/10.1007/s1143001554326, 2016. a
Fanning, A. F. and Weaver, A. J.: An atmospheric energymoisture balance model: Climatology, interpentadal climate change, and coupling to an ocean general circulation model, J. Geophys. Res.Atmos., 101, 15111–15128, 1996. a, b, c
Farchi, A. and Bocquet, M.: Review article: Comparison of local particle filters and new implementations, Nonlin. Processes Geophys., 25, 765–807, https://doi.org/10.5194/npg257652018, 2018. a
Ghosal, S. and Van der Vaart, A.: Fundamentals of nonparametric Bayesian inference, vol. 44, Cambridge University Press, 2017. a, b
Goosse, H., Crespin, E., de Montety, A., Mann, M. E., Renssen, H., and Timmermann, A.: Reconstructing Surface Temperature Changes over the Past 600 Years Using Climate Model Simulations with Data Assimilation, J. Geophys. Res., 115, https://doi.org/10.1029/2009JD012737, 2010. a
Guillot, D., Rajaratnam, B., and EmileGeay, J.: Statistical Paleoclimate Reconstructions via Markov Random Fields, Ann. Appl. Stat., 9, 324–352, 2015. a
Hairer, M., Stuart, A. M., and Voss, J.: Analysis of SPDEs Arising in Path Sampling Part II: The Nonlinear Case, Ann. Appl. Probab., 17, 1657–1706, https://doi.org/10.1214/07AAP441, 2007. a
Haslett, J., Whiley, M., Bhattacharya, S., SalterTownshend, M., Wilson, S. P., Allen, J., Huntley, B., and Mitchell, F.: Bayesian palaeoclimate reconstruction, J. R. Stat. Soc. A, 169, 395–438, 2006. a
Jiang, S. W. and Harlim, J.: Parameter estimation with datadriven nonparametric likelihood functions, arXiv preprint arXiv:1804.03272, 2018. a
Kantas, N., Doucet, A., Singh, S. S., and Maciejowski, J. M.: An Overview of Sequential Monte Carlo Methods for Parameter Estimation in General StateSpace Models, Proceedings of the IFAC Symposium on System Identification (SYSID), SaintMalo, France, 2009. a, b
Khouider, B., Majda, A. J., and Katsoulakis, M. A.: Coarsegrained stochastic models for tropical convection and climate, P. Natl. Acad. Sci. USA, 100, 11941–11946, 2003. a
Law, K., Stuart, A., and Zygalakis, K.: Data Assimilation: A Mathematical Introduction, Springer, 2015. a, b
Lindgren, F. and Rue, H.: Bayesian Spatial Modelling with RINLA, J. Stat. Softw., 63, 1–25, 2015. a
Lindgren, F., Rue, H., and Lindström, J.: An Explicit Link between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach: Link between Gaussian Fields and Gaussian Markov Random Fields, J. R. Stat. Soc. B, 73, 423–498, 2011. a, b, c, d, e
Lindsten, F., Jordan, M. I., and Schön, T. B.: Particle Gibbs with ancestor sampling, J. Mach. Learn. Res., 15, 2145–2184, 2014. a, b, c, d
Liu, J.: Monte Carlo Strategies in Scientific Computing, Springer, 2001. a
Llopis, F. P., Kantas, N., Beskos, A., and Jasra, A.: Particle Filtering for Stochastic Navier–Stokes Signal Observed with Linear Additive Noise, SIAM J. Sci. Comput., 40, A1544–A1565, 2018. a
Lu, F., Morzfeld, M., Tu, X., and Chorin, A. J.: Limitations of polynomial chaos expansions in the Bayesian solution of inverse problems, J. Comput. Phys., 282, 138–147, 2015. a
Lu, F., Tu, X., and Chorin, A. J.: Accounting for Model Error from Unresolved Scales in Ensemble Kalman Filters by Stochastic Parameterization, Mon. Weather Rev., 145, 3709–3723, 2017. a
Marzouk, Y. M. and Najm, H. N.: Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems, J. Comput. Phys., 228, 1862–1902, 2009. a
Maslowski, B. and Tudor, C. A.: Drift Parameter Estimation for InfiniteDimensional Fractional OrnsteinUhlenbeck Process, B. Sci. Math., 137, 880–901, 2013. a
Morzfeld, M., Tu, X., Atkins, E., and Chorin, A. J.: A random map implementation of implicit filters, J. Comput. Phys., 231, 2049–2066, 2012. a
Müller, P. and Mitra, R.: Bayesian nonparametric inference–why and how, Bayesian analysis, 8, 2013. a, b
O'Leary, D. P.: NearOptimal Parameters for Tikhonov and Other Regularization Methods, SIAM J. Sci. Comput., 23, 1161–1171, 2001. a
Parnell, A. C., Haslett, J., Sweeney, J., Doan, T. K., Allen, J. R., and Huntley, B.: Joint palaeoclimate reconstruction from pollen data via forward models and climate histories, Quaternary Sci. Rev., 151, 111–126, 2016. a
Penny, S. G. and Miyoshi, T.: A local particle filter for highdimensional geophysical systems, Nonlin. Processes Geophys., 23, 391–405, https://doi.org/10.5194/npg233912016, 2016. a
Poterjoy, J.: A Localized Particle Filter for HighDimensional Nonlinear Systems, Mon. Weather Rev., 144, 59–76, 2016. a
Prakasa Rao, B. L. S.: Statistical Inference for Stochastic Partial Differential Equations, in: Institute of Mathematical Statistics Lecture Notes – Monograph Series, Institute of Mathematical Statistics, Beachwood, OH, 47–70, 2001. a
Rypdal, K., Rypdal, M., and Fredriksen, H.B.: Spatiotemporal LongRange Persistence in Earth's Temperature Field: Analysis of StochasticDiffusive Energy Balance Models, J. Climate, 28, 8379–8395, 2015. a
Sigrist, F., Künsch, H. R., and Stahel, W. A.: Stochastic Partial Differential Equation Based Modelling of Large SpaceTime Data Sets, J. R. Stat. Soc. B, 77, 3–33, 2015. a
Snyder, C. W.: Evolution of global temperature over the past two million years, Nature, 538, 226–228, 2016. a
Steiger, N. J., Hakim, G. J., Steig, E. J., Battisti, D. S., and Roe, G. H.: Assimilation of TimeAveraged Pseudoproxies for Climate Reconstruction, J. Climate, 27, 426–441, 2014. a
Tingley, M. P. and Huybers, P.: A Bayesian Algorithm for Reconstructing Climate Anomalies in Space and Time. Part I: Development and Applications to Paleoclimate Reconstruction Problems, J. Climate, 23, 2759–2781, 2010. a
Tingley, M. P., Craigmile, P. F., Haran, M., Li, B., Mannshardt, E., and Rajaratnam, B.: Piecing together the past: statistical insights into paleoclimatic reconstructions, Quaternary Sci. Rev., 35, 1–22, 2012. a
TolwinskiWard, S. E., Evans, M. N., Hughes, M. K., and Anchukaitis, K. J.: An efficient forward model of the climate controls on interannual variation in treering width, Clim. Dynam., 36, 2419–2439, 2011. a
Trenberth, K. E., Fasullo, J. T., and Kiehl, J.: Earth's Global Energy Budget, B. Am. Meteorol. Soc., 90, 311–324, 2009. a, b
Van der Vaart, A. W.: Asymptotic statistics, vol. 3, Cambridge university press, 2000. a
VetraCarvalho, S., van Leeuwen, P. J., Nerger, L., Barth, A., Altaf, M. U., Brasseur, P., Kirchgessner, P., and Beckers, J.M.: StateoftheArt Stochastic Data Assimilation Methods for HighDimensional NonGaussian Problems, Tellus A, 70, 1–43, 2018. a, b
Weaver, A. J., Eby, M., Wiebe, E. C., Bitz, C. M., Duffy, P. B., Ewen, T. L., Fanning, A. F., Holland, M. M., MacFadyen, A., Matthews, H. D., Meissner, K. J., Saenko, O., Schmittner, A., Wang, H., and Yoshimori, M.: The UVic Earth System Climate Model: Model description, climatology, and applications to past, present and future climates, Atmos. Ocean., 39, 361–428, 2001. a
Werner, J. P., Luterbacher, J., and Smerdon, J. E.: A Pseudoproxy Evaluation of Bayesian Hierarchical Modeling and Canonical Correlation Analysis for Climate Field Reconstructions over Europe, J. Climate, 26, 851–867, 2013. a
Whittle, P.: On stationary processes in the plane, Biometrika, 41, 434–449, 1954. a
Whittle, P.: Stochastic processes in several dimensions, B. Int. Statist. Inst., 40, 974–994, 1963. a
 Abstract
 Introduction
 Statespace model formulation
 Illposedness and regularized posteriors
 Bayesian inference with regularized posteriors
 Discussion
 Conclusions and future work
 Data availability
 Appendix A: Technical details of the estimation procedure
 Author contributions
 Competing interests
 Disclaimer
 Acknowledgements
 Financial support
 Review statement
 References
 Abstract
 Introduction
 Statespace model formulation
 Illposedness and regularized posteriors
 Bayesian inference with regularized posteriors
 Discussion
 Conclusions and future work
 Data availability
 Appendix A: Technical details of the estimation procedure
 Author contributions
 Competing interests
 Disclaimer
 Acknowledgements
 Financial support
 Review statement
 References