A potential implicit particle method for high-dimensional systems

This paper presents a particle method designed for high-dimensional state estimation. Instead of weighing random forecasts by their distance to given observations, the method samples an ensemble of particles around an optimal solution based on the observations (i.e., it is implicit). It differs from other implicit methods because it includes the state at the previous assimilation time as part of the optimal solution (i.e., it is a lag-1 smoother). This is accomplished through the use of a mixture model for the background distribution of the previous state. In a high-dimensional, linear, Gaussian example, the mixture-based implicit particle smoother does not collapse. Furthermore, using only a small number of particles, the implicit approach is able to detect transitions in two nonlinear, multi-dimensional generalizations of a double-well. Adding a step that trains the sampled distribution to the target distribution prevents collapse during the transitions, which are strongly nonlinear events. To produce similar estimates, other approaches require many more particles.


Introduction
Most particle filters perform poorly in very high dimensions.Their ensembles collapse onto a single particle unless the ensemble size grows exponentially with the system dimension.This is a problem of sample impoverishment, and is a manifestation of what Bellman (1957) calls "the curse of dimensionality".
The bootstrap particle filter (BPF; Gordon et al., 1993) is a straightforward method that weighs random solutions of the dynamical model based on their proximity to observations.Even if the model and observation functions are linear and have Gaussian errors, BPF suffers from ensemble collapse as the system dimension increases (Bengtsson et al., 2008;Bickel et al., 2008;Snyder et al., 2008).There is also evidence (Snyder, 2012) that this result is more generally applicable and all particle filters suffer a similar fate.
Both BPF and EnKF begin by generating an ensemble of random model forecasts that are independent of the observations.The resulting estimates are linear combinations of the forecasts, where the coefficients depend on the likelihood that the forecast produced the observations.This paper refers to such methods as explicit, in analogy with the terminology from the numerical solution of differential equations.
Explicit data assimilation methods are prone to errors when the forecast distribution is nearly singular with the distribution conditioned on the observations, also called the target or posterior.For example, this occurs when the model has multiple isolated attracting states and none of the forecasts are in the basin of attraction of the true state (Miller et al., 1999;Evensen and van Leeuwen, 2000).The singularity can be significantly reduced, however, if the stochastic model has an invariant measure (a climatology).This point is the basis of the mean field filter, maximum entropy filter, and related techniques (Eyink and Restrepo, 2000;Kim et al., 2003;Eyink et al., 2004;Eyink and Kim, 2006), which form a parametrized transformation of the model climatology with the same mean, and possibly covariance, as the forecast samples.Nevertheless, there are many stochastic processes that do not have invariant measures, including the ubiquitous Wiener process and a related example considered later in this paper.

B. Weir et al.: A potential implicit particle method for high-dimensional systems
Implicit methods, unlike explicit ones, skip the construction of a forecast distribution and work directly with the target.Their goal is to sample an "optimal" importance distribution whose difference with the target is minimal (Doucet et al., 2000).This approach has a strong theoretical basis and is effective in a variety of contexts, particularly in low to moderate dimensional problems from the geosciences (Chorin and Tu, 2009;Chorin et al., 2010;Morzfeld et al., 2012;Morzfeld and Chorin, 2012;Weir et al., 2013;Atkins et al., 2013).In these applications, the implicit methods require a factor of O(10) to O(100) fewer particles than BPF and EnKF to compute estimates of comparable accuracy (Morzfeld and Chorin, 2012;Weir et al., 2013).
The implicit particle method introduced in this paper avoids ensemble collapse in high dimensions in three ways.First, it forms a mixture model approximation of the background distribution of the state at the previous assimilation time.Second, it uses numerical optimization to find the most probable model solution given the observations and samples around that solution.Third, if the target/posterior distribution is strongly non-Gaussian, it further improves the results by refining the sampled importance distribution to better approximate the target.
While it is possible to improve the estimates of BPF and EnKF significantly, e.g., using Markov chain Monte Carlo resampling methods (Weare, 2009), running-in-place (Kalnay and Yang, 2010), the finite-size EnKF (Bocquet, 2011), and iterative EnKF (Bocquet and Sakov, 2012), only their simplest forms are considered here.Some particle filters (van Leeuwen, 2010;van Leeuwen, 2011;Ades and van Leeuwen, 2013) do, in fact, perform well in high dimensions.Yet it remains unclear if these approaches satisfy the tail decay properties necessary for convergence (Geweke, 1989).Surveys of many other assimilation techniques can be found in the reviews of van Leeuwen (2009) and Bocquet et al. (2010).
The remainder of the paper proceeds as follows.The state estimation problem is introduced next.After that, Sect. 3 describes the mixture-based implicit particle smoother (MIPS) in a general form.This method is applied to a highdimensional example with a linear model and Gaussian statistics in Sect. 4 and to multi-dimensional generalizations of the double-well problem in Sect. 5.Both sections include comparisons with BPF, EnKF, and the implicit particle filter.The final section summarizes the results and conclusions from these examples.Throughout, vectors are written in bold italics, matrices in regular bold, random variables in capital letters and their realizations in lowercase letters.

State estimation
The system state is a stochastic process {X m : m = 0, 1, . ..} in N x dimensions that satisfies the equation where M is a discrete-time dynamical model, {E m } is a (dimensionless) standard normal/Gaussian process, which represents model error, Q is the (dimensional) error covariance matrix and The model dependence on the new state X m+1 is included to account for implicit numerical time discretizations (Kloeden and Platen, 1999).In general, the initial condition is imprecisely known, and the value of its probability density function (pdf) at a realization x 0 is denoted p(x 0 ).
The stochastic model is supplemented with noisy observations at a subsequence {t m(n) : n = 1, 2, . ..} of the model times such that for a given function H, (dimensionless) observation error process {D n } and (dimensional) covariance matrix R. The goal of data assimilation is to efficiently sample from the distribution of model solutions conditioned on a sequence of realizations {y 1 , . . ., y k } of the observations (2) at successive times.The pdf of this stochastic process is denoted p x 0:m(k) | y 1:k , which uses the shorthand z i:j for a given sequence {z i , . . ., z j }.
It is possible to assimilate each new observation and discard it afterward because the target pdf satisfies the recursion relationship This follows from an application of Bayes' theorem, the Markov property of the state, the conditional independence of the observation errors, and a second application of Bayes' theorem.Using the convention that m(0) = 0, and y 1:0 = ∅, Eq. (3) applies if k = 0 as well.
The model error E m and observation error D n need not be Gaussian in general.However, this paper assumes the a priori application of an anamorphosis transformation (Bertino et al., 2003;Weir et al., 2013) to the state and observation so that the corresponding errors are Gaussian random variables.

The effective dimension
The state of nearly every geophysical model is a collection of variables, e.g., velocity, pressure and temperature, evaluated at each point of a grid.Since the number of grid points is often O(10 6 ) or greater, the state dimension is high as well.Fortunately, the effective dimension (Bickel et al., 2008) of the problem is usually much smaller.For example, the model can have a low-dimensional attractor, or states and observations separated by large distances can have negligible correlations.Dimensional reduction takes advantage of the smaller effective dimension by projecting the problem onto the effective subspace.There are a number of different techniques, including dynamically orthogonal decomposition (Sondergaard and Lermusiaux, 2013a, b), localization (many variations exist, but Ott et al., 2004, is (Morzfeld and Chorin, 2012).
The second fundamental assumption of this paper is the a priori application of any possible dimensional reduction, and hence that the eigenvalues of the model and observation error covariance matrices are bounded away from zero.Although the reduced model can have many possible forms, the lowest frequency mode of a climate model is quite often a double-well, i.e., a nonlinear model with two stable fixed points (Majda et al., 2003;Kravtsov et al., 2005).Under the influence of stochastic perturbations, its solutions transition periodically between these two points.It is perhaps the simplest energy balance model capable of reproducing the transitions characteristic of global temperature records (Sutera, 1981).The models considered below combine double-wells and linear maps to extend this scenario to problems in multiple dimensions with multiple attracting states.

The mixture-based implicit particle smoother
The assimilation technique to follow is a modification of the implicit particle filter (IPF) introduced by Chorin and Tu (2009) and extended to parameter estimation by Weir et al. (2013).In the latter, the method is continued sequentially by constructing a kernel density estimate (Silverman, 1986) of the background distribution of the model parameters.In this paper, the previous state x m(k) plays the role of the model parameters.Although it is successful in examples where EnKF fails, the kernel-based implicit approach requires a relatively large number of particles, O(1000), to avoid collapse in a O(10) dimensional sample space.One possibility is that this requirement is primarily due to the deficiencies of the kernel density estimate.
As an alternative to kernel density estimates, Sondergaard and Lermusiaux (2013a, b), referred to as SL13 from now on, suggest using a mixture model (McLachlan and Peel, 2001).The mixture-based implicit particle smoother (MIPS) does exactly that.It differs from the approach of SL13 in two ways: it constructs a mixture approximation of the background of the previous state x m(k) rather than the next state x m(k+1) , and it uses optimization to find probable model solutions rather than using an analysis step based on the Kalman filter.
Figure 1 is a graphical comparison of a one-component mixture model and kernel density estimate of a standard normal density.Even in two dimensions, the kernel density estimate requires O(1000) samples to have any visual similarity with the true density.The mixture model approximation with O(100) samples, on the other hand, is comparable to the true density.Both estimates are quite poor with 10 samples, and their errors only increase as the dimension grows (McLachlan and Peel, 2001;Silverman, 1986).Given just a handful of samples in very many dimensions, it is thus unlikely that any representation of the true density is very accurate.In this case, it is often appropriate to use a one-component mixture model, since the statistical evidence against the Gaussianity of the true distribution is minimal (e.g., the multivariate normality test of Mardia, 1974).

Gaussian mixture models
The assimilation of the (k + 1)-th observation begins with an ensemble of N p particles resulting from the k-th assimilation: Given these samples, one may compute an approximation that is a mixture of N m Gaussian components, where the weight α j , mean µ j and covariance B j of the components are all estimated from the samples.
Here, the only assumption on B j is that it is symmetric and positive-semidefinite.In the case that B j is not positive definite, its inverse is taken as the Moore-Penrose pseudoinverse (Moore, 1920;Penrose, 1951) and its determinant as the product of its non-zero eigenvalues.If B j is an N x × N x matrix of all zeros, then N (µ j , B j ) denotes the Dirac delta function at µ j .
Following SL13, MIPS uses the expectation-maximization (EM) algorithm (Dempster et al., 1977;McLachlan and Krishnan, 2008) to find the maximum likelihood estimate (ML/MLE) of α j , µ j , and B j .At iteration n, the EM update is computed in two steps: T .This is one of many possible choices of clustering methods (see Frei and Künsch, 2013, for an example of an alternative).The number in parentheses is the sample size.The kernel density estimates are computed using the optimal bandwidth and have the same first two moments as the samples (Silverman, 1986).

The number of components
What remains is an approach for specifying the number N m of components in the mixture.At one extreme, where the covariance B has the normalization 1/N p because the EM algorithm finds the MLE.This is a completely parametric representation of the background distribution typical of ensemble implementations of the Kalman filter.At the other extreme, and the densities N (µ j , B j ) are thus Dirac delta functions.This is the traditional particle filter approach, which represents the background as a sum of delta functions, making no parametric assumptions.For simplicity, this paper assumes that N m = 1 until N p is so large that there is no ensemble collapse.After this point, it is safe to take N m = N p .Given a small number of particles in very high dimensions, the method therefore relies upon simplifying parametric assumptions.On the other hand, in the limit as N p → ∞, it maintains the convergence properties of particle filters.An approximation for the number of particles N * p at which to make the transition is derived for linear and Gaussian problems in the following section.The appropriate value of N * p for nonlinear and non-Gaussian problems is not immediately obvious, but a reasonable choice usually can be determined from numerical experimentation.While it is not considered here, picking N m based on the Akaike or Bayes information criteria (SL13; Konishi and Kitagawa, 2008) may have the same convergence properties and be more efficient for finite values of N p .

The target density
Given a mixture model approximation to the background, there is a corresponding approximation to the target.Marginalizing over x 0:m(k)−1 and using the conditional independence of the state and observation, the target is (5) By definition of the model (1) and observation ( 2), and expressions for the conditional pdfs p(x m+1 | x m ) and p(y n | x m(n) ) follow from the change of variables formula.If the time discretization of the model is explicit: Otherwise, Eq. ( 6) is more complex, but the algorithm below remains the same.Substituting the mixture approximation (4) and the expressions ( 6) and (7) into Eq.( 5) gives the approximation of the target, where β j = (2π) N x det B j , and the component cost functions ϕ j are defined such that Finally, to simplify notation, let v denote the vector of independent variables that determine the cost ϕ j , i.e.,    , and ϕ j = ϕ j (v).

Implicit sampling
In general, sampling directly from the component density β −1 j exp(−ϕ j ) is impossible unless ϕ j has one of a limited number of parametric forms.Rather than make this restrictive assumption, implicit techniques use importance sampling (Geweke, 1989), which draws samples from an alternate density, called the importance, then weighs the samples to account for the difference between the actual and sampled densities.
The "optimal" importance is a Gaussian approximation of the component density with the same mode v * j .In general, the mode, which is also the global minimizer of ϕ j , must be found using numerical optimization, a task that is by no means trivial.A byproduct of any quasi-Newton optimization method is an approximation j of the Hessian of ϕ j at v * j , and hence a quadratic approximation ψ j of ϕ j such that where ϕ * j = ϕ j (v * j ) and S j = −1 j .The cost function ψ j is the basis of the "optimal" importance density γ −1 j exp(−ψ j ), where and N v is the dimension of v.The resulting Gaussian mixture approximation of the target density p is The "optimal" importance is actually optimal if the component cost function is a quadratic function of v (Doucet et al., 2000).In this case, the component distribution is Gaussian and, given the exact values of v * j and j , is identical to the "optimal" importance.In other cases, including an example below, there are better choices for the importance.

The algorithm
MIPS begins by determining the importance density for each component in three steps: 1. Compute the mixture model approximation to the background of x m(k) .
2. Find the mode v * j of β −1 j exp(−ϕ j ) and Hessian j of ϕ j at the mode.
3. Define the covariance matrix S j such that S j = −1 j .
It then proceeds as follows for each particle x  i) from the Gaussian importance γ −1 l exp(−ψ l ). 6. Give the sample the weight where l denote the values of the functions evaluated at v (i) .
Afterward, the weighted samples can be transformed into a uniformly weighted ensemble by resampling with replacement, also known as the bootstrap (Efron, 1979).Since this step adds noise to the estimates, it is best to resample only when the effective sample size (Kong et al., 1994;Liu, 1996;Doucet et al., 2000) falls below a given fraction of N p .There are a variety of improvements to the bootstrap that reduce the added noise as well.This reduction, however, is likely dominated by the error in the representation of the target (Kitagawa, 1996).Although it is not presented here, the generalization of the EM algorithm to weighted samples is straightforward.
In two special cases, MIPS is equivalent to other assimilation techniques.First, if there is a single component and everything is linear and Gaussian, MIPS is an ensemble Kalman smoother.If the model or observation functions are nonlinear or if N m = 1, it is not an ensemble Kalman smoother (for two methods that are, see van Leeuwen and Evensen, 1996;Evensen and van Leeuwen, 2000).Second, if N m = N p , MIPS is equivalent to IPF.This is because the covariance of each mixture component is the degenerate form B j = 0 (see Sect. 3.2), which fixes the value of x m(k) for each particle.
Another variation to the above algorithm that can reduce the variance of the weights is to perform importance sampling on the full mixture distribution rather than its individual components.If N m = N p , this approach is equivalent to an implementation of the marginal particle filter (Klaas et al., 2005).Its biggest drawback is that it must evaluate every component density, rather than just one, at every sample to compute the weights.

A linear and Gaussian example
As a simple demonstration of ensemble collapse in high dimensions, Bengtsson et al. (2008), Bickel et al. (2008), andSnyder et al. (2008) propose an example where there are observations every time step, i.e., m(n) = n, the model and observation are linear functions such that and the distribution of the initial condition is Gaussian.
By construction, the difference between components in the mixture model is the only source of variability in the weights.This is because the component cost functions, are quadratic and the component densities are Gaussian.It is thus possible to sample the component densities exactly.The resulting samples have weights (8) that satisfy which depends only on the component of the mixture (recall that l is a random function of i).
Expressions for the mean and covariance of the component densities, and hence the weights, follow from similar algebra to the Kalman smoother (Rauch, 1963;Jazwinski, 1970).Dropping the component index l to simplify notation, the mode is the solution of the linear equations It can be expressed as the backward recursion, where At every point, the Hessian of the cost function is the block matrix Its inverse, the covariance matrix is where the matrices are again the same as in the Kalman smoother: Substituting the expressions back into Eq.( 8) and restoring the index l shows that where ỹl = y k+1 − HAµ l , and l = HP In other words, the weights are the values of the pdf of the innovations ỹl , whose covariance is l .

Ensemble collapse
A fundamental result of Bengtsson et al. (2008), Bickel et al. (2008), andSnyder et al. (2008) is that for importance sampling methods where To simplify the analysis, Snyder (2012) takes and a standard normal prior on the initial condition.He then shows that, provided N x 1, These two cases are depicted in Fig. 2 for an example where a 2 = q 2 = 0.5.The line for MIPS with a one component mixture (MIPS1) is identically zero because all of the terms in Eq. ( 8) are constant for this example: the first two terms are constant because there is only one component in the mixture, and the final term is always 1 because the component cost ϕ and its quadratic approximation ψ are identical.
It is also evident from Fig. 2 how to choose N m in a simple way to avoid collapse.If N p is less than the value of the IPF line, let N m = 1.Otherwise, let N m = N p to revert to using IPF.The threshold N * p at which to make this change occurs when E 1 max i w (i)  = 1/0.9.Fig. 2. Theoretical lines such that E[1/ max i w (i)  ] = 1/0.9for the (solid) bootstrap particle filter, (dashed) implicit particle filter and (dot-dashed) mixture-based implicit smoother with N m = 1.The dot-dashed curve is identically zero.Similar to Fig. 3 of Snyder ( 2012) with a corrected y axis label.
Substituting this into Eq.( 9), neglecting the higher-order terms, and using the IPF expression for σ 2 gives In practice, computational resources may necessitate taking N m = 1 even well beyond this threshold, maintaining a transition to N m = N p only to preserve the theoretical properties of the particle filter.Often, even for nonlinear models, the choice N m = 1 performs quite well, as is shown below.Table 1 provides a numerical comparison to the theoretical lines in Fig. 2. In general, the number of particles necessary to avoid collapse is greater than the transition value yet by a factor less than 2. Furthermore, the difference between the two goes to zero as N x → ∞.The reason for the discrepancy is that the convergence in Eq. ( 9) is quite slow (David and Nagaraja, 2003, Ex. 10.5.3).While more accurate thresholds are possible, in realistic applications, the model and observation nonlinearities are likely to have a significant effect on the weights and thus the choice of N * p .

Multiple-well problems
This section considers three examples: a standard, onedimensional double-well and two generalizations of it to multiple dimensions.Although it is simple, the nonlinearity of the double-well is significant enough to cause difficulties  ] on N p and N x for the implicit particle filter.The number of particles N p varies along the rows and the state dimension N x along the columns.Estimates are computed from 1000 trials, and typical sampling errors are O(0.01).The italic values are the closest points to the dashed line in Fig. 2. with data assimilation methods that rely on parametric assumptions about the target, notably the extended Kalman filter (Miller et al., 1994) and EnKF (Miller et al., 1999;Evensen and van Leeuwen, 2000).For every example, the discrete model M is the result of the Euler-Maruyama method (Kloeden and Platen, 1999) applied to a continuous model f , i.e., the time step τ is 0.02, observations occur every 200 time steps, Q = 0.5τ I x , the observation operator is the identity, R = 0.1I x and the initial condition for every component of the state has mean 1 and standard deviation 0.1.These values are roughly equivalent to those used by SL13.In the doublewell problem, N x = 1, and The multiple-well examples have one of two possible forms: where x i denotes the i-th element of the vector x.
The models for the two multiple-well examples ( 10) and ( 11) are intended to illustrate two different possibilities for the asymptotic statistics of the sample solutions.Projections into three dimensions of both types of sample solutions are depicted in Fig. 3.In the first example (10), the deterministic equations have 8 stable fixed points.The corresponding stochastic equations have an invariant measure (climatology) that is approximately the sum of 8 Gaussians centered at these points.The pdf of this measure is the limit as t → ∞ of the solution of the Fokker-Planck equation (also known as the Kolmogorov forward equation; Øksendal, 2003).In the second multiple-well example (11), the deterministic equations have two stable, two-dimensional invariant sets in the limit τ → 0.
The solution of the corresponding Fokker-Planck equation, like that of a two-dimensional Wiener process, is a measure whose variances in these two subsets go to infinity as time increases.Consequently, the stochastic equations have no invariant measure.
The non-standard form of the continuous model f in the second multiple-well example ( 11) is meant to aid comparison with BPF and EnKF, which typically only apply to explicit time discretizations (1).It follows from applying backward/implicit Euler (Kloeden and Platen, 1999) to a rotation map.This ensures the stability of the discrete model.Implicit sampling methods, on the other hand, are straightforward to implement with implicit time discretizations.The only difference is in the form of the cost function ϕ j .This is a notable advantage of the implicit approach.

Transition detection
Figure 4 compares the results of EnKF, BPF, the implicit particle filter (IPF) and the mixture-based implicit particle smoother with a 1 component mixture model (MIPS1) using 10 particles for the double-well problem.Both implicit methods perform well, yet MIPS1 requires just over a tenth of the number of floating point operations of IPF.This is because it solves a single optimization problem instead of 10, while including the previous state x m(k) in the optimization problem only increases the dimension of the problem from 200 to 201.
Both BPF and EnKF miss the transition even after repeated observations because their forecasting step rarely generates a particle in the correct well.This is evident in the BPF and EnKF estimates in Fig. 4, which display noticeably less variability while in the correct well than the implicit estimates.As a result, artificial inflation of the model error covariance (Anderson and Anderson, 1999) could improve the performance of the explicit methods in these examples.
The estimates of every method for the multiple-well problems are qualitatively similar to a combination of the doublewell problem with the linear problem.In the dimensions where the model f is nonlinear, the problem resembles the double-well, and in the dimensions where the model f is linear, the problem resembles the linear example.As the state dimension N x increases, EnKF and BPF are increasingly less likely to detect the transitions from well to well.Table 2 quantifies this likelihood for the first example (some of the variability in the results for BPF is most likely due to sampling errors).Results for the second example are comparable.Unlike the explicit methods, the implicit methods consistently detect the transition because they find an optimal solution based on the observation.In these examples, the model and observation functions and covariances have a particularly simple form, and the problem can be decomposed into a collection of decoupled problems.If this decomposition is used, the performance of the explicit methods does not degrade as the dimension increases.Nevertheless, this decomposition is only possible in very special cases.

Hessian refinement
Although the implicit methods consistently detect the transition from one well to another, their weights collapse at the transition.This happens in MIPS1 because the component cost ϕ, is far from its quadratic approximation ψ, resulting in significant variation in the term exp ψ (i)  − ϕ (i) .
As a simple example, suppose ϕ(x) = κx 2 + x 4 for some small positive number κ.Then ψ(x) = κx 2 , and importance sampling based on ψ will generate very many samples far out on the tails of the target density and very few in the region of high probability.Moreover, for a fixed N p , the weight of the sample closest to the origin approaches 1 as κ approaches 0.
In many examples like the above, it is possible to decrease the variance of the weights by finding a better approximation to the covariance of the component density than the inverse Hessian.One approach is to repeatedly sample the importance density and update the Hessian based upon the agreement of the component cost ϕ and importance cost ψ.This results in a variation of the MIPS algorithm with the third step (see Sect. 3.5) replaced by the refinement step, 3 .Begin with (1) equal to the Hessian at the mode.
For n from 1 to a given number N r , draw a sam- ple v (n) from the Gaussian importance γ −1 exp(−ψ).Then compute a new Hessian (n+1) , where , and update the covariance matrix S to be the inverse of (n+1) .
While it is not presented here, the extension of the refinement to N m > 1 is straightforward.The iteration on the Hessian follows from an application of the stochastic gradient descent algorithm of Robbins and Monro (1951).Sufficient conditions for its convergence are that n is positive and, as n → ∞, The final limit condition, however, can be weakened considerably (Kushner and Yin, 2003, Chap. 5).The result of the iteration is an approximation of Hessian whose inverse is the covariance that minimizes the variances of the weights, a claim which is made precise in Appendix A. It is apparent that sampling from a Gaussian whose cost function is based on the refined Hessian is far more efficient than if the cost function were based on the local approximation.Finally, the similarity of the results with the example ϕ(x) = κx 2 + x 4 is due to the fact that the transition is the point in the assimilation where the nonlinear terms of the function f matter most.

Conclusions
A mixture model approximation of the distribution of the background state enables particle methods to adjust the background position of the particles and is often more accurate than a kernel density estimate.When combined with an implicit assimilation method, this approach, the mixture-based particle smoother (MIPS), is a possible solution for high dimensional problems.This is true for a high-dimensional, linear, Gaussian example, where MIPS does not collapse.
With only a small number of particles, the implicit method is able to detect transitions in an example with multiple attracting states.To detect the same transitions, explicit approaches like BPF and EnKF require considerably more particles.This number increases with the system dimension, provided the problem does not admit further dimensional reduction.Moreover, with the addition of an iteration that trains the proposal covariance to the true covariance, MIPS can track transitions without weight collapse given only a handful of particles.
If MIPS is to be applied to a realistic, high-dimensional assimilation problem in the geosciences, there are a number of improvements and simplifications to consider.In particular, with a limited number of samples in very high dimensions, the analytically computed values of the mode v * j and covariance S j may lead to better approximations of the component mean µ j and covariance B j than the sample estimates.Perhaps most importantly, the optimization step in MIPS can require very many floating point operations, and its efficiency is vital to the applicability of the method as a whole.However, the examples in this paper show that there is reason to believe this additional computational requirement enables the implicit approach to produce accurate estimates in high dimensions even with a very small number of particles.

Minimization of the variance of the weights
In general, the variance of the weights with respect to the density q (v | y 1:k+1 ), var q [w] = E q w 2 − E q [w] 2 , (A1) measures the success of an importance sampling method.It determines both the effective ensemble size and, to leading order, the variance of the sample mean of a general function (provided it satisfies appropriate integrability conditions; Kong et al., 1994;Liu, 1996;Doucet et al., 2000).
The goal of the Hessian refinement is to find an approximation of the matrix that minimizes the variance of the weights (A1).Since the partition function E q [w] is independent of , By definition, At this point, it is possible to apply the Robbins-Monro iteration (Robbins and Monro, 1951) to the integral objective equation However, this approach performs poorly in practice because of the exponential dependence of the weights on the difference between the model cost and quadratic cost, which is very often large in magnitude.The approximation of the minimizer follows from expanding the square of the weights into the first term of which is zero by definition of the covariance matrix.Hence, up to O v − v * 4 , the Hessian that minimizes the variance of the weights satisfies the integral objective equation Since the objective equation ( A2) is a single equation for N x (N x + 1)/2 unknowns, it has infinitely many solutions.Nevertheless, each sample v (i) only contains information about the model cost ϕ in the direction of v (i)  − v * .Similar to quasi-Newton methods from deterministic optimization, the random-direction approach makes a rank-1 update to the Hessian such that (i+1) This leaves the effect of the Hessian on the null space of v (i)  − v * unchanged.However, it is important to truncate the step if it would cause one of the eigenvalues of (i+1) to become negative.Although no formal proof is added, there is no evidence against the requirements for convergence (Kushner and Yin, 2003).

Fig. 1 .
Fig. 1.Continuous estimates of the density of a two-dimensional standard normal random variable.Contours are plotted at equal intervals of the logarithm of the (thin circles) true density and (thick curves) estimated densities.The estimates are the (a) Gaussian with sample mean and covariance and (b) kernel density estimate from the same samples.The number in parentheses is the sample size.The kernel density estimates are computed using the optimal bandwidth and have the same first two moments as the samples(Silverman, 1986).
a uniform random number r ∈ (0, 1], and find the component index l that satisfies using the convention that the sum from j = 1 to j = 0 is zero.et al.: A potential implicit particle method for high-dimensional systems 5. Draw a sample v ( a − P f C T , and I x is the N x × N x identity matrix.

Fig. 3 .
Fig. 3. Examples of three-dimensional projections of the twin solution for the two multiple-well examples: (a) 8 wells with three transitions and (b) two wells with one transition.The number of transitions in the left panel is very rare and is used for visualization purposes.

Fig. 3 .Fig. 4 .
Fig. 3. Examples of three-dimensional projections of the twin solution for the two multiple-well examples: (a) 8 wells with three transitions and (b) two wells with one transition.The number of transitions in the left panel is very rare and is used for visualization purposes.

Fig. 5 .
Fig.5.Plots of the true cost function and its quadratic approximations along a line in sample space.The direction of the line is parallel to the eigenvector of the Hessian with the smallest eigenvalue, and the mode is translated to 0. The curvature of the local approximation (dashed) is determined by the second derivative of the cost function at its minimum, while the curvature of the refined approximation (dot-dashed) is adapted to better reflect the global properties of the cost function.

Table 1 .
The dependence of E[1 max i w(i)

Table 2 .
Percentage of trials in which an estimate computed with 10 particles is in the same well as the twin solution at the final assimilation time.The well is determined by the sign of the elements of the state vector.Results are computed from 100 trials each with at least one transition.