% < \title{Bayesian optimization for tuning chaotic systems}
% ---
% > \title{Bayesian optimization for parametric tuning of chaotic systems}
% 90c90
% <       e.g.,][]{Annan2007,Solonen2012,Hauser2012,Hakkarainen2013}.
% ---
% >       e.g.,][]{Annan2007,Neelin2010,Solonen2012,Hauser2012,Hakkarainen2013}.
% 95c95
% <       measures related to forecast skills, while in climate models, tuning
% ---
% >       measures related to forecast skill, while in climate models, tuning
% 105,110c105,111
% <       heavy simulations of a~complex physical model.  Another difficulty is
% <       that in many cases the objective function is noisy: two evaluations
% <       with the same parameter values lead to different objective function
% <       values.  Such a~situation arises, for instance, when the goal is to
% <       tune stochastic systems, such as ensemble prediction systems used to
% <       quantify uncertainties in numerical weather predictions
% ---
% >       heavy simulations of a~complex physical model. Another difficulty is
% >       that in many cases the objective function is noisy: for example,
% >       when the goal is to optimize ensemble prediction systems with
% >       a~stochastic mechanism of ensemble member selection, two evaluations
% >       of the likelihood function with the same parameter values generally
% >       lead to distinct function values. Such noisy objectives arise with
% >       ensemble prediction systems used to quantify uncertainties in
% >       numerical weather predictions
% 113,114c114,119
% <       functions, as discussed in \citet{Janne2012}.
% < 
% ---
% >       functions, as discussed in \citet{Janne2012}. Another possible source
% >       of noise is the chaoticity of the tuned model. Small perturbations of
% >       the model parameters can result in significantly different simulation
% >       trajectories and therefore significant differences in the computed
% >       likelihood.
% >       
% 116,119c121,124
% <       (BO) in the problem of parametric tuning of chaotic systems such as
% <       climate models and NWP.  In BO, the parameter values where the
% <       objective function is evaluated are carefully chosen so that we learn
% <       as much as possible about the underlying function.  As a~result, the
% ---
% >       (BO) in the problem of parametric tuning of chaotic systems.
% >       In BO, the parameter values where the objective function is
% >       evaluated are carefully chosen so that we learn as much as
% >       possible about the underlying function.  As a~result, the
% 139,146c144,152
% <       \citep{Marzouk2009} to GP models \citep{Rasmussen2006}, which are also
% <       applied in the BO method. In BO, instead of first building a~surrogate
% <       model and then fixing it for further calculations, the goal is to
% <       design the points where the objective function is evaluated on the fly
% <       so that the potential of the new point in improving the current best
% <       value is maximized. That is, BO is directly built for solving
% <       optimization problems efficiently, not to represent the objective
% <       function efficiently in a~selected region of the parameter space.
% ---
% >       \citep{Marzouk2009} to Gaussian process (GP) models \citep{Rasmussen2006},
% >       which are also applied in the BO method. In BO, instead of first
% >       building a~surrogate model and then fixing it for further calculations,
% >       the goal is to design the points where the objective function is evaluated
% >       on the fly so that the potential of the new point in improving the
% >       current best value is maximized. That is, BO is directly built for
% >       solving optimization problems efficiently, not to represent the
% >       objective function efficiently in a~selected region of the parameter
% >       space.
% 154,158c160,164
% <       evaluated. We use the Gaussian processes (GP) based BO which has been
% <       previously demonstrated as a~very efficient and flexible approach in
% <       optimization of computationally heavy to compute models in several
% <       papers \citep[see, e.g.,][]{Brochu2010a,Lizotte2012}.
% < 
% ---
% >       evaluated. We use the GP-based BO because it has been shown to be
% >       a~very efficient and flexible approach \citep[see, e.g.,][]{Lizotte2012},
% >       especially for computationally heavy models
% >       \citep[see, e.g.,][]{Brochu2010a}.
% >       
% 229c235
% <       distribution over $f$ is chosen such that the function values $\vec{f}
% ---
% >       distribution over $f$ is chosen such that the function values $\vec{f_\theta}
% 233c239
% < &\vec{f}\ |\ \eta\ \sim\ \mathcal{N}( \vec{f} | \vec{0}, \mathbf{K}_{\vec{f}} )
% ---
% > &\vec{f_\theta}\ |\ \eta\ \sim\ \mathcal{N}( \vec{f_\theta} | \vec{0}, \mathbf{K}_{\vec{f}} )
% 239,240c245,246
% <       \eta )$ for the corresponding inputs $\vec{\theta}_i$,
% <       $\vec{\theta}_j$ and hyperparameters $\eta$.  The covariance function
% ---
% >       \eta )$ that depends on $\vec{\theta}_i$, $\vec{\theta}_j$
% >       and hyperparameters $\eta$.  The covariance function
% 248c254
% <   = \sigma^2 \prod_k \exp\left( \frac{-(\vec{\theta}_i-\vec{\theta}_j)^2}{2l_k^2} \right)
% ---
% >   = \sigma_f^2 \exp\left( -\frac{1}{2}(\vec{\theta}_i-\vec{\theta}_j)^T M^{-1} (\vec{\theta}_i-\vec{\theta}_j) \right)
% 251,253c257,260
% <       where $\sigma^2$ is the scaling parameter which specifies the
% <       magnitudes of the function values and $l_k$ is the parameter defining
% <       the smoothness of the function.  Both belong to hyperparameters
% ---
% >       where $M = \diag(\vec{l})^2$, $l_1,\ldots,l_D$ are the parameters
% >       defining the smoothness of the function in each dimension, and
% >       $\sigma_f^2$ is the scaling parameter which specifies the
% >       magnitudes of the function values. Both belong to hyperparameters
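As an illustration, this anisotropic squared-exponential covariance can be sketched in a few lines (a sketch only; the function and variable names are ours, not from the paper's code):

```python
import numpy as np

def sq_exp_cov(theta_i, theta_j, sigma_f2, lengthscales):
    """Squared-exponential covariance with per-dimension length
    scales: k = sigma_f^2 * exp(-0.5 * d^T M^{-1} d), M = diag(l)^2."""
    d = np.asarray(theta_i, float) - np.asarray(theta_j, float)
    m_inv_diag = 1.0 / np.asarray(lengthscales, float) ** 2  # diagonal of M^{-1}
    return sigma_f2 * np.exp(-0.5 * np.sum(d * m_inv_diag * d))
```

At zero distance the covariance reduces to the scaling parameter $\sigma_f^2$, and a larger length scale in a given dimension makes the function smoother along that dimension.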
% 261c268,274
% <       hyperparameters $\eta$.  Then, GP is used to evaluate the predictive
% ---
% >       hyperparameters $\eta$. The optimal hyperparameters $\eta$ are
% >       found by maximizing the log marginal likelihood, which can be written as
% >       \begin{align}
% >        \log p(\vec{f_\theta}|\vec{\theta},\eta) = -\frac{1}{2}\vec{f_\theta}^T \mathbf{K}_{\vec{f}}^{-1} \vec{f_\theta}
% >        -\frac{1}{2}\log |\mathbf{K}_{\vec{f}}| - \frac{n}{2}\log 2\pi \ .
% >       \end{align}
% >       Then, GP is used to evaluate the predictive
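A numerical sketch of evaluating this log marginal likelihood for a given kernel matrix, using a Cholesky factorization for stability (the small jitter term is our addition for numerical robustness, not part of the formula):

```python
import numpy as np

def gp_log_marginal_likelihood(K, f):
    """log p(f | theta, eta) = -0.5 f^T K^{-1} f - 0.5 log|K|
    - (n/2) log(2*pi), computed via a Cholesky factor of K."""
    n = len(f)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(n))  # jitter for stability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} f
    return (-0.5 * f @ alpha
            - np.sum(np.log(np.diag(L)))   # 0.5 * log|K|
            - 0.5 * n * np.log(2.0 * np.pi))
```

In practice this quantity is maximized over $\eta$ with a gradient-based optimizer; the Cholesky factor gives both the quadratic form and the log-determinant in one decomposition.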
% 267c280
% < &p( f(\vec{\theta}_\text{new}) | \vec{f}, \eta )
% ---
% > &p( f(\vec{\theta}_\text{new}) | \vec{f_\theta}, \eta )
% 272a286
% > \label{eq:meansigma}
% 274c288
% < \left( \mathbf{K}_{\vec{f}} + \vec{\Sigma} \right)^{-1} \vec{f} \\%
% ---
% > \left( \mathbf{K}_{\vec{f}} + \vec{\Sigma} \right)^{-1} \vec{f_\theta} \\%
% 282c296
% <       function, which is often parameterized as $\sigma^2 \mathbf{I}$ and
% ---
% >       function, which is often parameterized as $\sigma_n^2 \mathbf{I}$ and
% 325c339,341
% <       where $\Phi(\cdot)$ is the normal cumulative distribution function.
% ---
% >       where $\mu(\vec{\theta})$ and $\sigma(\vec{\theta})$ are defined in 
% >       \eqref{eq:meansigma}.
% >       $\Phi(\cdot)$ is the normal cumulative distribution function.
% 340c356
% < g_\text{EI}(\vec{\theta}) &= \left< f(\vec{\theta}) - \mu^+ \right> \\%
% ---
% > g_\text{EI}(\vec{\theta}) &= \left< f(\vec{\theta}) - \mu^+ \right> \\%
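For concreteness, the standard closed form of this expectation under the GP posterior, with posterior mean $\mu(\vec{\theta})$ and standard deviation $\sigma(\vec{\theta})$, can be sketched as follows (this is the textbook EI formula, which we assume matches the paper's definition):

```python
import math

def expected_improvement(mu, sigma, mu_best):
    """Closed-form EI for maximization:
    EI = (mu - mu_best) * Phi(z) + sigma * phi(z),
    with z = (mu - mu_best) / sigma."""
    if sigma <= 0.0:
        return max(mu - mu_best, 0.0)  # degenerate, noiseless case
    z = (mu - mu_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal PDF
    return (mu - mu_best) * Phi + sigma * phi
```

The first term rewards candidates whose predicted mean already exceeds the current best (exploitation), while the second rewards candidates with large predictive uncertainty (exploration).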
% 476,478c492,494
% <       respectively, $\mathbf{C}_{k-1}^\text{est}$ is the covariance matrix
% <       of the state distribution Eq.~(\ref{eq:s---y}) at time $k-1$ and
% <       $|\cdot|$ denotes the matrix determinant.
% ---
% >       respectively, $\mathbf{C}_{k-1}^\text{est}$ is the estimated
% >       covariance of the state distribution $p(\vec{s}_{k-1}| \vec{y}_{1:k-1}, \vec{\theta})$
% >       at time $k-1$ and $|\cdot|$ denotes the matrix determinant.
% 488c504
% <       using a~relatively small number of ensembles propagated by the model
% ---
% >       using a~relatively small number of ensemble members propagated by the model
% 493,500c509,521
% < 
% <        From the explanation above, we can see that such likelihood
% <        evaluation techniques using filtering methods is the example of
% <        a~target that is very heavy to evaluate, and thus, efficient
% <        optimization techniques are needed.
% < 
% < 
% < 
% ---
% >
% >        The explanation above illustrates the difficulty of tuning
% >        a~computationally heavy model using filtering likelihood
% >        techniques, and highlights the need for efficient optimization
% >        methods for such problems.
% >
% >        Note that in our optimization setup, there are two estimation
% >        processes at play. The first is the estimation of the model
% >        parameters using the filtering likelihood technique. The second
% >        is a \emph{meta-estimation} process which optimizes a surrogate
% >        model based on the filtering likelihood. The surrogate is
% >        formulated using two statistics given by the GP.
% >
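The meta-estimation loop can be sketched end-to-end for a toy 1-D problem (everything here is an illustrative simplification: the fixed kernel hyperparameters, the grid-based acquisition maximization, and all names are ours, not the paper's setup):

```python
import numpy as np
from math import erf

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel matrix between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def bo_step(x_obs, y_obs, x_cand, noise=1e-6):
    """One BO iteration: compute the GP posterior mean/std on a
    candidate grid, then return the candidate maximizing expected
    improvement over the current best observation."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_cand, x_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    sigma = np.sqrt(np.clip(var, 1e-12, None))
    z = (mu - y_obs.max()) / sigma
    Phi = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    ei = (mu - y_obs.max()) * Phi + sigma * phi
    return x_cand[np.argmax(ei)]
```

In the paper's setting the expensive objective behind `y_obs` would be a filtering likelihood evaluation, and each `bo_step` call decides where to run the next model simulation.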
% 511,513c532,534
% <       The quasi-geostrophic (QG) model simulates fluid motion dynamics on
% <       a~rotating cylinder \citep[see, e.g.,][]{ECMWF2011}. The chaotic
% <       nature of the dynamics generated by the QG model is shown in
% ---
% >       The quasi-geostrophic (QG) model simulates the slow (geostrophic)
% >       wind motions of atmospheric flow \citep[see, e.g.,][]{ECMWF2011}.
% >       The chaotic nature of the dynamics generated by the QG model is shown in
% 518c539,542
% <       layer.  The system is simulated on a~uniform grid for each layer so
% ---
% >       layer. The geometric domain of the model is specified by a cylindrical
% >       surface vertically divided into two atmospheric layers that can interact
% >       through the interface between them.
% >       The system is simulated on a~uniform grid for each layer so
% 541c565
% <       The relationship between the model physic attributes and parameters
% ---
% >       The relationship between the model physical attributes and parameters
% 564,565c588,589
% < \item layer depths are $D_1 = 6000$ units and $D_2 = 4000$\,units
% < \item distance between the grid points is $100\,000$\,units.
% ---
% > \item layer depths are $D_1 = 6000$ meters and $D_2 = 4000$\,meters
% > \item distance between the grid points is $100\,000$\,meters.
% 572c596
% <       between grid points to be $300\,000$ units. This truncation of the
% ---
% >       between grid points to be $300\,000$ meters. This truncation of the
% 594c618
% <       \sigma^2 \rho
% ---
% >       \sigma_q^2 \rho
% 608c632
% < \rho=\exp\left( -\frac{h_{ij}^2}{2\gamma^2} \right)
% ---
% > \rho=\exp\left( -\frac{h^2}{2\gamma^2} \right)
% 611,615c635,638
% <       where $h_{ij}$ is the distance between the layers if $i$ and $j$ are
% <       in the same layer and $h_{ij}=0$ otherwise. Thus, the actual tuning
% <       parameter is $\gamma$. Parameter $\sigma^2$ is the scaling parameter
% <       and $\tau$ is the nugget term often used to assure numerical
% <       stability.
% ---
% >       where $h$ is the distance between the layers. Thus,
% >       the actual tuning parameter is $\gamma$. Parameter $\sigma_q^2$ is
% >       the scaling parameter and $\tau$ is the nugget term often used to
% >       assure numerical stability.
% 622,624c645
% <       in $x$ and $h$ domains.  The surface direction can be imagined
% <       horizontal to the cylinder shown in Fig.~\ref{qg_geometry} and height
% <       is in the vertical direction.  In order to use a~valid covariance
% ---
% >       in $x$ and $h$ domains.  In order to use a~valid covariance
% 630c651
% <       with the interval of six hours. Thus, the observation operator
% ---
% >       every six hours. Thus, the observation operator
% 649c670
% <       $\sigma^2 \in [0.01\ \ 0.81]$, $\rho \in [0.61\ 0.97]$ and $\tau^2 \in
% ---
% >       $\sigma_q^2 \in [0.01\ \ 0.81]$, $\rho \in [0.61\ 0.97]$ and $\tau^2 \in
% 658c679
% <       function when parameters $\alpha$ and $\log(\sigma)$ are varied and
% ---
% >       function when parameters $\alpha$ and $\log(\sigma_q)$ are varied and
% 683c704
% <       bad samples compared to the initial samples is highly likely due to
% ---
% >       good samples compared to the initial samples is highly likely due to
% 685,686c706,707
% <       Fig.~\ref{fig:maxEI2} that the maximum found with the method gradually
% <       keeps improving over long number of iterations.
% ---
% >       Fig.~\ref{fig:maxEI2} that the best value found up to the current
% >       iteration gradually improves over time.
% 708c729
% < & \frac{\mathrm{d}z_j}{\mathrm{d}t} = - cby_{j+1}(z_{j+2}-z_{j-1}) - cz_j + \frac{c}{b}F_z + \frac{hc}{b}x_{1+\lfloor\frac{j-1}{J}\rfloor}
% ---
% > & \frac{\mathrm{d}z_j}{\mathrm{d}t} = - cbz_{j+1}(z_{j+2}-z_{j-1}) - cz_j + \frac{c}{b}F_z + \frac{hc}{b}x_{1+\lfloor\frac{j-1}{J}\rfloor}
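The fast-variable equation above can be sketched with cyclic indexing as follows (a sketch only; the zero-based index conventions and the vectorized coupling lookup are our assumptions, while the parameter names follow the equation):

```python
import numpy as np

def dzdt(z, x, c, b, h, Fz, J):
    """RHS of the fast-variable equation:
    dz_j/dt = -c*b*z_{j+1}(z_{j+2} - z_{j-1}) - c*z_j
              + (c/b)*Fz + (h*c/b)*x_{1+floor((j-1)/J)},
    with cyclic boundary conditions in j."""
    zp1 = np.roll(z, -1)  # z_{j+1}
    zp2 = np.roll(z, -2)  # z_{j+2}
    zm1 = np.roll(z, 1)   # z_{j-1}
    j = np.arange(len(z))           # zero-based; equals the paper's j-1
    coupling = x[j // J]            # x_{1+floor((j-1)/J)} in 1-based notation
    return (-c * b * zp1 * (zp2 - zm1) - c * z
            + (c / b) * Fz + (h * c / b) * coupling)
```

Each slow variable $x_i$ forces a block of $J$ consecutive fast variables, which is what the floor-division lookup implements.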
% 738,739c759,761
% <       locations each day.  The last three state variables from every set of
% <       five states are picked and thus we observe the states $3,~5,~8,~ 9,~
% ---
% >       locations each day (one day corresponds to $0.2$ time units).
% >       The last three state variables from every set of
% >       five states are picked and thus we observe the states $3,~4,~5,~8,~ 9,~
% 743c765
% <       corresponds to a~climatological standard.
% ---
% >       corresponds to a~climatological standard deviation.
% 747,748c769,770
% <       $\Delta t = 0.025$ using $50$\,\unit{days} of observations (one day
% <       corresponds to $0.2$ time units).  The tuned system is formulated as
% ---
% >       $\Delta t = 0.025$ using $50$\,\unit{days} of observations.
% >       The tuned system is formulated as
% 756c778,779
% <       The objective function is the likelihood Eq.~(\ref{eq:lik}) computed
% ---
% >       The objective function is the likelihood Eq.~(\ref{eq:lik})
% >       \citep[for details see,][]{Janne2012} computed
% 767,768c790,791
% <       function.  The initial $20$ samples were drawn using the LHS method in
% <       the region $5.0\leq\theta_0\leq7.0$ and
% ---
% >       function.  The initial $20$ samples were drawn using the Latin hypercube
% >       sampling (LHS) method in the region $5.0\leq\theta_0\leq7.0$ and
% 825c848,852
% <       critical. Third, with the exploration vs exploitation property of
% ---
% >       critical. In our experiments, we also observed that the initial sample
% >       set used to construct the surrogate based on the GP approximations was
% >       very important. We selected the initial set of design points several
% >       times in order to achieve the best GP approximation.
% >       Third, with the exploration vs exploitation property of
% 830,831c857,859
% <       ECHAM5 will be the future direction of our research.
% < 
% ---
% >       ECHAM5 will be the future direction of our research, for example,
% >       this approach can be used for tuning four parameters of ECHAM5 that
% >       are related to clouds and precipitation, as done by \citet{Jarvinen2010}.
% 989a1018,1023
% >   
% > \bibitem[{Neelin et~al.(2010)Neelin, Bracco, Luo, McWilliams, and Meyerson}]{Neelin2010}
% > Neelin,~J.~D., Bracco,~A., Luo,~H., McWilliams,~J.~C., and Meyerson,~J.~E.:
% > {Considerations for parameter optimization and sensitivity in climate models},
% > Proceedings of the National Academy of Sciences, 107,
% >   21349--21354, 2010.
% 1076c1110
% <   logarithmic scale for parameter $\sigma$. See text for a~more thorough explanation.}
% ---
% >   logarithmic scale for parameter $\sigma_q$. See text for a~more thorough explanation.}

