An initial dimension reduction forms an integral part of many analyses in
climate science. Different methods yield low-dimensional representations
that emphasize different aspects of the data. Depending on the features of
the data that are relevant for a given study, certain methods may be more
suitable than others, for instance yielding bases that can be more easily
identified with physically meaningful modes.
To illustrate the distinction between particular methods and identify
circumstances in which a given method might be preferred, in this paper we
present a set of case studies comparing the results obtained using the
traditional approaches of empirical orthogonal function analysis and

A ubiquitous step in climate analyses is the application of an initial dimension reduction method to obtain a low-dimensional representation of the data under study. This is, in part, driven by the purely practical fact that large, high-dimensional datasets are common, and to make analysis feasible, some initial reduction in dimension is required. Often, however, we would like to associate some degree of physical significance with the elements of the reduced basis, for instance by identifying separate modes of variability. Given the wide variety of possible dimension reduction methods to choose from, it is important to understand the strengths and limitations associated with each for the purposes of a given analysis.
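As a concrete illustration of the kind of initial reduction involved, a minimal PCA-style decomposition of a space-time data matrix can be sketched as follows. This is a generic sketch on synthetic data; the variable names and dimensions are illustrative assumptions, not taken from the analyses in this paper.

```python
import numpy as np

def eof_decomposition(X, n_modes):
    """Return the leading spatial patterns (EOFs), the corresponding
    principal-component time series, and explained-variance fractions.

    X is a (time, space) data matrix; anomalies are formed by removing
    the time mean at each grid point before taking the SVD.
    """
    anomalies = X - X.mean(axis=0)
    # Thin SVD: rows of Vt are the spatial patterns (EOFs), while
    # U * s gives the corresponding principal-component time series.
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    eofs = Vt[:n_modes]                    # (n_modes, space)
    pcs = U[:, :n_modes] * s[:n_modes]     # (time, n_modes)
    explained = s**2 / np.sum(s**2)
    return eofs, pcs, explained[:n_modes]

# Example on synthetic data: 200 time steps, 50 grid points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
eofs, pcs, frac = eof_decomposition(X, n_modes=3)
```

The resulting EOFs form an orthonormal spatial basis, and the data are approximated by the sum over modes of each pattern weighted by its PC time series.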

Perhaps the most familiar example in climate science is provided by an empirical orthogonal function (EOF;

Another approach to constructing interpretable representations is based on
cluster analysis, which, in its simplest variants, identifies regions
of phase space that are repeatedly visited
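In its most basic form (k-means), such a clustering alternates between assigning samples to their nearest centroid and moving each centroid to the mean of its assigned samples. The following is a generic Lloyd's-algorithm sketch on synthetic data, not the specific clustering configuration of any study cited here.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct samples from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each sample to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs in phase space are recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(3.0, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Here each centroid ends up near the center of one of the two repeatedly visited regions, which is the sense in which the cluster centroids approximate modes of the underlying distribution.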

The output of clustering algorithms such as

A simple example
would be a naïve analysis of noisy annular data, for which

Unlike PCA and standard clustering methods, AA has only relatively recently
found use in climate studies

Probabilistic formulations of PCA and AA have also been
proposed

Ultimately, though, a fuller understanding might be best obtained by using
generalizations of the above methods or combinations of multiple methods. AA can be regarded as a constrained convex encoding

The purpose of this paper is to explore some of the above issues in the
context of climate applications, in the hope that this may provide a useful
aid for researchers in constructing their own analyses. In particular, we aim
to illustrate some of the strengths and weaknesses of PCA and other dimension
reduction methods, using as examples

The remainder of this paper is structured as follows. In Sect. 2 we review the dimension reduction methods used. In Sect. 3, we compare the results obtained using each method in a set of case studies to illustrate the distinctions between the methods. Finally, in Sect. 4 we summarize our observations and discuss possible future extensions.

In this section, we first describe the dimension reduction methods that we
use in our case studies. As noted above, PCA,

In the following,
we use notation appropriate for a time series of observations with separate samples indexed by time

The factors

The

PCA, in its synthesis formulation, is equivalent to minimizing an

The partitioning that results from

See

The convex codings that we employ below, like PCA and

For

Summary of the definitions of each of the four
methods compared in this study. Each method is defined by a choice of cost function to be minimized together with a set of constraints placed on the factors

The various choices of cost function and constraints defining the above
methods are summarized in Table

Illustration of the different
decompositions obtained using

We now turn to a set of case studies that demonstrate some of the implications of the various differences noted above in realistic applications.
We consider two particular examples that highlight the importance of considering
the particular physical features of interest when choosing among possible
dimension reduction methods. The first example that we consider, an analysis of SST
anomalies, is characterized by a large separation of scales between modes of
variability together with key physical modes, particularly ENSO, that can be
directly related to extreme values of SST anomalies. This means that the basis
vectors or spatial patterns extracted by PCA,

Following

As the standard and most familiar method, we first perform PCA on
the SST anomalies over the time period January

Of course, in practice proper estimates of the out-of-sample performance would be obtained by an appropriate cross-validation procedure or similar. However, since we are primarily interested in the qualitative differences between the methods in terms of extracting recognizable states, we do not focus on the technical details of model selection or on optimizing predictive performance, and simply present these out-of-sample estimates to show the general features of each method.
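The kind of out-of-sample check referred to here can be sketched generically as follows. The chronological split, synthetic data, and choice of a PCA basis below are illustrative assumptions rather than the exact protocol of this study.

```python
import numpy as np

def pca_fit(X_train, n_modes):
    """Fit a PCA basis (training mean plus leading EOFs)."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_modes]

def reconstruction_rmse(X, mean, basis):
    """RMSE of projecting X onto the basis and reconstructing."""
    anomalies = X - mean
    recon = anomalies @ basis.T @ basis
    return float(np.sqrt(np.mean((anomalies - recon) ** 2)))

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 40))
X_train, X_test = X[:200], X[200:]      # simple chronological split
mean, basis = pca_fit(X_train, n_modes=5)
rmse_train = reconstruction_rmse(X_train, mean, basis)
rmse_test = reconstruction_rmse(X_test, mean, basis)
```

Because the basis is optimized on the training samples, the training error will generally understate the error on held-out data, which is why out-of-sample estimates are reported.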

, the EOFs and PCs are evaluated on the first

Fraction of variance explained by the first

With the above EOF patterns as a point of reference, we now turn to comparing
the representation of the dataset produced by each of

Spatial patterns for the leading

We consider the results of a

We note that, when
using a null model generated by PCA, the gap curve in Fig.

Plots of the normalized within-cluster sum of squares

The partitioning provided by a simple clustering method such as

Spatial patterns of the cluster centroids obtained from a

Two-dimensional projection of HadISST
SST anomalies obtained by metric MDS with a Euclidean distance measure.
The assignment of each point to the clusters produced by

The fact that, due to the dominating role of ENSO variability, a

Similar remarks can be made
for the AA/PCH-

Spatial patterns of the archetypes
found from AA with

Spatial patterns for the convex
coding dictionary vectors for

Two-dimensional projection of basis vectors
obtained by AA, convex coding, and

This trade-off between reproducing the data with small errors and
constraining the basis vectors to be close to the observed data is also
evident in Fig.

Training and test set RMSE for the reconstruction of SST anomalies resulting from each of the methods.

SST anomalies are an example of a dataset in which a dominant mode is well separated in scale from subleading modes of variability. As a result, all of the dimension reduction methods that we consider extract similar bases (patterns) with which to represent the data, and these can be identified with well-known physical modes. Moreover, physically interesting events such as extremes correspond to large-magnitude anomalies relative to the mean state, i.e., at the boundary of the point cloud, and so can be directly extracted by those methods that look for dictionary vectors in the convex hull of the data. This is not true for many variables of interest, however, and so we now compare the behavior of the methods when applied to data that do not exhibit these features.
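The geometric point here, that convex-hull-based methods select boundary points of the data cloud as dictionary vectors, can be illustrated on a purely synthetic two-dimensional point cloud. The support-function sweep below is an illustrative construction, not the fitting procedure of any of the methods compared in this paper.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 2-D point cloud with one dominant direction of variability,
# loosely mimicking a dataset with a well-separated leading mode.
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

# Support-function sweep: for each direction u, the sample maximizing
# <u, x> is an extreme point, i.e. a vertex of the convex hull of the data.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
extreme_idx = np.unique((X @ directions.T).argmax(axis=0))
vertices = X[extreme_idx]   # candidate "dictionary vectors" on the boundary

# Interior points (e.g. the sample closest to the centroid) are never
# selected, since they cannot maximize a linear functional over the data.
interior_idx = int(np.linalg.norm(X - X.mean(axis=0), axis=1).argmin())
```

When physically interesting events correspond to such boundary points, as for large-magnitude SST anomalies, this boundary-seeking behavior is exactly what makes the resulting basis interpretable; when they do not, the same behavior becomes a liability.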

We consider Northern Hemisphere (NH) daily mean anomalies of

Fraction of the total variance associated with
each of the first

Spatial patterns of geopotential height
anomalies corresponding to the

In the case of NH geopotential height anomalies, the cluster centroids,
archetypes, and dictionary basis vectors that result from applying

Clustering on the PCs was done so as to reduce the overall cost of the methods; we have checked that, for small numbers of clusters, the spatial patterns that result are very similar.
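A reduce-then-cluster pipeline of the kind described here can be sketched as follows. The synthetic data and the number of retained PCs are arbitrary illustrative choices.

```python
import numpy as np

def project_onto_pcs(X, n_modes):
    """Project anomalies onto the leading principal components."""
    anomalies = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(anomalies, full_matrices=False)
    return anomalies @ Vt[:n_modes].T   # (time, n_modes) PC series

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 100))         # e.g. 400 days, 100 grid points

# Any clustering method is then applied to the (much lower-dimensional)
# PC series rather than to the full grid-space anomalies, which greatly
# reduces the per-iteration cost of distance computations.
pcs = project_onto_pcs(X, n_modes=10)
```

The PC columns are ordered by decreasing variance, so truncating at a given number of modes retains as much variance as any linear subspace of that dimension.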

Similar behavior is
evident for different numbers of states; e.g., for

Two-dimensional projection of spatial patterns of geopotential height anomalies obtained using the various dimension reduction methods by metric MDS. The results of transforming line segments lying along the directions of the first
and second EOFs, as in Fig.

While all the methods identify a feature that is strongly reminiscent of the North Atlantic Oscillation, there is somewhat more variation in the remaining representative states, in contrast to the case of SST anomalies. In particular, the centroids
identified by

In the absence of any regularization, the patterns obtained by a convex
coding and by AA correspond to extreme departures from the mean state but cannot necessarily be directly interpreted as individually
representing particular physical extremes. As noted above, this arises due to
the fact that such extremes are not necessarily associated with boundaries in
state space but may instead be due to extended residence or persistence of a
given (non-extreme) state. This difficulty in directly relating atmospheric
extremes to a basis produced by AA or similar methods can be clearly demonstrated
by considering the representation of a given event in terms of this basis.
A dramatic example is provided by the 2010 summer heatwave in western Russia
that saw an extended period of well-above-average daily temperatures and poor air quality and was associated with substantial excess mortality and economic
losses

Time series of basis weights (lines) associated with each archetype produced by AA and the states produced by
convex coding with and without regularization, for

In contrast, despite the severity of this event,
individual daily height anomalies during July 2010 are not unambiguously
identified as extremes by AA or the less constrained convex coding.
In Fig.

Finally, it is worth noting that, as in the SST case study, the ambiguous classification of events provided by the convex-hull-based methods is less of a problem when state space location alone (e.g., temperature anomalies) is in itself a relevant feature. When this is not the case, as here, methods that take advantage of state space density, either due to recurrence or persistence, may be more easily interpreted, or alternatively hybrid approaches could be used in order to partition the state space.
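The distinction between density-based and boundary-based notions of a representative state can be made concrete with a toy kernel density estimate. This is a schematic illustration on synthetic data; the bandwidth and sample sizes are arbitrary assumptions.

```python
import numpy as np

def kde_density(X, points, bandwidth=0.5):
    """Gaussian kernel density estimate of the samples X at the given points."""
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / bandwidth**2).mean(axis=1)

rng = np.random.default_rng(5)
# A persistent, non-extreme state: many samples near the origin,
# plus a small number of large-amplitude transients.
X = np.vstack([rng.normal(0.0, 0.3, (200, 2)),
               rng.normal(0.0, 3.0, (10, 2))])

dens = kde_density(X, X)
modal_state = X[dens.argmax()]                          # density peak
extreme_state = X[np.linalg.norm(X, axis=1).argmax()]   # boundary point
```

A density-seeking method would report the modal state near the origin, whereas a convex-hull-based method would be drawn to the large-amplitude transient, even though the former may be the more dynamically relevant feature.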

Representing a high-dimensional dataset in terms of a highly reduced basis
or dictionary is an essential step in many climate analyses. Beyond the
practical necessity of doing so, it is usually also desirable for the
individual elements of the representation to be identifiable with physically
relevant features for the sake of interpretation. A wide range of popular
dimension reduction methods, including PCA,

In some cases, the representations obtained using different dimension reduction
methods are very similar, and one can identify more or less easily interpretable
features using any given method. This is exemplified by our first case study
of SST anomalies, in which the presence of the dominant ENSO mode ensures that
PCA,

As our second case study demonstrates, neglecting the important role played
by temporal persistence in dynamically relevant features can lead to
representations that are difficult to interpret and may not be as effective
for studying persistent states. Clustering-based approaches, or more generally
methods that attempt to approximate modes in the PDF rather than targeting the
tails of the distribution, are likely to be a better choice in these
circumstances. This can also be achieved by appropriate regularization so as
to reduce sensitivity to transient features or outliers, which otherwise drive
the definition of the basis in methods such as AA. In all of the methods
that we have considered, a lack of independence in time is not explicitly
modeled. Extensions to the simple methods that we have considered to account
for non-independence are also possible, albeit usually at the cost of increased
complexity. Singular spectrum analysis (see, e.g.,

The idea of imposing temporal regularization via assumed dynamics
for the latent weights suggests that another approach to better target
particular features is to start from an appropriately defined generative model or otherwise explicitly incorporate appropriate time dependence when constructing a reduction method. An underlying probabilistic model is already suggested by the
stochastic constraints that are imposed on the weights in AA and in convex coding,
a feature that is already taken advantage of in the case of the scalable
probabilistic approximation.
Corresponding latent variable models can naturally be constructed for
PCA

As noted in Sect.

where, for simplicity, we have assumed that the derivative of the
penalty term

The HadISST SST dataset used in this study is provided by the
UK Met Office Hadley Centre and may be accessed at

All of the authors designed the study. TJO'K proposed the specific case studies, and DH implemented the code to perform the numerical comparisons and generated all plots and figures. All of the authors contributed to the direction of the study, discussion of results, and the writing and approval of the manuscript.

The authors declare that they have no conflict of interest.

The authors would like to thank Illia Horenko for continual guidance and for his valuable inputs over the course of this project and Vassili Kitsios and Didier Monselesan for encouragement and helpful discussions about the various methods.

This research has been supported by the Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO) through a ResearchPlus postdoctoral fellowship and by the CSIRO Decadal Climate Forecasting Project.

This paper was edited by Zoltan Toth and reviewed by two anonymous referees.