Toward enhanced understanding and projections of climate extremes using physics-guided data mining techniques

Extreme events such as heat waves, cold spells, floods, droughts, tropical cyclones, and tornadoes have potentially devastating impacts on natural and engineered systems and human communities worldwide. Stakeholder decisions about critical infrastructures, natural resources, emergency
potential covariates) that are better projected (e.g., oceanic or land temperature), methods such as the SGL and future innovations may enhance projections beyond model simulations alone. We note that the ordinary least squares (OLS) approach serves only as a very naïve baseline for this analysis. In the OLS approach shown here, the number of covariates selected is simply all covariates considered; OLS intrinsically assigns non-zero coefficients to all covariates. With so many covariates, the OLS model will almost certainly have nonsensical non-zero parameters due to issues like multi-collinearity. We acknowledge that procedures like stepwise least squares may improve on the naïve OLS shown here by reducing the dimensionality of the problem. However, forward, backward, and mixed stepwise procedures have their own problems related to multi-collinearity and changes in coefficients with the addition or removal of covariates, among others. Still, we present the naïve all-covariate OLS purely as a baseline for comparison, without implying that it is or should be used in high-dimensional problems of this nature.

Managing the Risk of Extreme Events and Disasters to Advance Climate Change Adaptation, also known as the IPCC-SREX (Field et al., 2012) and in a perspective article in Nature Climate Change (Coumou and Rahmstorf, 2012). However, the impact of the changing climate on other types of extremes such as severe weather and hydrological events, including floods, droughts, storms, hurricanes, cyclones, and tornadoes, remains unclear. Mitigation policy requires quantifying the benefits of reducing emissions in terms of impacts avoided. Adaptation to natural hazards and constrained natural resources requires credible projections of extremes, along with their uncertainties, at local to regional scales. Delineating possible links between changes in weather extremes and changes in climate or land use is therefore directly relevant to both mitigation and adaptation planning.
High-resolution global climate models (GCMs), in conjunction with downscaling based on statistical or dynamical approaches, may bridge the gap. Unfortunately, the recent literature and our analyses suggest that physics-based modeling alone may not be able to keep pace with the urgency of stakeholder requirements. Each generation of climate models brings new advances, such as the recent expansion of more traditional atmosphere-ocean general circulation models into fully coupled earth system modeling systems in the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al., 2012). Coupling new models brings its own issues, however, and evaluation studies suggest that regional-scale biases persist in the latest generation of climate models despite noticeable improvements, enhanced resolutions, and the incorporation of additional physical and biogeochemical processes (e.g., Ryu and Hayhoe, 2014; Kumar et al., 2014).
Climate extremes continue to represent a major challenge. Consider the example of droughts: according to the IPCC-AR5 (IPCC, 2013) report on the "physical science basis" of climate change, scientific confidence in the ability to characterize and project droughts may have decreased over the last several years (see also Table SPM.1 in the IPCC-AR5 summary for policymakers). Two recent papers on droughts, one published in Nature Climate Change (Dai, 2013) and another in Nature (Sheffield et al., 2012), offered diametrically opposite insights. While Dai (2013) concluded that droughts globally have shown an increasing trend in the past and will worsen in the future, Sheffield et al. (2012) found a lack of trends in global drought over the past 60 years. The differing insights are summarized by Trenberth et al. (2014) in a perspectives article in Nature Climate Change.
Similar opposing insights have been reported for temperature extremes, which are generally better simulated by climate models than other extremes. Hansen et al. (2012) reported that seasonal temperature anomalies have significantly increased, while Huntingford et al. (2013) reported significantly more uncertainty and did not find an increasing trend. Apparent insights can depend on the choice of metrics and data analysis procedures (Alexander and Perkins, 2013; Huntingford et al., 2013), adding complexity to the analytic process and the interpretation of findings.
Climate-related data are rapidly increasing in size and complexity (Taylor et al., 2012). This raises the question of whether data science, which has already transformed disparate data-rich fields from biological sciences to social media to information retrieval, may also offer fresh insights to address fundamental knowledge gaps related to climate extremes. Going back to the droughts example, Trenberth et al. (2014) suggest that the reasons for the diverging insights were the different choices of underlying data and metrics.
While confidence in scientific understanding and attribution of observed trends to human-induced climate change continues to increase, the IPCC-SREX highlights key gaps in present scientific understanding of climate extremes. Previous and current research on climate extremes typically focuses on one of three areas: the physical science basis, statistics of extremes, or adaptation and potential impacts. Physical science-based analyses tend to emphasize mechanistic understanding and attribution; statistical analyses generally develop data-driven techniques for descriptive and predictive analyses (for example, recent applications of extreme value theory, change detection and sparse regression to climate extremes); and impact studies tend to focus on exposure, vulnerability and consequence assessments. Despite significant progress in all three areas, our ability to establish credible links between climate variability, climate change, and climate extremes is still insufficient to facilitate confident and risk-informed decision-making, particularly at regional and decadal scales.
Reliable projections need to generate interpretable predictive insights while accounting for knowledge gaps and intrinsic system variability. The wealth of data continues to increase, as does our conceptual understanding of processes that may generate extremes, such as the influence of oceans and climate oscillators, and local or regional terrestrial drivers. The lack of significant improvement in the latest generation of computer models may suggest that the enhanced understanding may not yet translate to improved projections. Data-driven methods by themselves may not be adequate for long-lead-time projections of a nonlinear dynamical system such as climate. Data assimilation methods have limited ability to contribute when projection lead times are large. However, dependence characterization and data-driven predictive modeling may be conditioned on the results of physics-based models, and further based on physical or process understanding, which in turn may be difficult to capture within the current set of model parameterizations. In such cases, purely data-driven methods may lead to spurious correlations or predictions, but physical constraints in the design and interpretation of such methods may guard against that possibility. For example, ocean or atmospheric temperatures from climate models may generate better characterizations and projections of precipitation extremes statistics with uncertainties (e.g., Kao and Ganguly, 2011).

The climate science question: interdisciplinary perspectives
This article focuses on what may be viewed as three interrelated grand challenges in climate change studies: (1) characterization of climate extremes, (2) comprehensive assessment of uncertainties, and (3) enhanced predictive understanding, with a goal of improving projections. Climate and earth sciences have grown from data-poor to data-rich sciences over the last couple of decades, and are likely to be at the forefront of societal challenges pertaining to big data in this century (Overpeck et al., 2011). Can the rapid and recent increases in computational power and analysis capabilities, as well as steady progress on foundational theories in statistics, nonlinear physics, information theory, signal processing, computer science, and econometrics, enable fundamental advances in climate science through computational data sciences? Can the data science methods be carefully designed to avoid spurious generalizations, and to extract physically based patterns that can be interpreted by climate scientists? Solutions for massive data volume and complexity have already made their mark in scientific and engineering disciplines as diverse as biology, astrophysics, and Internet phenomena such as Google or Facebook (Berriman et al., 2010;Langmead et al., 2010;Yang et al., 2011) and spawned new fields of research such as sensor networks (Ganguly et al., 2009a). Climate problems increasingly demand data-driven solutions, but the relevant approaches need to consider relatively unique challenges not present, or not as predominant, in fields where data sciences have proved enormously successful thus far. Thus, carefully designed parallel and distributed algorithms may be required to ensure that sophisticated methods designed for nonlinear processes and complex long-memory or long-range associations can scale and remain resilient to spurious "discoveries".

Big data challenges in climate science
Over the last few decades, climate data has expanded rapidly in both size and complexity. While weather station records remain small and relatively manageable, the advent of the satellite era and remote sensors in general, and the evolution of high-resolution weather and climate models, both of which divide the planet up into ever-decreasing grid sizes, are the primary factors driving data increases. Ensembles of archived climate model outputs have grown from a few hundred terabytes after the last IPCC assessment cycle (AR4) to the petabyte scale (AR5). The global archive of climate data is projected to grow to about 50 PB around 2015, exceed 100 PB by 2020 and reach up to 350 PB by 2030, mainly from model simulations and remote sensing observations, but also from in situ observations (Overpeck et al., 2011;Taylor et al., 2012). The pace of data growth appears to suggest that even these projections may represent lower bounds.
Disk space and processing speeds are perennial challenges. Today, however, the major technical barriers for mining massive data lie in scalable data-intensive analysis capabilities, where fast storage and scalable input/output are major concerns (Schadt et al., 2010; Trelles et al., 2011), as well as mathematical and algorithmic capabilities. Data-driven methods are not new in climate, meteorology or geophysics; the novelty is in the scalability challenges for massive data as well as the opportunities to infer novel process understanding and new predictive insights.
Recent developments in data science have sometimes focused almost exclusively on scalability to massive data rather than data complexity (Armbrust et al., 2010; Dean and Ghemawat, 2008). New methods need to consider several crucial aspects that are rather unique to climate science and related disciplines: climate data exhibit complex space-time dependence; the data-generation processes are highly nonlinear and may be extremely sensitive to initial conditions; variability may occur over long time frames and thus may be difficult to evaluate with limited historical data; spatial dependencies may be based on proximity as well as long-range teleconnections with time lags or leads, which makes the discovery of associations a combinatorial challenge; and extremes, unusual patterns or anomalies are of interest, particularly at higher resolutions. The dominance of nonlinear and non-stationary processes, combined with the need for projections (e.g., of extreme values) over long lead times, precludes reliance on data-driven projections alone.
Characterization of irreducible uncertainties through predictability studies (e.g., Karamperidou et al., 2013) is a major challenge in climate science that may be relatively unique among the urgent big-data challenge areas. Sterk et al. (2012) measured predictability of extremes with relatively simple geophysical models using a finite-time Lyapunov exponent. Delsole and Tippett (2009a, b) proposed a measure based on average predictability considering all lead times without time averaging. Koster and Suarez (2000) studied predictability of precipitation in the context of climate variability. Teng (2010, 2012) studied decadal-scale predictability from an ensemble of multiple initial condition runs using relative entropy. Giannakis and Majda (2012) have used data-driven methods for dynamical systems (with applications in climate-atmosphere-ocean science) to quantify predictability and extract spatiotemporal patterns. The approaches are relatively new to climate but have tremendous implications for stakeholders and decision makers. The implications of adapting these methods to big data have not been studied in detail.
Big data has its own unique problems. A major challenge related to working with large data sets is avoiding false positives, especially when looking for patterns in the data that are rare. The problem arises because, when a large amount of data is considered, the probability of encountering a random occurrence of the target pattern in the data is also high. From the viewpoint of statistical tests, the p value loses much of its relevance when the sample size is very large: because even negligible departures from a null hypothesis become detectable, virtually any null hypothesis will be rejected, with a p value close to zero. The Bonferroni correction, a simple and conservative way to avoid these false positives when searching through the data, has been used widely in the past with large data sets. However, avoiding false discoveries is still an active research area, and several new methods have been proposed in the last two decades (Benjamini and Hochberg, 1995; Bogdan et al., 2008; Dudoit et al., 2003; Efron, 2007) that improve upon the Bonferroni correction through both methodological and theoretical developments. Another problem with big data arises if one tries to identify the distribution a variable follows based solely on p values. Goodness-of-fit tests become extremely sensitive to small, inconsequential deviations when the sample size is large. The issue of false positives with big data has been discussed in the context of a commonly used statistical approach for climate extremes. Resampling techniques have been used to study properties of climate extremes (e.g., Kharin et al., 2005, 2007), where the authors also list caveats and challenges for such usage. A recent proposal for bootstrap in big data (Kleiner et al., 2014) and other alternatives require further study in order to understand how to use resampling for extremes of climate variables from large data sets.
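The contrast between the Bonferroni correction and the false-discovery-rate procedure of Benjamini and Hochberg (1995) can be illustrated with a minimal sketch. The function names and the simulated z-scores below are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import math
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p value is compared against alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank that meets its threshold
        reject[order[:k + 1]] = True    # reject every hypothesis up to that rank
    return reject

# Hypothetical search over 10 000 candidate patterns, 100 of which are real.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.0, 1.0, 9900), rng.normal(4.0, 1.0, 100)])
# One-sided p values under a standard normal null.
p_vals = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

print("uncorrected rejections:", int((p_vals <= 0.05).sum()))
print("Bonferroni rejections: ", int(bonferroni(p_vals).sum()))
print("BH rejections:         ", int(benjamini_hochberg(p_vals).sum()))
```

Without any correction, roughly 5 % of the null patterns would be flagged by chance alone; Bonferroni removes most of these at the cost of power, while the step-up procedure rejects at least as many hypotheses as Bonferroni at the same level.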
Due to this ever-present risk of coming up with spurious discoveries and insights with big data, the importance of physics-guided data mining needs to be emphasized further. We can either use physical constraints to validate the data-driven knowledge discoveries or incorporate the physical constraints in the knowledge discovery process by mapping them either as statistical constraints or in the selection of variables and distributions.

Societal urgency and state of the science
The types of extreme events discussed here have the potential to cause significant devastation; as shown in Fig. 1, (a) the most significant economic loss results from hurricanes/cyclones and floods, and (b) the largest number of deaths from droughts, tropical cyclones and floods. Mortality and economic losses from tornadoes and severe thunderstorms have been of significant concern in the United States, given the devastating losses in 2011 (Simmons et al., 2012). The size depicting each type of hazard provides a measure of our uncertainty under climate change; unfortunately, we find that the level of uncertainty is generally high for the most destructive hazards (Bouwer, 2011). Even for the relatively better understood temperature extremes, such as heat waves and cold snaps, large uncertainties remain, especially at regional scales (Ganguly et al., 2009b). Recent studies (Fischer et al., 2013; Fischer and Knutti, 2014) suggest that these large uncertainties will likely persist even if climate models improve rapidly (Maslin and Austin, 2012; Kumar et al., 2014). Hazards are expected to be more severe for poorer and more vulnerable regions; developed economies, however, are not immune to loss either, as demonstrated by Paris and Chicago heat-wave mortality (Hayhoe et al., 2010) and United States Gulf Coast hurricane impacts (Burby, 2006). Recent studies have advanced our understanding of observed trends in heavy precipitation or flooding and attributions to global warming (Min et al., 2011; Pall et al., 2011). However, uncertainties remain in interpreting observed extremes (Ghosh et al., 2011; Goswami et al., 2006) and in reliable projections of extremes' intensity-duration-frequency at regional scales that are crucial for water and flood management. Floods in particular are less well understood owing to cascading uncertainty from projections of heavy rainfall to consequences for surface hydrology and impacts on water management (Schneider and Kuntz-Duriseti, 2002).
Nevertheless, generating credible projections of climate variables at regional or even local scales remains an important step for reliable assessments of hazards and their consequences.
GCMs are designed to simulate the large-scale circulation of the atmosphere and its response to external forcing, and need to be evaluated from that perspective. However, impacts, adaptation, and vulnerability (IAV) studies occasionally use GCM simulations directly or after statistical and dynamical downscaling. Model-based assessments in the future need to ultimately rely on GCM projections. Nevertheless, relatively naïve utilization of GCM projections for IAV studies may yield non-informative or even misleading conclusions. This is illustrated through comparisons between the CMIP3 and CMIP5 climate model simulations at continental and global scales in terms of average temperature and precipitation. The two GCM ensembles are evaluated against the observation-based National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR, NCEP-1) and National Centers for Environmental Prediction/Department of Energy (NCEP/DOE, NCEP-2) reanalysis data. For precipitation, the Global Precipitation Climatology Project (GPCP) observational data are used in addition. Aggregate comparisons do not appear to suggest significant improvements of CMIP5 over CMIP3. While this does not necessarily imply a lack of improvement in CMIP5 over CMIP3 in terms of large-scale dynamics, it does suggest the need for caution when GCMs (with or without downscaling) need to be used for IAV studies.
Can improvements in physics and higher-resolution models increase otherwise inadequate precision and enhance the accuracy of projections for climate-related extremes? Projections from global models tend to grow more uncertain with increased spatial and temporal resolution, especially for precipitation, particularly so over the tropics. The possibility that the current-generation and higher resolution CMIP5 models will improve projections compared to the previous-generation phase 3 (CMIP3) models remains to be tested at appropriate scales.
When comparing the performance of GCMs, it is important to carefully distinguish between model evaluation and the translation of model outputs into information relevant for impacts, adaptation, and vulnerability (IAV) studies. GCMs are designed to model large-scale atmospheric dynamics, and from that perspective, recent results suggest general improvement of the ensemble of CMIP5 models compared to CMIP3 (e.g., Ryu and Hayhoe, 2014). However, any improvement in the internal physics or dynamical behavior of models may not be immediately manifested in, for example, model ability to reproduce absolute values of temperature or precipitation at regional and seasonal scales, or in their extremes. Nonetheless, IAV studies may occasionally rely on GCM simulations of temperature and precipitation for future assessments, either directly or indirectly after statistical or dynamical downscaling. One of the primary functions of downscaling, particularly statistical, is to remove GCM-simulated biases in absolute values for IAV applications that require absolute values to assess impacts. The importance of this step is illustrated in Fig. 2, which compares a 7-member CMIP5 versus CMIP3 ensemble with National Centers for Environmental Prediction (NCEP-I and NCEP-II) reanalysis temperature and Global Precipitation Climatology Project (GPCP) precipitation. Based on a straightforward comparison, no improvements are apparent either in terms of the multimodel median projections or in terms of the uncertainty bounds as expressed by the range of the multimodel ensemble. In fact, CMIP5 almost consistently predicts higher temperatures and precipitation compared to the CMIP3 multi-model median, but these higher values do not necessarily agree better with the observations.
A. R. Ganguly et al.: Toward enhanced understanding and projections of climate extremes
These preliminary results (further details in Kumar et al., 2014) may appear to provide further support to arguments (Hulme et al., 2009) that model improvements alone may not provide immediate answers to stakeholder questions or adaptation needs and additional analyses are clearly required in order to extract information from GCM simulations directly relevant to and able to be used by IAV assessments. This is precisely where big data solutions (and in the case of extremes, big data solutions that are ultimately geared towards rare event and small data, or elusive indicators thereof) may provide value. Improvements in internal physics and large-scale dynamics of GCMs may not directly improve the variables of most immediate interest to IAV studies. However, data-driven methods may still be able to leverage the improvements in the larger-scale or internal model variables and yield improved projections for the variables of interest to IAV. For the data-driven projections to be interpretable and useful, they need to be guided by physical understanding, where the latter physics may not be directly captured by GCMs, perhaps even after downscaling.

Characterization of climate extremes
Climate extremes often refer to well-defined weather or climate events that are quantified using measurable physical quantities such as temperature, precipitation, or wind speed and that are rare (i.e., occurring at the tails of the distribution) relative to current climate states (Zwiers et al., 2013). The definition of climate extremes, in general, varies with the nature of the phenomena and may be based on their impacts. Extremes such as hurricanes, tornadoes and floods cause immediate and widespread devastation, while droughts tend to unfold slowly, are spatially extensive, non-structural and have longer-lasting impacts. While phenomena like heat waves under climate change are better understood than most other climate-related extremes (Coumou and Rahmstorf, 2012; Field et al., 2012), their very definitions may depend on the impact sector of interest (Ebi and Meehl, 2007). Quantitative research relating climate extremes and anomalies to impacts, for example on terrestrial ecology (Zscheischler et al., 2014) and agricultural production (Lobell et al., 2006, 2012), often examines climate indices derived from extremes with disciplinary specificity. Figure 3a shows that different definitions of hot extremes can significantly impact the final insights; however, each definition remains useful for its specific context, such as energy demand (Christenson et al., 2006) or public health (Kovats and Kristie, 2006).
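To make concrete how the choice of definition can change the insight, the following minimal sketch computes two hot-extreme indices for two hypothetical sites. The 18 °C base temperature for cooling degree days and the 3-night averaging window for the nighttime index are assumptions made for illustration, not the definitions used in the cited studies, and the synthetic temperature series are invented.

```python
import numpy as np

def cooling_degree_days(t_mean, base=18.0):
    """Sum of daily mean temperature excesses above `base` (degree-days).
    The 18 degC base is an assumed, commonly used value."""
    t = np.asarray(t_mean, dtype=float)
    return float(np.clip(t - base, 0.0, None).sum())

def nighttime_heat_index(t_min, window=3):
    """Warmest running mean of daily minimum temperature over `window`
    consecutive nights (the window length is an assumption)."""
    t = np.asarray(t_min, dtype=float)
    kernel = np.ones(window) / window
    return float(np.convolve(t, kernel, mode="valid").max())

# Two hypothetical summer seasons (92 days each).
rng = np.random.default_rng(3)
days = np.arange(92)
seasonal = 4.0 * np.sin(np.pi * days / 91)

# Site A: hot days, cool nights (large diurnal range).
a_min = 16.0 + seasonal + rng.normal(0.0, 1.0, days.size)
a_mean = a_min + 9.0
# Site B: milder days, warm humid nights (small diurnal range).
b_min = 22.0 + seasonal + rng.normal(0.0, 1.0, days.size)
b_mean = b_min + 2.0

for name, t_mean, t_min in (("A", a_mean, a_min), ("B", b_mean, b_min)):
    print(name,
          "CDD:", round(cooling_degree_days(t_mean), 1),
          "night index:", round(nighttime_heat_index(t_min), 1))
```

In this constructed example the energy-demand metric ranks site A as hotter, whereas the nighttime index ranks site B as hotter, which mirrors the point that each definition is useful only within its own impact context.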
As model-simulated and observational databases, and the importance of informing adaptation or mitigation policy, continue to grow, descriptive analysis of multiple definitions of model-projected and observed extremes will at once become a larger and more complex task. Surprising insights about cold temperature extremes (Kaspi and Schneider, 2011;Kodra et al., 2011) are still being discovered from observed and model-simulated data. Thus, while decreasing frequency of cold extremes has been reported (Coumou and Rahmstorf, 2012), there is still a need for better characterization and improved mechanistic understanding of their potential persistence in a warming world.
Recent advances in attribution of heavy rainfall do not directly translate to improved information for adaptation (Min et al., 2011; Pall et al., 2011). Thus, intensification of precipitation extremes under warming, which is partially explained through our conceptual process understanding (O'Gorman and Schneider, 2009; Sugiyama et al., 2010), is projected relatively credibly in the extra-tropics and at continental to global average scales. However, large uncertainties remain in estimating the precise degree of change and for specific regions such as the tropics, where diverging insights (Ghosh et al., 2011; Goswami et al., 2006) have been recently reported owing to differing characterizations of extremes. Extreme value theory (EVT) has been used in hydrology (Towler et al., 2010) and climate (Ghosh et al., 2011; Kao and Ganguly, 2011; Min et al., 2011) to characterize rainfall extremes. Moreover, hydrological extremes are described by several mutually correlated characteristics, such as peak flow, volume and duration (Zhang and Singh, 2007) for floods and severity, duration, intensity and spatial extent for droughts (Reddy and Ganguli, 2013; Song and Singh, 2010).
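A common EVT workflow, shown here only as a minimal sketch and not as the procedure of any cited study, fits a generalized extreme value (GEV) distribution to annual maxima and reads off return levels. The synthetic gamma-distributed daily rainfall and the helper name `return_levels` are assumptions for illustration.

```python
import numpy as np
from scipy.stats import genextreme

def return_levels(annual_maxima, periods=(10, 50, 100)):
    """Fit a GEV by maximum likelihood and return the T-year return levels.

    Note: SciPy's shape parameter c equals -xi in the usual GEV convention.
    """
    c, loc, scale = genextreme.fit(annual_maxima)
    return {
        T: float(genextreme.ppf(1.0 - 1.0 / T, c, loc=loc, scale=scale))
        for T in periods
    }

# Hypothetical record: 60 years of synthetic daily rainfall (mm/day).
rng = np.random.default_rng(0)
daily = rng.gamma(shape=0.5, scale=8.0, size=(60, 365))
annual_max = daily.max(axis=1)  # block maxima, one value per year

levels = return_levels(annual_max)
for T, value in levels.items():
    print(f"{T:>3}-year return level: {value:.1f} mm/day")
```

With only 60 block maxima, the 100-year level is an extrapolation beyond the record, so in practice it would be reported with confidence bounds (for example from a bootstrap) rather than as a single number.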
Univariate frequency analyses cannot provide accurate assessment of the probability of occurrence of extremes if the underlying event is characterized by mutually correlated random variables, and may lead to over- or underestimation of the associated risk (Chebana and Ouarda, 2011). Hence, multivariate statistical approaches are often necessary in order to completely assess the risk of hydrological extremes. Further developments in the statistical theories related to multivariate extremes are needed for advancing our ability to quantify the complex dependencies of climate extremes more completely, and with greater certainty (Kuhn et al., 2012; Marty and Blanchet, 2011; Mastrandrea et al., 2011; Turkman et al., 2009; Wadsworth and Tawn, 2012). Descriptions of rainfall extremes, whether based on EVT or fixed/dynamic thresholds, need to characterize changing statistics of storm events (Kao and Ganguly, 2011) and droughts (van Huijgevoort et al., 2012), and be relevant to multiple sectors, including hydraulic infrastructure design and flood and drought management policy. A recent study of probable maximum precipitation (PMP) and climate change (Kunkel et al., 2013) may offer new ways to blend physics and data-driven insights for precipitation extremes.
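The risk of treating correlated characteristics separately can be seen in a small numerical sketch. Here two standardized variables, which could stand for flood peak and flood volume, are drawn from a bivariate normal distribution with an assumed correlation of 0.7; the distribution, the correlation, and the function name are hypothetical choices for illustration only.

```python
import numpy as np

def exceedance_probabilities(x, y, q=0.95):
    """Empirical probability that both variables exceed their q-quantiles,
    alongside the value implied by (wrongly) assuming independence."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_thr, y_thr = np.quantile(x, q), np.quantile(y, q)
    joint = float(np.mean((x > x_thr) & (y > y_thr)))
    independent = float(np.mean(x > x_thr) * np.mean(y > y_thr))
    return joint, independent

rng = np.random.default_rng(1)
rho = 0.7  # assumed dependence between the two characteristics
cov = [[1.0, rho], [rho, 1.0]]
sample = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

joint, independent = exceedance_probabilities(sample[:, 0], sample[:, 1])
print(f"joint exceedance (empirical):   {joint:.4f}")
print(f"product of marginals (indep.):  {independent:.4f}")
print(f"underestimation factor:         {joint / independent:.1f}")
```

Multiplying the marginal probabilities understates the chance of a compound event several-fold in this setting, which is the kind of error that copula-based and other multivariate approaches are intended to avoid.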
Can data-driven methods provide new insights for understanding and characterizing these extremes? Figure 3bc presents fully automated and computationally efficient spatio-temporal characterization of long-term droughts using a Markov random field (MRF)-based approach (Fu et al., 2012); this type of MRF approach has been validated by automatically detecting the intertropical convergence zone from instantaneous satellite data (Bain et al., 2011). The algorithm was able to detect some of the major global droughts and proved to be efficient in detecting droughts as compared to fixed percentile-based approaches. The method has been applied to detect all persistent droughts over the past century. Negative precipitation anomalies persisting for at least 5 years are considered significant (hydrologic) droughts and are shown here for data from 1970 to 1998. The Sahel drought is clearly detected, as are several others. While this analysis uses a single variable, specifically Climatic Research Unit (CRU) precipitation observations, the method is capable of handling multiple variables that contribute to the characterization of droughts, such as precipitation, soil moisture, and geopotential height. In fact, this MRF-based approach, once generalized to multiple variables, may be viewed as a methodological improvement to the wavelet-based method (Narisma et al., 2007) for abrupt drought detection in the literature. One of the advantages is the ability to fully automate the drought detection procedure with fewer predefined parameters, which may be useful for the detection of megadroughts from paleoclimate data or plausible megadroughts from model projections.
Here we present three different choices: an energy-consumption related metric called cooling degree days or CDD (left panels); a heat-wave intensity index (Ganguly et al., 2009b) thought to be relevant for human mortality defined as consecutive nighttime minima events (middle panels); an index grounded in the statistical theory of extreme values. The substantial regional differences suggest the differences in the nature of the insights. (b) Novel data-driven approaches can help detect climate-related extremes, particularly ones like droughts that are especially difficult to characterize. Our analysis (bottom left panel) suggests that Markov-random-field (MRF) based approaches may improve the detection process but traditional implementations may not scale to large data. We have developed a new, computationally efficient optimization solver to implement the MRF (Fu et al., 2012). As a proof of concept, here we show how the new method detects persistent and significant droughts over space and time. (c) We used three popular methods to solve the same MRF inference. Our algorithm for characterizing droughts, "KL-ADM", is approximately one order of magnitude faster than an existing popular routine called "Proximal" (dark red) and much faster than any commercially available software (e.g., IBM ILOG CPLEX Optimizer, http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/). The first (second) data set (x axis) is a simulated data set with 100 000 (200 000) variables and 293 500 (586 000) two-way relationships among them, where each variable can take on 3 (4) possible values. The third data set is the Climatic Research Unit precipitation data set, which has more than 7 million variables (i.e., points in space) and each can take on two possible values (drought or no drought). This example clearly shows a significant speedup in computation using KL-ADM, especially with four parallel processors.
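The idea behind MRF-based drought labeling can be sketched on a toy problem. The example below is not the KL-ADM solver of Fu et al. (2012); it swaps in iterated conditional modes (ICM), a simple greedy MRF inference scheme, on a small two-dimensional grid of synthetic standardized precipitation anomalies. The class means, the smoothness weight `beta`, and the grid are all assumptions for illustration.

```python
import numpy as np

def total_energy(labels, anomalies, beta=1.0, mu=(0.0, -1.5)):
    """Unary data misfit plus beta times the number of disagreeing neighbor pairs."""
    unary = np.where(labels == 1,
                     (anomalies - mu[1]) ** 2,
                     (anomalies - mu[0]) ** 2).sum()
    pairwise = ((labels[1:, :] != labels[:-1, :]).sum()
                + (labels[:, 1:] != labels[:, :-1]).sum())
    return float(unary + beta * pairwise)

def icm_drought_labels(anomalies, beta=1.0, mu=(0.0, -1.5), sweeps=10):
    """Label each cell 0 (no drought) or 1 (drought) by greedy ICM updates."""
    a = np.asarray(anomalies, dtype=float)
    cost = np.stack([(a - mu[0]) ** 2, (a - mu[1]) ** 2])  # shape (2, ny, nx)
    labels = cost.argmin(axis=0)  # start from the pixel-wise (threshold) labeling
    ny, nx = a.shape
    for _ in range(sweeps):
        changed = False
        for i in range(ny):
            for j in range(nx):
                neigh = [labels[r, c]
                         for r, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= r < ny and 0 <= c < nx]
                # Local energy of each candidate label given current neighbors.
                local = [cost[k, i, j] + beta * sum(n != k for n in neigh)
                         for k in (0, 1)]
                best = int(np.argmin(local))
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break
    return labels

# Synthetic field: a 10 x 10 drought patch inside a 20 x 20 domain, plus noise.
rng = np.random.default_rng(2)
truth = np.zeros((20, 20), dtype=int)
truth[5:15, 5:15] = 1
noisy = -1.5 * truth + rng.normal(0.0, 0.5, truth.shape)

pixelwise = (noisy < -0.75).astype(int)  # fixed-threshold baseline
labels = icm_drought_labels(noisy)

print("pixel-wise accuracy:", float((pixelwise == truth).mean()))
print("MRF (ICM) accuracy: ", float((labels == truth).mean()))
```

The pairwise term penalizes isolated labels that disagree with their neighbors, which is how the spatial (and, in the full method, temporal) persistence of droughts enters the detection. ICM only finds a local minimum of the energy, which is one reason scalable solvers are needed for grids with millions of variables.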
The added value of the MRF-based approach, beyond proof-of-concept detection of known droughts, would be demonstrated when the methods are generalized for multiple variables, and subsequently used for the evaluation of historical multi-model ensembles as well as for the generation of future projections with uncertainty from model projections in forecast mode. Computationally scalable and flexible detection approaches based on spatio-temporal similarity between drought events (Lloyd-Hughes, 2012) have also recently been developed. On a completely different scale, our recent research (Ganguli and Ganguly, 2013) explores severity-duration-frequency curves for observed meteorological droughts over the continental US during the last few decades through copula-based approaches.

Computational challenges in downscaling
As long as the spatiotemporal scales relevant to stakeholders and policymakers are inadequately resolved by GCMs, downscaling will continue to remain highly relevant to impact analyses. Driven by GCM outputs, downscaling inherits many of their problems and generates (often massive volumes of) additional data, thus amplifying the big data challenge in terms of both data size and complexity. Statistical downscaling (Bürger et al., 2012; Mannshardt-Shamseldin et al., 2010; Robertson et al., 2004) model outputs are relatively computationally inexpensive to generate, but criticisms (Eden et al., 2012; Schmith, 2008) have focused on model complexity and the lack of clarity on whether statistical models will perform well far into the future or on disparate regions. Dynamical downscaling (Pierce et al., 2012; Trapp et al., 2010), based on regional climate models, is much more resource intensive and is not independent of stationarity assumptions in sub-grid scale parameterizations, either. The primary advantages over statistical downscaling are explicit incorporation of topography and higher-resolution process models, which are critical given the possible importance of finer-scale processes (Jung et al., 2012; Diffenbaugh et al., 2005). However, regional climate models parameterize such processes, often leading to significant inter-model disagreement, e.g., on precipitation (Palmer et al., 2004). Figure 4 illustrates the ability of both statistical (Ghosh, 2010) and dynamical (Heikkilä et al., 2010) downscaling to provide more precise insights than the original global model results. Dynamical downscaling over the island nation of Sri Lanka (Fig. 4a-c) suggests, upon visual inspection, that the approach may be able to better capture the expected influence of topography on heat waves beyond global models, particularly since successive resolution enhancements reveal distinct orographic patterns.
On the other hand, the statistical hypothesis test does not necessarily indicate significant improvement, which suggests the importance of multimetric explorations and rigorous evaluation of downscaling results. While the value of dynamical downscaling as a tool for hypothesis testing cannot be denied (despite news articles such as Kerr, 2013), the propagation of uncertainty (Sain et al., 2011) remains a challenge for projections. Over India, while global models suggest uniformly increasing trends in rainfall extremes, the results from statistical downscaling (Fig. 4d-f) show evidence for considerable geographical heterogeneity, which in turn agrees with the latest findings on spatial variability of extremes (Ghosh et al., 2011).

Complexity of uncertainty assessments
A thorough and comprehensive characterization and quantification of uncertainty, which may result from imprecise observational data, inadequate models, or intrinsic climate system variability, is invaluable for stakeholders and policymakers but difficult and often even impossible to achieve. Best-estimate projections and corresponding uncertainty bounds under climate change are sometimes thought to be better captured with multimodel ensembles. It is important to evaluate the ability of models to simulate historical climate patterns (Pierce et al., 2009), but that alone may not be sufficient for climate models in view of non-stationarity and long lead time projections. Multimodel agreement in the future becomes an important metric, with the notion that consensus implies higher certainty (Smith et al., 2009; Weigel et al., 2010). Empirical studies suggest that averages of output from multiple models outperform individual models, and that this insight is insensitive to which specific models are averaged (Pierce et al., 2009). However, the value of multimodel averages has been questioned (Knutti, 2010), particularly for regional assessments (Kodra et al., 2012). Recent attempts at regional assessments include the development of statistical methods that consider both model performance relative to historical observed data and model ensemble agreement (Smith et al., 2009; Ganguly et al., 2013).
One way to improve the uncertainty assessment approaches may be to consider physical and correlative relations in combination with historical model skills and future multimodel agreement. For example, in Fig. 5, observations and model simulations may exhibit regional differences in their adherence to known physical relations. Evaluating the extent to which observed rainfall extremes follow physical relationships like the Clausius-Clapeyron (CC) relation may help identify systematic patterns in extreme rainfall behavior that could be encapsulated in multimodel uncertainty quantification methodology. We are not aware of any existing statistical strategy (e.g., along the lines of Smith et al., 2009) that attempts to explicitly utilize theoretical physical processes in addition to historical skills and multimodel agreement.
Figure 4. [...] (c), does not yield substantial evidence for differences in spatial distributions of model runs, which is probably owing to small sample sizes for the 100 km and 36 km resolution data. However, the effects of topography in mid-southern Sri Lanka appear more prominent at higher resolutions. The sheer size of the newly generated dynamically downscaled simulations, as well as the problem complexity, further intensifies the need for big data solutions. Geographical heterogeneity in the trends of rainfall extremes over India, shown in a recent observation-based study, is suggested after downscaling but not directly from the global model runs (Ghosh et al., 2011).
Figure 5. Uncertainty quantification adds to the big data challenge. Multimodel ensembles have been used to quantify uncertainty in the structural representation of climate physics; their performance has been evaluated by investigating skills in reproducing historical behavior (skills) and multimodel agreement (convergence) in the future.
Here we investigate the uncertainty in precipitation extremes and explore whether physically based relations, like the temperature-dependence of precipitation extremes through the saturation vapor pressure (known as the Clausius-Clapeyron, or CC, relation), may help further inform uncertainty assessments and skill-based model selection. For the (a) southeastern and (b) southwestern US, a 7-member CMIP3 model ensemble is used for the analysis, with NCEP2 used as a baseline model and the theoretical CC curve shown for comparison. Every point from each model represents a 20-year mean temperature (1980-1999) on the x axis and a 30-year daily rainfall return value, i.e., a daily rainfall intensity that on average occurs or is exceeded only once in 30 years (1980-1999, y axis), with nonlinear regressions fit to each data set and uncertainty bounds computed using a bootstrap-based resampling procedure. The value of using the multivariate physically based CC relation in uncertainty quantification is suggested, particularly for extremes (specifically, heavy rainfall) where covariate relations (specifically, temperature-dependence) are known from process physics (e.g., Clausius-Clapeyron).
In Fig. 5a, the observations and multiple GCMs are compared to the theoretical CC (scaled to compare with the other curves) over the eastern US. An analogous plot is shown for the southwestern US in Fig. 5b; the use of different regions makes apparent the degree to which data (observed and modeled) adhere to conceptual physical relations (in this example, the CC). Each point represents a 20-year mean temperature (1980-1999, x axis) and an estimated 30-year return rainfall value (calculated from 1980-1999) for a land-based grid cell. Polynomial spline regression, a nonparametric smoothing regression approach (Taylor, 2012), is used to fit the rainfall return values on mean temperature. This is performed for all models and for NCEP2 reanalysis. Regression model fits are depicted by the colored lines. The theoretical CC relation is depicted by the red line; a manually calibrated multiplicative scaling factor of 0.00023 (0.00027 for the southwest) was applied for visual purposes (to line the CC up in the same space as the data) and should not affect the results significantly. Note that the level of the CC line has no real meaning beyond this scaling; only the exponential pattern does. Uncertainty bounds for the multimodel ensemble are created with a resampling scheme combined with the same spline regression.
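To make the exponential shape of the CC reference curve concrete, the following minimal sketch evaluates saturation vapor pressure with a Magnus-type approximation and reports its fractional increase per degree of warming. The constants are one commonly used parameter set for this approximation and are an assumption here, as are the function names; the source does not state which CC formulation or constants were used for Fig. 5.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over liquid water (hPa) for a temperature
    in degrees Celsius, using a Magnus-type approximation (assumed constants)."""
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))


def cc_scaling_per_kelvin(temp_c):
    """Fractional increase in saturation vapor pressure for 1 K of warming."""
    warmer = saturation_vapor_pressure_hpa(temp_c + 1.0)
    return warmer / saturation_vapor_pressure_hpa(temp_c) - 1.0


for t in (0.0, 10.0, 20.0, 30.0):
    print(f"{t:5.1f} C  e_s = {saturation_vapor_pressure_hpa(t):6.2f} hPa  "
          f"rate = {100 * cc_scaling_per_kelvin(t):.1f} %/K")
```

The rate is roughly 6 to 7 % per kelvin at typical surface temperatures, which is why only the exponential pattern of the CC line, and not its level, carries meaning in a comparison of this kind.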
Besides model-to-model uncertainty, internal model variability due to different choices of parameters is also a major source of uncertainty but is more difficult to quantify due to computational constraints. Running a single GCM with multiple sets of parameters (Stainforth et al., 2005) yields a large number of simulations but requires enormous computational resources (Stainforth et al., 2002, 2005). Such an approach may generate substantial single-model insights (Stainforth et al., 2005) but is not yet feasible on a massive, multimodel scale. Evaluation of multiple models remains an important step in comprehensive uncertainty assessments, even though structural differences may make inter-model comparisons difficult and at times even infeasible.
Requirements to provide uncertainty estimates almost invariably magnify the data challenge, by generating more model-simulated data (Stainforth et al., 2005) and/or by requiring more data-intensive approaches. Even relatively easily parallelizable approaches like the bootstrap method, which has been used with EVT (Ghosh et al., 2011; Kao and Ganguly, 2011) to characterize uncertainties in return level estimates of climate variables, can benefit significantly from parallel processing. The recently developed "bag of little bootstraps" method (Kleiner et al., 2014) is reported to significantly improve the time complexity of the bootstrap method for large data sets, with theoretical guarantees of correctness of uncertainty estimates. The adaptation of these techniques in space and time for observed and model-simulated climate data across multimodel and multiple initial condition runs, and with different statistical estimation approaches, may represent major challenges.
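As an illustration of why bootstrap-based uncertainty estimates for return levels are data intensive yet easy to parallelize, the sketch below resamples a series of annual maxima and refits a return level for each replicate. It is a simplified stand-in, not the EVT procedure of the cited studies: a Gumbel distribution fitted by the method of moments is assumed for brevity, and the function names and sample data are hypothetical.

```python
import math
import random
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649


def gumbel_return_level(annual_maxima, return_period):
    """Return level from a Gumbel fit by the method of moments."""
    scale = stdev(annual_maxima) * math.sqrt(6.0) / math.pi
    loc = mean(annual_maxima) - EULER_GAMMA * scale
    # Quantile with non-exceedance probability 1 - 1/T.
    return loc - scale * math.log(-math.log(1.0 - 1.0 / return_period))


def bootstrap_interval(annual_maxima, return_period, n_boot=1000,
                       alpha=0.10, seed=0):
    """Percentile bootstrap interval for the return level.

    Each replicate is independent of the others, so the loop could be
    split across processors without changing the logic.
    """
    rng = random.Random(seed)
    n = len(annual_maxima)
    estimates = sorted(
        gumbel_return_level(rng.choices(annual_maxima, k=n), return_period)
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical annual maximum daily rainfall (mm) for 20 years.
annual_maxima = [42.1, 55.3, 38.7, 61.0, 47.5, 52.2, 44.8, 70.4, 49.9, 58.6,
                 40.3, 66.1, 45.7, 53.8, 48.2, 62.9, 43.5, 57.1, 50.6, 75.2]
point = gumbel_return_level(annual_maxima, 30)
lo, hi = bootstrap_interval(annual_maxima, 30)
print(f"30-year return level: {point:.1f} mm (90% interval {lo:.1f}-{hi:.1f})")
```

Repeating this for every grid cell, model, and initial-condition run multiplies the cost by several orders of magnitude, which is the scaling problem that motivates parallel and subsampling-based variants.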

Enhanced understanding and predictions
Climate extremes, such as heavy rainfall or tropical cyclones, are known to depend on other climate variables (including mean states, local or regional variables, as well as large-scale effects such as oceanic indices) that may be better simulated by models, such as land and sea surface temperatures. Developments in correlative analysis (Reshef et al., 2011; Khan et al., 2007; Kinney and Atwal, 2014), extended to handle correlated data at multiple spatial and temporal scales, may help quantify conceptual understanding and possibly even discover new dependencies (Khan et al., 2006). Challenges in analyses of historical extreme events such as tornado and hurricane data involve attributing spatial and temporal scales of their behavior to climate change versus natural variability (Emanuel et al., 2008; Webster et al., 2005), as well as to data collection issues for tornadoes and cyclones (Brooks and Doswell, 2001; Emanuel, 2005) and discontinuity of operational definitions for tornadoes (Doswell et al., 2009). Innovative data-driven approaches that consider these complexities are needed to build understanding of the physical behavior and drivers of tornadoes and hurricanes because physics-based modeling for these types of processes is still in early stages (Emanuel et al., 2008; Trapp et al., 2010).
New process understanding or novel insights from mining climate data may help enhance projections and ultimately reduce uncertainty. Although relatively coarse-resolution GCMs are not able to directly simulate tropical cyclones, they have been used to develop aggregate statistics of hurricanes (Emanuel et al., 2008) under climate change. In the same manner, temperature and updraft velocity profiles have been used to constrain or enhance multimodel projections of precipitation extremes (Knutson et al., 2010; Wilhite and Glantz, 1985). Additionally, ensembles have been found to simulate robust statistics of severe thunderstorm environments and imply increased risk in possible convective hazards under global warming (Diffenbaugh et al., 2013). These approaches point to the information content in auxiliary variables relevant for climate extremes and, with appropriate adaptations, may lead to a virtuous cycle where data-driven insights and process understanding mutually inform, complement, and improve each other. Recently, even tornado occurrences have been associated with monthly environmental parameters (Tippett et al., 2012), though not necessarily in a climate change context.
Linear dimensionality reduction has been used (Mishra et al., 2012) for advancing understanding of climate processes like monsoons, which are known to be important for hydrometeorological extremes. Relationships among variables in large, high-dimensional climate data can improve understanding of dominant processes and lead to enhanced projections through predictive modeling. The IPCC-SREX indicates that crucial processes that may influence climate extremes, such as El Niño or other climate oscillators and monsoons, are not well understood. Inferences from surrogate data may yield new insights on extremes processes: the use of ocean salinity data to understand the intensification of climate extremes (Durack et al., 2012) provides an example using a proxy data set for precipitation. Figure 6a-b provides an example where new data mining methods (Kawale et al., 2011, 2013) for dipole discovery were used to extract information about climate oscillators that may be useful for model evaluation. An intelligent combination of process understanding with data mining methods may yield new explainable predictive insights beyond statistical downscaling. In fact, the premise of statistical downscaling (discussed earlier), where one overall approach is linear dimensionality reduction followed by nonlinear regression (Ghosh, 2010), is that lower-resolution model outputs have information content about higher-resolution variables. We propose taking this one step further. Variables that are more reliably projected by climate models may be used not only to improve our process understanding but also to enhance projections of the climate extremes of interest.
Figure 6. Dipoles are a class of teleconnections, or long-range dependence in space, that represent a persistent and large-scale temporal negative correlation in a given climate variable between two neighboring or distant geographical locations. The dipoles shown here are generated using the shared reciprocal nearest neighbors (SRNN) algorithm, a graph-based approach (Kawale et al., 2013). The edges of the graph, shown in the figure, represent dipole connections between two regions, while the color in the background shows the SRNN density, where darker colors signify regions of higher connectivity. This class of methods may be useful for systematically detecting, refining existing, or even identifying new, climate teleconnections or oscillations, as well as for model evaluation. (c) While critical for statistical downscaling and relating ocean-based indices to regional land climatology, regression problems in climate may be particularly difficult to solve reliably, owing to issues like high dimensionality (a large number of input variables compared to the number of calibration or training data), proximity-based spatial correlations, and teleconnections. Here we use multiple ocean variables (as predictors or covariates) to predict changes in land precipitation for multiple regions using NCEP1 reanalysis. The results indicate that a new approach, called the Sparse Group Lasso (SGL: Chatterjee et al., 2012), outperforms ordinary least squares and LASSO regressions (Tibshirani, 2011) as per both error-based predictive accuracy and model parsimony. Model parsimony refers to simpler models with fewer parameters, which in turn tend to generalize better than more complex models, especially if predictive accuracy on training data remains identical or also gets lower. Where climate extremes of interest (e.g., hurricanes or rainfall extremes) are projected less reliably but relate to variables (i.e., potential covariates) that are better projected (e.g., oceanic or land temperature), methods such as the SGL and future innovations may enhance projections beyond model simulations alone. We note that the ordinary least squares (OLS) approach serves only as a very naïve baseline for this analysis. In the OLS approach shown here, the number of covariates selected is simply all covariates considered; OLS intrinsically assigns non-zero coefficients to all covariates. With so many covariates, the OLS model will almost certainly have nonsensical non-zero parameters due to issues like multi-collinearity. We acknowledge that procedures like stepwise least squares may improve on the naïve OLS shown here by reducing the dimensionality of the problem. However, forward versus backward versus mixed stepwise procedures have their own set of problems related to multi-collinearity and changes in coefficients with addition or removal of covariates, among others. Still, we present the naïve all-covariate OLS purely as a baseline for comparison without implying that it is or should be used in high-dimensional problems of this nature.
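The defining property of a dipole described in connection with Fig. 6 (a strong, persistent negative correlation between anomalies at two locations) can be illustrated with a brute-force pairwise scan. This sketch is not the SRNN algorithm of Kawale et al. (2013), which works on nearest-neighbor graphs over full gridded fields; the function names, the threshold of -0.8, and the three short anomaly series are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def candidate_dipoles(series_by_location, threshold=-0.8):
    """Return (loc_a, loc_b, r) for every pair of locations whose anomaly
    series have correlation r <= threshold, most negative first."""
    pairs = []
    for (a, xa), (b, xb) in combinations(series_by_location.items(), 2):
        r = pearson(xa, xb)
        if r <= threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: p[2])


# Hypothetical pressure anomalies at three grid points.
anomalies = {
    "west": [1.2, -0.8, 0.5, -1.1, 0.9, -0.4, 1.0, -1.3],
    "east": [-1.0, 0.9, -0.6, 1.2, -0.7, 0.3, -1.1, 1.4],
    "north": [0.2, 0.1, -0.3, 0.4, 0.0, -0.2, 0.3, -0.1],
}
for a, b, r in candidate_dipoles(anomalies):
    print(a, b, round(r, 2))
```

A pairwise scan of this kind grows quadratically with the number of grid cells and does not group cells into regions, which is part of the motivation for graph-based approaches on global data.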
For enhanced climate projections, especially given the importance of spatiotemporal neighborhoods, prevailing winds, intra-decadal to multi-decadal climate oscillators, and teleconnections, the number of potential explanatory variables may far exceed the number of observations available, which creates problems for classic regression.
Popular dimensionality reduction approaches like empirical orthogonal functions (Hannachi et al., 2007) summarize complex data succinctly but may not necessarily do so in a way that maximizes information useful for predicting a specific variable. Sparse regression (Negahban and Wainwright, 2011; Negahban et al., 2012) represents a promising alternative in these situations. Sparse regressions based on constraining the L1-norm of the regression coefficients became popular due to their ability to handle high-dimensional data, unlike regular regressions, which suffer from overfitting and model identifiability issues, especially when the sample size is small. They are often the method of choice in many fields of science and engineering for simultaneously selecting covariates and fitting parsimonious linear models that generalize better and are easily interpretable. Sparse regularization methods have just begun to be applied to statistical downscaling (Ebtehaj et al., 2012; Phatak et al., 2011). However, these methods can also be applied for improved understanding of the complex dependence structure between climate variables, especially in a high-dimensional setting (Chatterjee et al., 2012; Das et al., 2012, 2013). Dimensionality reduction techniques that utilize manifold, atomic, and topological structures derived directly from physical laws (Kpotufe, 2009, 2011; Balakrishnan et al., 2013a, b; Kpotufe and Garg, 2013; Lum et al., 2013; Wang et al., 2014) could at once make the prediction problem both more computationally tractable and more physically sensible. High-performance computational challenges related to this general approach represent an active area of research.
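The way an L1 constraint selects covariates when predictors outnumber observations can be sketched with plain LASSO solved by cyclic coordinate descent. This is a minimal illustration under stated assumptions, not the Sparse Group Lasso of Chatterjee et al. (2012), which adds a group-level penalty; the function names, the penalty value, and the synthetic data are hypothetical, and no intercept or standardization is included.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of n rows with p entries each. Coefficients are updated one
    at a time while the others are held fixed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            if new_bj != beta[j]:
                delta = new_bj - beta[j]
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Synthetic example with more covariates (p = 20) than observations (n = 10);
# only covariates 0 and 5 truly influence the response.
rng = random.Random(0)
n, p = 10, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
true_beta = [0.0] * p
true_beta[0], true_beta[5] = 3.0, -2.0
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.1)
     for row in X]
beta_hat = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected covariates:", selected)
```

Ordinary least squares has no unique solution in this setting because the design matrix has more columns than rows, whereas the penalty sets many coefficients exactly to zero. Larger values of `lam` give sparser models at the cost of more shrinkage.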
Networks that connect nodes defined as spatial grid-cells (Steinhaeuser et al., 2011b; Donges et al., 2013) or climate oscillators (Donges et al., 2009a), often known as "climate networks", may be useful for representing climate dependencies and developing process understanding (Donges et al., 2009a, b). Figure 6c provides an example of new data-driven predictive approaches (Chatterjee et al., 2012) that appear well-suited for high-dimensional and geographically distributed climate data with complex dependence structures. Network-based graphical models have been used to discover causality among different modes of climate variability (Ebert-Uphoff and Deng, 2012; Runge et al., 2009). Applications of methods in nonlinear data sciences, from complex networks (Steinhaeuser et al., 2011a) to multifractals (García-Marín et al., 2013; Muzy et al., 2006) and dynamic Bayesian networks (Troy et al., 2013), have demonstrated initial promise for better description of, and predictive insights on, climate-related extremes such as extreme monsoonal rainfall over south Asia (Malik et al., 2011). Certain methods may eventually be applicable in a climate change detection context, potentially making similar innovations useful not only for long-horizon prediction and uncertainty reduction but also for relatively abrupt change and disturbance analysis or even for early warning systems.

Summary
One of the largest scientific gaps in climate change studies is the inability to develop credible projections of extremes with the degree of precision required for adaptation decisions and policy (Fischer et al., 2013). The dire consequences of climate-related extremes, even in developed economies (Gall et al., 2011), may call for a range of well-informed adaptation strategies from low-regret (Wilby and Keenan, 2012) to transformative (Kates et al., 2012). Improving regional projections (e.g., through variable selection or statistical downscaling) and characterizing natural variability (e.g., irreducible uncertainty at decadal scales: Sutton, 2009, 2011; Branstator and Teng, 2012; Deser et al., 2012a, b; Fischer et al., 2013; Hu and Deser, 2013; Rosner et al., 2014) are necessary for informing adaptation at stakeholder-relevant scales and planning horizons. As climate-related data approaches the scale of hundreds of petabytes (Overpeck et al., 2011) and climate data mining research continues to improve (Smyth et al., 1999; Robertson et al., 2004, 2006; Khan et al., 2006; Camargo et al., 2007a, b; Gaffney et al., 2007), new opportunities will emerge (e.g., Schneider et al., 2013; Monteleoni et al., 2013; Ganguly et al., 2013). The 2014 Climate Data Initiative (Lehmann, 2014) launched by the White House (United States President's Office) points to big data as a solution for climate adaptation and lends further urgency to the theme discussed in this manuscript. However, despite the promise, pitfalls in pure data mining methods have been pointed out in the context of climate. For example, Caldwell et al. (2014) show how naive applications of data mining may yield spurious relationships in climate. This paper emphasizes the need to intelligently combine an understanding of physics with data mining, not just to avoid the risk of generating misleading insights, but also to produce novel results that may not have been possible otherwise.
Data-driven methods may be complementary to physics and may need to be constrained by physics (e.g., see Majda and Yuan, 2012; Majda and Hardin, 2013). When mining climate model simulations, data mining is conditioned on the embedded physics in the models, and aspires to extract relations that may further inform and augment our current physical understanding. However, to be successful, data mining methods need to be aware of the complexity of climate processes and data. The methods may be motivated from often disparate data-science disciplines such as statistics and econometrics, machine learning and data mining in computer science, nonlinear dynamics in physics, and signal processing in engineering. The blend of physics and data-driven insights has conceptual similarities with data assimilation methods (e.g., Gerber and Joos, 2013). However, data assimilation methods are ultimately constrained by the physics encoded within climate models, and updates to parameters or state variables cannot be made in the future where no observations exist. The physics-guided data mining discussed here refers to, for example, physics-motivated decomposition into component processes (e.g., Ganguly and Bras, 2003, offers an example in weather forecasting), physically motivated variable selection in statistical downscaling (e.g., certain analog methods; Zorita and von Storch, 1998), or physics-based model selection (Fasullo and Trenberth, 2012) and physically guided climate networks (Donges et al., 2009b; Steinhaeuser and Tsonis, 2013). The climate extremes exemplars discussed here are a collection of outstanding challenges where data mining already does or can play an innovative new role; various scientific communities will have to decide which specific directions to pursue guided by a combination of stakeholder priorities and which problems they are best positioned to address.
Once developed and refined, physics-guided data mining methods are well positioned to produce new scientific understanding and credible projections of climate extremes leading to more informed adaptation and policy.