Indian Monsoon rainfall extremes



1 Introduction
Indian rainfall extremes have been studied by many researchers over the past few decades, with varying published results. Goswami et al. (2006) used daily central Indian rainfall in a linear regression framework and noted significant rising trends in the frequency and magnitude of extreme (≥ 150 mm day^-1) rain events. Ghosh et al. (2009) conducted a similar study at a finer spatial scale, using 1° × 1° data rather than the 10° latitude × 12° longitude (central India) region used in Goswami et al. (2006). Their findings indicated more of a mixture of increases and decreases in extreme rainfall events, dependent on location. Ghosh et al. (2012) expanded on this result and added that there has been increasing spatial variability in observed Indian rainfall extremes; the methodology in that study included extreme value theory (EVT) in an attempt to model the extremes explicitly. Turner and Annamalai (2012) indicated that even complex climate models have high uncertainty in projections of monsoon rainfall because of linked variability on different temporal and spatial scales.
Regression techniques and EVT have thus far failed to adequately capture the broad statistical properties of Indian summer monsoon precipitation. Several drivers of the monsoon have been proposed in the literature, including Himalayan/Eurasian snow extent (Kumar et al., 1999), Pacific trade winds (Li and Yanai, 1996), atmospheric CO2 concentration (Prell and Kutzbach, 1992), and, most commonly, sea surface temperatures (SST) attributed to the El Niño-Southern Oscillation (ENSO) (Turner and Annamalai, 2012; Kumar et al., 1999; Li and Yanai, 1996; Prell and Kutzbach, 1992). However, none has been conclusively shown to drive the monsoon rainfall, which suggests an intricate relationship among some or all of these factors.
Based on the above context, we propose using a generalized linear mixed model (GLMM) as a potentially broader framework for the analysis of Indian monsoon precipitation data. A GLMM is a broader framework than standard (linear, log-linear, logistic, or other) regression in that random effects are involved. This implies that part of the signal is random and changes from one set of circumstances to another. In the current context, a GLMM may be a suitable model for capturing local, instantaneous variability; such local variability may arise from cloud and other physical microproperties. If there is no such local variability, the appropriate variance component in the GLMM would be zero, thus recovering any true underlying "fixed-effects" regression model. GLMMs are commonly used in epidemiological and other biostatistical areas where repeated measures are available for each "subject" (stations, states, etc.). A GLMM essentially allows some underlying forces to drive the observable data in a particular hierarchy. This could be key in climate applications, as we may not be able to sufficiently attribute drivers of extreme rainfall to observable data. One issue in GLMMs that has been studied since their introduction in Stiratelli et al. (1984) is consistent estimation, i.e., whether, with enough observations, the estimates essentially represent the truth.

This paper provides a first look at how GLMMs can be extended to handle climate applications, where extremes and complicated temporal and spatial dependency patterns are important. We present a case study on Indian summer monsoon precipitation. One of our major findings is that the random effect variance component is non-negligible in the analysis of extreme precipitation from the Indian monsoon. It may be significant for light or moderate precipitation as well, although further studies are needed to verify this claim. The estimates of variability in the models for extreme rainfall showed a meaningful correspondence to El Niño events over the time frame examined, which also deserves further analysis.
Section 2 provides a short background on GLMMs and, in particular, elucidates the logit-normal model. Section 3 provides a background on the theory and application of several estimation methods for GLMMs. Section 4 provides the results of several simulations using these existing methods. Section 5 applies these methods, in their current state, to precipitation data from India. Finally, Sect. 6 presents conclusions and future work in this area.

2 GLMM background
Assume we have data from an exponential family that are independent conditional on a vector of unobservable random effects, where subject IDs run over i = 1, ..., m and the (possibly correlated) measurements per subject are indexed by j = 1, ..., n_i. Following the notation of McCulloch and Searle (2010), a GLMM can be written as

\[
E(Y_{ij} \mid U) = \mu_{ij}, \qquad g(\mu_{ij}) = x_{ij}^{T}\beta + z_{i}^{T}U . \qquad (1)
\]

Here Y_i = (Y_{i1}, ..., Y_{i n_i})^T, U is the vector of random effects of suitable length, x_{ij} is the vector of covariates for the fixed effects of the i-th subject at the j-th time, z_i is the vector of covariates for the random effects of the i-th subject, β is the vector of fixed-effect coefficients, and g is the link function.
A highly applicable version of this model is the logit-normal GLMM. A simple hierarchical form of the model is

\[
U_i \overset{iid}{\sim} N(0, 1), \qquad
Y_{ij} \mid U_i \sim \mathrm{Bernoulli}(p_{ij}), \qquad
\mathrm{logit}(p_{ij}) = x_{ij}^{T}\beta + \sigma U_i .
\]

This implies that the density for a single observation Y_{ij} is

\[
f(y_{ij} \mid u_i) = p_{ij}^{\,y_{ij}}\,(1 - p_{ij})^{1 - y_{ij}} .
\]

Thus, the density for the i-th subject is

\[
f(y_i \mid u_i) = \prod_{j=1}^{n_i} p_{ij}^{\,y_{ij}}\,(1 - p_{ij})^{1 - y_{ij}} .
\]

The assumption of conditional independence among subjects implies that the joint density of the vectors y_i is

\[
f(y \mid u) = \prod_{i=1}^{m} f(y_i \mid u_i),
\]

and the assumption of independent standard normal random effects implies that the joint density of (Y, U) is

\[
f(y, u) = \prod_{i=1}^{m} f(y_i \mid u_i)\,\phi(u_i), \qquad (2)
\]

where \phi denotes the standard normal density.

However, since the random effects are unobserved, to utilize the observed-data likelihood one must find the marginal distribution with respect to the observed data Y only:

\[
L(\beta, \sigma \mid y) = \int f(y, u)\, du
= \prod_{i=1}^{m} \int_{-\infty}^{\infty} \prod_{j=1}^{n_i} p_{ij}^{\,y_{ij}}\,(1 - p_{ij})^{1 - y_{ij}}\,\phi(u_i)\, du_i . \qquad (3)
\]
This integral is almost never analytically tractable; thus, maximum likelihood estimation is very difficult, if not impossible. Many methods for inference have been proposed. Variants of some of the most popular methods are examined later in this paper.
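
To make the intractability concrete, a minimal Monte Carlo sketch of a single subject's contribution to Eq. (3) is shown below; it assumes the intercept-only case and is purely illustrative, not part of the methods compared in this paper.

```r
## Monte Carlo approximation of one subject's marginal likelihood contribution
## in the intercept-only logit-normal model (illustrative sketch).
marg_lik_i <- function(y_i, mu, sigma, n_mc = 10000) {
  u <- rnorm(n_mc)                              # draws of the standard normal random effect
  p <- plogis(mu + sigma * u)                   # success probability for each draw
  contrib <- sapply(seq_len(n_mc),
                    function(s) prod(dbinom(y_i, size = 1, prob = p[s])))
  mean(contrib)                                 # average over the random effect
}

## Example: a hypothetical subject with 10 binary observations
# marg_lik_i(y_i = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1), mu = 2, sigma = 1)
```
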
3 Methods for estimating in GLMMs

3.1 Penalized quasi-likelihood with Laplace approximation

Breslow and Clayton (1993) were the first to introduce a feasible method for computation in GLMMs. Penalized quasi-likelihood (PQL) approximates the high-dimensional integration using the well-known Laplace approximation (Tierney and Kadane, 1986), so that the approximated likelihood takes the form of a Gaussian distribution.
There are a few implementations of PQL that were used within the simulations. The first function, glmer{lme4}, is from a very popular mixed-modeling package in R. According to Bates (2010), the conditional mode ũ is defined as the value that maximizes the unscaled conditional density given in Eq. (2). From there, ũ is determined as the solution to a penalized nonlinear least squares (PNLS) problem, solved by adapting iterative techniques such as the Gauss-Newton method. Once the algorithm has converged, the method calculates the Laplace approximation to the deviance, in which the Cholesky factor and the penalized residual sum of squares are both evaluated at the conditional mode ũ.
The second function, glmmPQL{MASS}, relies on multiple calls to lme{nlme}, the predecessor of the lme4 package. PQL is reasonably accurate when the data are approximately normal and can be very fast depending on the algorithm used for implementation. However, Lin and Breslow (1996) and others have criticized this method for its bias with highly non-normal data.
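
For concreteness, a minimal sketch of how the two functions can be called for a binary random-intercept model is given below; the toy data frame and variable names are our own illustration, not the calls used in the study.

```r
library(lme4)
library(MASS)

## Toy data so the calls below actually run (the real data would replace this).
set.seed(1)
dat <- data.frame(subject = factor(rep(1:30, each = 10)),
                  x1 = rnorm(300), x2 = rnorm(300))
dat$y <- rbinom(300, 1, plogis(-1 + 0.5 * dat$x1 + rep(rnorm(30), each = 10)))

## Laplace approximation via lme4::glmer (nAGQ = 1 gives the Laplace fit)
fit_glmer <- glmer(y ~ x1 + x2 + (1 | subject),
                   data = dat, family = binomial(link = "logit"), nAGQ = 1)

## Penalized quasi-likelihood via MASS::glmmPQL (iterated calls to nlme::lme)
fit_pql <- glmmPQL(y ~ x1 + x2, random = ~ 1 | subject,
                   family = binomial(link = "logit"), data = dat, verbose = FALSE)
```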

3.2 Method of simulated moments
Jiang (1998) describes a methodology known as the method of simulated moments (MSIM). The method first derives a set of sufficient statistics. Estimating equations are then obtained by equating sample moments of the sufficient statistics to their expectations. In the simple logit-normal, intercept-only case (x_{ij} = z_i = 1, with balanced data, n_i = n), an example of the system of equations to solve is

\[
\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} y_{ij}
  = \mathrm{E}\!\left[\frac{e^{\mu+\sigma\xi}}{1+e^{\mu+\sigma\xi}}\right],
\qquad
\frac{1}{mn(n-1)}\sum_{i=1}^{m}\sum_{j\neq k} y_{ij}\,y_{ik}
  = \mathrm{E}\!\left[\left(\frac{e^{\mu+\sigma\xi}}{1+e^{\mu+\sigma\xi}}\right)^{2}\right],
\]

where ξ ~ N(0, 1). The expectations on the right-hand side are generated by Monte Carlo simulation, and the whole system can then be solved by the Newton-Raphson method or some asymptotically equivalent algorithm. We implemented this method in a newly created R package. As shown in Jiang (1998), this method is consistent and computationally much less intensive than a Markov chain Monte Carlo (MCMC) method. However, its limitations include slow convergence and less than ideal small-sample properties.
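
As a rough illustration of these estimating equations (not the implementation in our R package), a minimal sketch is given below; y is assumed to be an m × n matrix of binary responses, and the Newton-Raphson solve is replaced by a generic optimizer for brevity.

```r
## Method of simulated moments for the intercept-only logit-normal model (sketch).
## y: m x n matrix of 0/1 responses; theta = c(mu, sigma).

msim_moments <- function(theta, n_mc = 10000) {
  xi <- rnorm(n_mc)
  p  <- plogis(theta[1] + abs(theta[2]) * xi)   # sigma taken as positive
  c(mean(p), mean(p^2))                         # Monte Carlo E[h] and E[h^2]
}

msim_discrepancy <- function(theta, y) {
  m  <- nrow(y); n <- ncol(y)
  s1 <- sum(y) / (m * n)                                  # first sample moment
  s2 <- (sum(rowSums(y)^2) - sum(y)) / (m * n * (n - 1))  # second sample moment
  sum((c(s1, s2) - msim_moments(theta))^2)
}

## Solve the moment equations by minimizing the squared discrepancy
## (a stand-in for the Newton-Raphson iteration described above).
# fit <- optim(c(0, 1), msim_discrepancy, y = y)
# c(mu = fit$par[1], sigma = abs(fit$par[2]))
```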

3.3 Data cloning
GLMM estimates can be produced in a traditional Bayesian framework: one chooses priors for the parameters of interest and calculates the posterior distribution by multiplying the prior densities by the likelihood in Eq. (3). One may then use MCMC to generate a dependent sample from the posterior distribution, from which estimates can be derived based on strong laws. Lele et al. (2010) derived a method called data cloning to be used in conjunction with MCMC. The basic idea of their algorithm can be summarized in three steps. First, create a k-cloned data set Y^{(k)} = (Y, Y, ..., Y) in which the observed data vector is repeated k times. Second, using an MCMC algorithm, generate a dependent sample from the posterior distribution π_k(θ | Y^{(k)}) corresponding to the k-cloned data. Third, calculate the sample means and variances of the components of θ; the MLEs of θ correspond to the sample means, and the approximate variances of the MLEs correspond to k times the posterior variances based on the k-cloned data. This method was implemented using dclone{dclone}, discussed in Solymos (2010). The method is computationally intensive because it involves MCMC; however, it may provide more accuracy than MSIM in small samples.
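
A minimal sketch of this recipe, assuming the intercept-only logit-normal model with an m × n response matrix y, is shown below; the JAGS model, the prior choices, and the number of clones are illustrative assumptions rather than the exact specification used in this study.

```r
## Data cloning sketch for the intercept-only logit-normal model (illustrative).
library(dclone)   # interfaces JAGS through rjags

model <- function() {
  for (i in 1:m) {
    u[i] ~ dnorm(0, 1)                     # standard normal random effect
    for (j in 1:n) {
      logit(p[i, j]) <- mu + sigma * u[i]
      y[i, j] ~ dbern(p[i, j])
    }
  }
  mu ~ dnorm(0, 0.01)                      # vague priors (assumed, not necessarily the paper's)
  sigma ~ dunif(0, 10)
}

## Step 1: build the k-cloned data set by stacking k copies of the rows of y.
k   <- 5
y_k <- do.call(rbind, replicate(k, y, simplify = FALSE))
dat <- list(y = y_k, m = nrow(y_k), n = ncol(y_k))

## Steps 2-3: sample from the k-cloned posterior; the posterior means approximate
## the MLEs, and k times the posterior variances approximate their sampling variances.
# fit <- jags.fit(dat, params = c("mu", "sigma"), model = model)
# summary(fit)
```
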
As an additional method, we proposed a hybrid of the nonparametric bootstrap and the data cloning idea (Boot Dclone). This was done by generating the data, running a nonparametric bootstrap on the rows of the data to create (k − 1) new data sets, and then appending the original and all of the bootstrapped data into one large set. From there, the data were processed by dclone as if they were a single clone. In essence, this created an approximately k-cloned data set to be run through the appropriate MCMC algorithm.
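
A minimal sketch of this construction, again assuming an m × n response matrix y, is:

```r
## Approximate k-cloning via a nonparametric bootstrap on the rows of the data.
k <- 5
boot_sets <- lapply(seq_len(k - 1), function(b)
  y[sample(nrow(y), replace = TRUE), , drop = FALSE])
y_bootclone <- do.call(rbind, c(list(y), boot_sets))   # original + (k - 1) bootstrap copies
## y_bootclone is then passed to the MCMC algorithm as if it were a single clone.
```
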
4 Simulations

4.1 Simulation setup

Simulations of the intercept-only logit-normal mixed model were conducted with all combinations of 10, 50, 200, and 1000 subjects (m) and 2, 10, 50, and 200 observations per subject (n). All five methods were tested using 100 random data sets; the means and standard errors of the 100 sets of estimates were then calculated. We set the true values of the parameters to (μ, σ²) = (2, 1).

To describe the estimation discrepancy between μ and μ̂_{m,n}, we used squared-error loss, Q(μ̂_{m,n}) = (μ̂_{m,n} − μ)². Because squared-error loss is criticized when the parameter space is bounded, as it is for σ² > 0, we used Stein's loss,

\[
S(\hat{\sigma}^{2}, \sigma^{2}) = \frac{\hat{\sigma}^{2}}{\sigma^{2}} - \log\!\left(\frac{\hat{\sigma}^{2}}{\sigma^{2}}\right) - 1,
\]

to measure how well σ² was estimated. A combined loss was then calculated as the sum of the two, Q(μ̂_{m,n}) + S(σ̂², σ²).
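
A minimal sketch of generating one simulated data set under this setup is given below; the function name is ours.

```r
## Generate one m x n data set from the intercept-only logit-normal model
## with true values (mu, sigma^2) = (2, 1).
sim_logitnormal <- function(m, n, mu = 2, sigma = 1) {
  u <- rnorm(m)                                   # standard normal random effects
  p <- plogis(mu + sigma * u)                     # subject-level success probabilities
  matrix(rbinom(m * n, size = 1, prob = rep(p, each = n)),
         nrow = m, ncol = n, byrow = TRUE)
}

y <- sim_logitnormal(m = 50, n = 10)
```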

4.2 Simulation analysis
The results of the simulations are displayed in Tables 1-5. Unsurprisingly, all methods failed to reasonably estimate both μ and σ² in the smallest scenario, with 10 subjects and 2 observations each. MSIM, dclone, and glmer estimated μ within 2 standard errors for all other settings. These methods also provided reasonable estimates of σ² for settings other than those involving 10 subjects. In general, the three methods showed a mixture of over- and underestimates for each parameter, with no obvious bias.
Boot Dclone displayed an obvious positive bias for both parameters in all estimates; this was especially noticeable in settings with only 2 observations per subject. Reasonable estimates for μ were made in some of the settings with 200 observations per subject; σ² was never estimated well. glmmPQL did not converge toward the true values and appears to display an underestimating bias in both the μ and σ parameters. There were also issues with the function failing to produce estimates in some of the 100 simulations.

4.3 Simulation speed comparison

Four of the 16 data settings were subsequently tested to determine the speed of the different methods: the combinations of 50 and 200 subjects with 10 and 200 observations per subject. The system.time() command in R was used to record times. The simulations were each run in one R script on a single core; four computers were used:

- assawa: 2010 Frontier i7 8-core Intel i7 940 (2.93 GHz) with 3 GB of RAM
- geneva: 2011 Frontier i7 8-core Intel i7 950 (3.07 GHz) with 6 GB of RAM
- nokomis: 2012 Optiplex 7010 8-core Intel i7-3770 (3.40 GHz) with 8 GB of RAM
- tilde: 2013 Optiplex 7010 8-core Intel i7-3770 (3.40 GHz) with 8 GB of RAM

MSIM was implemented in two ways for the speed test. In the simplest, intercept-only form of the logit-normal model, the code can be reduced to eliminate the use of matrices and can handle much larger data; this is the implementation used in the previous section. However, a more general form is needed for problems that include covariates; this form relies on matrices and cannot work with large data sets at this time. These implementations are referred to as MSIM Fast and MSIM Slow, respectively. The dclone and Boot Dclone methods were implemented with 5 clones each, a more reasonable setting than 100 clones for large data.
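
For illustration, a single fit can be timed as in the sketch below, using the sim_logitnormal sketch above; the actual benchmarking script looped over methods, settings, and replicates.

```r
## Record the elapsed time of one glmer fit on a simulated data set (sketch).
library(lme4)
y <- sim_logitnormal(m = 50, n = 10)
dat_long <- data.frame(y       = as.vector(t(y)),
                       subject = factor(rep(seq_len(nrow(y)), each = ncol(y))))
t_glmer <- system.time(
  glmer(y ~ 1 + (1 | subject), data = dat_long, family = binomial)
)["elapsed"]
```
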
The results in Tables 6-9 indicated that glmmPQL was the fastest in the 50-subject cases and glmer was the fastest in the 200-subject cases. It is not surprising that these two methods were the best in terms of speed, given the nature of the approximations that they make. dclone and Boot Dclone were slower: the fastest of either took around 4 min, while the longest took nearly 25 min. MSIM Fast was faster than the MCMC methods and only slightly slower than the approximation methods, at 3 to 6 s per run. The MSIM Slow method was much slower, with a best case of 1.5 min and a worst case of nearly 4 h. The case with 200 subjects and 200 observations could not be handled by MSIM Slow because the size of the matrices and vectors exceeded the storage capacity allowed by R.


4.4 Simulation conclusions
In general, glmer and glmmPQL were the fastest, but glmmPQL did not give good estimates. dclone gave accurate estimation but was much slower than the approximation methods. Boot Dclone was very poor in both speed and accuracy. MSIM proved to be reasonably fast in the intercept-only implementation; however, it was much slower than the other methods in its matrix version and failed when too many observations were used.
It should be noted that tuning parameters within each of the methods, such as the convergence criteria for MSIM or the number of MCMC draws in dclone, may impact the timing significantly; for MSIM, we used 100 Monte Carlo simulations and, for Newton's method, a convergence criterion of a Euclidean norm of change ≤ 0.01. Because the bootstrapping dclone method and glmmPQL showed poor accuracy in the simulations, neither is used in the following application section.
5 Application to Indian rainfall extremes

5.1 Data
We applied the logit-normal mixed model to Indian rainfall data using the fixed covariates of latitude (°), longitude (°), elevation (m), and daily minimum and maximum temperatures (tenths of °C). Because it is reasonable to assume that each weather station is fundamentally different, we fit a random intercept by station. Data sets of differing temporal and spatial granularity have been used in the literature, but the initial goal here was to develop and test the methods on one widely and freely available data set. Thus, the data were collected from the National Climatic Data Center (NCDC) of the National Oceanic and Atmospheric Administration (NOAA) for 1 January 1973 to 31 December 2013.
Data were queried for all available Indian stations in the database. This produced a total of 149 weather stations with daily precipitation as well as all of the necessary covariates. However, as seen in Tables 11-13, 8 of these stations have no actual observations, 14 others have fewer than 3 observations, and another 22 have fewer than 1000 of the possible days in the date range. As an example, on 8 August 2012 (a date within the rainy season for most of India), there were 33 stations with NA values, 12 stations with precipitation of 0 mm, and 31 stations with greater than 0 mm of precipitation. This implies that several stations were not included in the data for this day and that, in general, the stations included in the data change over time. Figure 1 illustrates the rainfall on this date. Attri and Tyagi (2010), in the India Meteorological Department report, defined 3 categories of rainfall: light rainfall (0 < x < 64.4 mm day^-1), light to heavy rainfall (64.4 ≤ x < 124.4 mm day^-1; referred to as moderate rainfall below), and extreme rainfall (≥ 124.4 mm day^-1). All stations were marked each day with indicators of these categories to be used in the modeling. Because of the magnitude of the data, only observations considered to be within the monsoon season were used; this conservatively includes the period from 1 May to 31 October (184 days) of each year. In order to understand the changing variability by station over time, we fit models for each year from 1973 to 2013.
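
A sketch of this preprocessing and of a single year's extreme-rainfall fit is given below; the object and column names (rain_daily, prcp, tmin, tmax, station, etc.) are illustrative placeholders, not the NCDC field names.

```r
## Flag the rainfall categories, restrict to the monsoon season, and fit one
## year's extreme-rainfall model with a random intercept by station (sketch).
library(dplyr)
library(lme4)

monsoon <- rain_daily %>%
  filter(month >= 5, month <= 10) %>%                      # 1 May - 31 October
  mutate(light    = as.integer(prcp > 0     & prcp < 64.4),
         moderate = as.integer(prcp >= 64.4 & prcp < 124.4),
         extreme  = as.integer(prcp >= 124.4))

## One model per year; shown here for a single (arbitrary) year.
# fit_extreme <- glmer(extreme ~ lat + lon + elev + tmin + tmax + (1 | station),
#                      data   = filter(monsoon, year == 1998),
#                      family = binomial(link = "logit"))
```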

5.2 Light and moderate rainfall results
First, note that all methods failed to produce estimates for the year 1975 because there were only 3 stations with observations in the data set. Fixed coefficient estimates for light rainfall are seen in Fig. 2. Within the light rainfall model, the fixed coefficient estimates remained relatively constant over time for each method but did show some movement around the year 2000, especially in the intercept, minimum temperature, and maximum temperature coefficients. All coefficients other than the intercept appear to be very close to 0, indicating that they may have little explanatory power in the case of light rainfall.
Moderate rainfall fixed coefficient estimates, found in Fig. 3, showed more of a cyclic pattern over time. Latitude coefficients tended to increase over time while longitude coefficients decreased from slightly positive to slightly negative, indicating a higher incidence of moderate rainfall in the west. However, both of these coefficients were very close to 0 over time and may not significantly explain rainfall. Minimum temperature and maximum temperature also showed mirroring behavior over time but, again, were very close to 0: in general, the minimum temperature coefficient decreased as the maximum temperature coefficient increased.
Variability estimates for light and moderate rainfall can be found in Fig. 4a and b, respectively. Within light rainfall, variability was nearly 0 over time for each of the methods. Slight increases were seen in the 1998-1999 and 2002-2004 time periods. MSIM produced a much higher, and likely anomalous, estimate for 2006 than all of the other methods.
Moderate rainfall still indicated relatively low variability, with more of a cyclic nature over time for each of the methods. Peaks were seen in the 1976, 1990, 1998-1999, and 2002-2004 time periods. The methods produced fairly similar estimates of variability, save for the 2008-2012 period, when MSIM estimated a slightly lower variability than the other methods.

5.3 Extreme rainfall results
Again, MSIM, glmer, and glmmPQL could not produce estimates for the year 1975. dclone produced reasonable estimates of extreme rainfall for that year; thus, these were used in the figures for 1975.
The results of the fixed coefficient estimation for the annual data are seen in Fig. 5. Fixed coefficient estimates in the extremes remained relatively constant over time for each method. glmer disagreed with the other methods in the coefficient estimate for maximum temperature, but all methods agreed on each of the other fixed coefficients. The elevation and minimum temperature coefficients were effectively 0 in all methods over time. Latitude coefficients were mostly positive and longitude coefficients mostly negative, indicating a higher estimated probability of extreme rainfall in the northwest.
Variability estimates for extreme rainfall in Fig. 6a indicated a cyclic pattern in variability over time for each of the methods. If we compare the extreme-rainfall variability to the monthly Niño-3.4 Sea Surface Temperature (SST) index provided by NOAA, shown in Fig. 6b, a clear connection is noted. This index provides a monthly average SST anomaly in the Niño region from 5° N-5° S, 120-170° W at each time point. Positive values of the index correspond to El Niño events and negative values to La Niña events. Looking only at the El Niño events, classified as an SST anomaly of 0.5 or higher, one can see that most of the higher variability estimates follow this type of event.
The largest events, in the early 1980s and late 1990s, appear to have an effect on the variability estimates in the models over the following 5 years.
Visually, this index indicated a pattern similar to the variability estimates from the MSIM and glmer methods in the late 1980s and from MSIM and dclone in the late 1990s to early 2000s. Correlations computed between the December SST index value and each method's variability estimates can be seen in Table 10. MSIM variability estimates had the highest Pearson correlation (0.27), while dclone had the highest Spearman correlation (0.24). Kendall's τ was not larger than 0.20 for any of the methods. These correlations seem relatively weak; however, it is well known that ENSO is not the only relevant underlying cyclic activity affecting climate. It is also likely that there exists a random lag between an ENSO anomaly and the corresponding effect on precipitation in the Indian monsoon.
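
The correlations in Table 10 amount to standard calls of the following form; the vectors var_est_msim, var_est_dclone, var_est_glmer, and nino34_dec are hypothetical names for the yearly variance estimates and the December index values.

```r
## Correlation between yearly variance-component estimates and the December
## Nino-3.4 SST anomaly (sketch with hypothetical object names).
cor(var_est_msim,   nino34_dec, method = "pearson",  use = "complete.obs")
cor(var_est_dclone, nino34_dec, method = "spearman", use = "complete.obs")
cor(var_est_glmer,  nino34_dec, method = "kendall",  use = "complete.obs")
```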

6 Conclusions
We have shown that it is feasible, both theoretically and computationally, to use GLMMs in the context of modeling the precipitation that accompanies the Indian monsoon.
There appears to be physical significance to the models, based on the connection between the random effects in the extreme rainfall models and ENSO. This shows promise for modeling some of the unobservable physical features within the complicated network of interactions that underlie extreme climate patterns, while maintaining statistical integrity. The current methods of GLMM estimation explored in this article all exhibit some drawbacks but, even so, converged on similar answers in the application to Indian monsoons.
Since the relevance of using GLMMs in this context has been established, climate model output, such as that of CMIP5, will be explored to gain deeper intuition about the nature of this random effect. Further work on GLMMs in this context may include additional analysis of ENSO and other proposed drivers of the Indian monsoon in terms of their contributions to fixed or random effects.
Providing improvements to the GLMM estimation methods presented here, especially in computing time for MSIM, is another open research area. One limitation of GLMMs, as presented in this context, is the reliance on modeling the random effects as normally distributed; expanding the possible distributions of the random effects to include extreme value distributions would be a major breakthrough in mixed modeling. Addressing the issue of providing reasonable standard errors for the variance components in GLMMs would also lead to a more conclusive test of significance.

References

Lele, S. R., Nadeem, K., and Schmuland, B.: Estimability and likelihood inference for generalized linear mixed models using data cloning, J. Am. Stat. Assoc., 105, 1617-1625, 2010.

Li, C. and Yanai, M.: The onset and interannual variability of the Asian summer monsoon in relation to land-sea thermal contrast, J. Climate, 9, 358-375, 1996.

Table 6. Total system time (in seconds) results for assawa.

Table 7. Total system time (in seconds) results for geneva.

Table 8. Total system time (in seconds) results for nokomis.

Table 9. Total system time (in seconds) results for tilde.