An improved ARIMA model for precipitation simulations

Abstract


Introduction
Hydrological processes are complicated; they are influenced by not only deterministic, but also stochastic factors (Wang et al. 2007).The deterministic change in a hydrological process is always accompanied by the stochastic change.Generally speaking, determinism includes periodicity, tendency, and abrupt change.A strict deterministic hydrological process is rare.Stationary time series has been widely used in hydrological data assimilation and prediction to tackle the stochastic factors in hydrological processes.From the point of view of stochastic processes, hydrological data series usually comprises trend term and stationary term.The basic idea of Auto Regressive Integrated Moving Average (ARIMA) model, one of the most commonly used time series model, is to remove the trend term of series by difference elimination, so that a nonstationary series can be transformed into a stationary one.Some researchers have used ARIMA model for the analysis of hydrological process without considering the effects of seasonal factors (Jin et al. 1999;Niua et al. 1998;Toth et al. 1999).However, most studies (Ahmad et al. 2001;Lehmann et al. 2001;Qi et al. 2006) neglected stationary test and the influence from inter-monthly variation within a year.In this paper, the seasonal ARIMA model is improved by removing the effect of seasonal factors, and the improved model is tested through a case study.The paper is organized as follows: the ARIMA model is introduced first, followed by the introduction of the issues in the currently existing ARIMA model and our proposed methods to improve it.A case study is conducted and discussion is addressed finally.

ARIMA model
A hydrological time series {y , 1,2, , } t tn  could be either stationary or nonstationary.Given that there are essentially no strictly deterministic hydrological processes in nature, the analysis of hydrological data by means of nonstationary time series is of importance, among which ARIMA model is one of the available choices.

ARIMA model
For a stationary time series, ARMA ) , ( q p model is defined as follows: Where p denotes the autoregressive (AR) parameters, q represents the moving average (MA) where () B


is the autoregressive operator and () B  is the moving-average operator.
Then the model can be simplified as If   t y are nonstationary, we can obtain the stationarized sequence t z by means of difference, i.e., (1 ) where d is the number of regular differencing.Then the corresponding ARIMA ) , , ( q d p model for t y can be built (Box et al. 1997), where d is the number of differencing passes by which the nonstationary time series might be described as a stationary ARMA process.

Seasonal ARIMA( , , ) p d q model
Most hydrological time series have obviously seasonal (quasi-periodic) variation (Box et al. 1967), representing recurring of hydrological processes over a relatively (but not strictly) fixed time interval.Monthly data series often shows a seasonal period of 12 months while quarterly data series always present a period of 4 quarters.Seasonality can be determined by examining whether the autocorrelation function of the data series with a specified seasonal order is significantly different from zero.For instance, if the autocorrelation coefficient of a monthly data series with new data series formed by a lag of 12 months is not significantly different from 0, the monthly data series does not have a seasonality of 12 months; if the autocorrelation coefficient is significantly different from 0, it is very likely this monthly data series has a seasonality of 12 months.A seasonal ARIMA model can be built for a data series with seasonality.
For a time series   t y , its seasonality can be eliminated after D orders of differencing with a period of S .If a further d orders of regular differencing is still needed in order to make the data series stationary, a seasonal ARIMA can be built for the data series as follows, ( ) ( )(1 ) (1 ) ( ) ( ) where P is the number of seasonal autoregressive parameter, Q is the seasonal moving average order, S is the period length (in month in this work), and D denotes the number of differencing passes.

Implementation of ARIMA model
The procedure of estimating ARIMA model is given by the flowchart in Fig. 1 which involves the following steps: (1) Stationary identification.The input time series for an ARIMA model needs to be stationary, i.e., the time series should have a constant mean, variance, and autocorrelation through time.
Therefore, the stationarity of the data series needs to be identified first.If a time series is identified nonstationary, differencing is usually made to stationarize the time series.In the differencing method, the correct amount of differencing is normally the lowest order of differencing that yields a time series which fluctuates around a well-defined mean value and whose autocorrelation function (ACF) plot decays fairly rapidly to zero, either from above or below.The time series is often transformed for stabilizing its variance through proper transformation, e.g., logarithmic transformation.Although logarithmic transformation is commonly used to stabilize the variance of a time series rather than directly stationarize a time series, the reduction in the variance of a time series is usually helpful to reduce the order of difference in order to make it stationary.
(2) Identification of the order of ARIMA model.After a time series has been stationarized, the next step is to determine the order terms of its ARIMA model, i.e., the order of differencing, d for nonstationay time series, the order of auto-regression, p , the order of moving average, q , and the seasonal terms if the data series show seasonality.While one could just try some different combinations of terms and see what works best strictly, the more systematic and common way is to tentatively identify the orders of the ARIMA model by looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots of the sationarized time series.The ACF plot is merely a bar chart of the coefficients of correlation between a time series and lags of itself and the PACF plot present a plot of the partial correlation coefficients between the series and lags of itself.The detailed guidelines for identifying ARIMA model parameters based on ACF and PACF, can be found elsewhere, e.g, Pankratz (1983).It should be noted that, to be strict, the ARIMA model built in this step is actually an ARMA model with if the time series is stationary, which is in fact a special case of ARIMA model with 0 d  .
(3) Estimation of ARIMA model parameters.While least square methods (linear or nonlinear) are often used for the parameter estimation, we use the maximum likelihood method (Mcleod, 1983;Melard, 1984) in this paper.A t -test is also performed to test the statistical significance.
(4) White noise test for residual sequence.It is necessary to evaluate the established ARIMA model with estimated parameters before using it to make forecasting.We use white noise test here.If the residual sequence is not a white noise, some useful information has not been extracted and the model needs to be further tuned.The method is illustrated as follows.
Null hypothesis: 0 : corr( , ) 0 , Alternative hypothesis: The autocorrelation of the data series is measured by the autocorrelation coefficient which is defined as where n is the number of cases of sample of series for white noise test, m is the maximum number of lag.In practice, m uses the value of The test statistics is given by 2 1 ( 2) Given the degree of confidence of Then Q fits the 2  distribution at the significance of 1   and the null hypothesis is accepted.
(5) Hydrological forecasting.The linear least squares method is usually applied for rainfall-runoff prediction.In general, based on the n observation values, the values of future L time steps can be estimated (Kohn et al. 1986).

Improvement of conventional ARIMA model
Seasonal ARIMA models apply for time series which arranges in order with a certain time interval or step, e.g., a month.However, in this case, while the seasonal ARIMA model is capable of dealing with the inter-annual variation of each monthly of a monthly data series, the information of inter-monthly variation of the time series may be lost.For example, after an order of 12 of seasonal differencing (term S in a general seasonal ARIMA model) of a monthly time series, the original monthly series has been migrated to a new time series without seasonality.A nonseasonal ARIMA model is then fitted to the new time series where the inter-monthly variation of original monthly time series has also migrated to the inter-monthly variation of the new series after seasonal differencing.
The transformation of inter-monthly variation of original monthly data to the new inter-monthly variation of seasonally differenced series may result in loss of accuracy of model performance.In this study, twelve individual seasonal ARIMA models for precipitation prediction for each month are built from each monthly data series, e.g., the January data series from 1951 to 2000, which are referred to as ARIMA models of inter-annual variation ignoring the inter-monthly variation.
In order to prevent from losing the inter-monthly variation information, we propose in this study the following improvement to the conventional seasonal ARIMA model, which simultaneously takes into account both kinds of temporal variation (inter-annual variation and inter-monthly variation).
Clustering analysis is first applied to classify the monthly data series and extract characteristics of each data series class (Sun et al. 2005).In this study, we use Euclidean distance as the distance measurement in clustering analysis.The characteristics of each data series refer to the maximum, minimum, and truncated mean of the series of this class.A linear regression model is then built with hydrological variable to be predicted, e.g., monthly precipitation, as dependent variables and with maximum, minimum, and truncated mean of each class as independent variables in the linear regression model.For example, a monthly precipitation would be described as a linear regression function of the maximum, minimum, and truncated mean of the data series of a class where this month's precipitation has been clustered in the clustering analysis.A conventional seasonal ARIMA model is built for the maximum, minimum, and truncated mean of each class, respectively, accounting for the inter-monthly variation of each characteristic variable.By this way, we are trying to avoid losing the inter-monthly variation information.The implementation of the improved ARIMA model involves the following procedure, as illustrated in Fig. 2. i).Perform clustering analysis on monthly data, and group the months with similar hydrological variation.
ii).Find the maximum, minimum, and truncated mean of each cluster.
iii).Build linear regression models and determine the associated parameters for each monthly data series.For example, for the precipitation in the i -th month, class where the time series of the i -th month is identified in cluster analysis.iv).Build ARIMA models for the maximum, minimum, and truncated mean of each class and predict the characteristics for the time year of interest using the established ARIMA models.v).Substitute the predicted characteristics into the linear regression model built in Equation ( 10) and obtain the monthly hydrologic variable, say precipitation.

Case study
In this section, we are presenting an application of the proposed improved ARIMA model to the precipitation forecasting of Lanzhou precipitation station in Lanzhou, China.Lanzhou is located in the upper basin of Yellow River.It has a continental climate of mid-temperate zone, with an average precipitation of 360 mm and mean temperature of 10℃.In general, rainfall seasons are May through September, while drought occurs in spring and winter.The Lanzhou precipitation station is located at 103.70°E, 35.90°N.The monthly precipitation data from 1951 to 2000 is used for parameter estimation and the monthly precipitations of 2001 are then predicted using the proposed model and compared with the observation values.In order to show the improvement of this present approach, we first build a conventional seasonal ARIMA model and a set of 12 ARIMA models for each monthly precipitation series which account for the seasonal variation.The improved ARIMA model accounting for both inter-month and inter-annual variation of monthly precipitation time series is then built using the presented approach and its prediction results are compared with the conventional ARIMA model and seasonal ARIMA model, as well as auto-regressive models.

Conventional seasonal ARMA modeling
The precipitation at the Lanzhou precipitation station from 1951 through 2001 and from 1991 through 2001 are plotted as shown Fig. 3 (a) and (b) respectively.The two figures show less precipitation in winter and spring and more in summer and autumn.Fluctuation occurs to the data during high precipitation seasons.Using power transformation with an order of 1/3, fluctuations at high values are removed and the data become stationary, as shown in Fig. 3(c).According to autocorrelation and partial correlation functions, as shown in Fig. 4, seasonal term with a period of 12 exists.With the difference elimination method, the order of the model can be determined from, and the following seasonal ARIMA model is obtained.
The maximum-likelihood method is then used for parameter estimation and the results are listed in Table 1.As shown in Table 1, parameter estimation is statistically significant.A white noise test is performed for the residual sequence.If the test does not pass, the model needs to be improved.As shown in Table 2, with a significance level of 5%, the test is passed, i.e., the useful information is extracted and the model is acceptable.

Individual ARIMA model for each month data series
As discussed in Section 2.2, the data can be classified into 12 groups associated with each month respectively.Stationary identification, stationary treatment, model identification, parameter estimation and residual test are performed for the 12 groups of data.A total of 12 ARIMA models are built and the estimated parameters are shown in Table 3.

The improved ARIMA model based on clustering and regression analysis
Box-Cox transformation is applied as a pretreatment of data for clustering analysis in order to stable the variance of the monthly precipitation data series.Given that the precipitation has values of zero resulting in negative infinity in the transformation, Box-Cox transformation (Thyer et al., 2002;Meloun et al., 2005;Ip et al., 2004) is corrected as follows.
(original data 1) 1 0 Data after transformation log( ) 0 original data After Box-Cox transformation, as shown in Fig. 6, the data are much more symmetric than the original data series, which is helpful for the later clustering analysis.Moreover, it can be seen that there are many zero precipitation values in the raw monthly precipitation data series and so does the transferred data.This indicates that the samples of data sequence may not be from one individual population but from multiple populations which further implies the necessarily of clustering analysis for the data series.Clustering analysis with Euclidean distance is then applied which indicates that the monthly precipitation sequences can be clustered into three classes, as shown in Fig. 7.
It is interesting that the clustering results are mostly coincides with the precipitation season.For example, Class 1 looks like corresponding to the drought season while Class 3 corresponds to the rainfall season.After the clustering analysis to the monthly precipitation time series, the characteristics of each class, i.e., maximum, minimum, and truncated mean, are identified, as shown in Fig. 8. Whereas fluctuations in the mean and minimum data series are relatively small, relatively larger variation are shown in the maximum data series.
Linear regression models for each monthly precipitation are fitted using the characteristics of each class where the monthly precipitation data series is located.The parameters corresponding to each linear regression model are presented in Table 4 which pass the t -test at the significance of 0.05 indicating that those linear models fit their data series well respectively.Following the steps described in Section 2.3, nine ARIMA modes are built for each of the characteristic variables of each class.The estimated parameters are shown in Table 5. Auto-regressive models with orders of 24 and 36, or AR (24) and AR (36), are also fitted to the monthly precipitation time series for comparative study with the improved ARIMA model and conventional ARIMA model.

Results and discussion
The monthly precipitations of 2001 are predicted using the improved ARIMA model as well as the conventional seasonal ARIMA model, the 12 seasonal ARIMA models for the precipitation of each month, and AR(24) and AR(36) models, the prediction results shown in Table 6 and Fig. 9.The absolute error of each method is 9. 41, 11.49, 11.78, 17.05, and 17.82While the improved ARIMA model catches the correct trend overall and predicts the monthly precipitation in most months with high accuracy, it predicts highly accurately for the dry seasons, such as January, February, March, November, and December.However, it overestimates the precipitation of July and October and underestimates the September precipitation significantly.After a closer look at the data, we find that the mean precipitations of July and October are 63.8 and 23.48 mm over the period of 1951 through 2000, respectively, whereas the observation precipitations of significantly higher than that of the conventional ARIMA model.This improved approach can be applicable to other hydrological processes prediction with time series data, such as runoff, water level, and water temperature.
Apparently, the present model could be further improved, especially for the prediction of extreme phenomena.Given that the selection of clustering method does affect model performance, different clustering methods, e.g., the definition of distance in the hierarchical clustering can be applied (Wang et al. 2005) to obtain better fittings.Characteristics value should be constructed by the features of hydrological time series, not limited to the extreme or mean values.A higher order of regression model rather than the linear regression may be used for the hydrologic forecasting.Last but not the least, artificial intelligence approaches, such as neural network or support vector machine, can be used to further improve the proposed ARIMA model.
coefficients in the model for the i -th month hydrologic parameter, e.g., precipitation, which need to be estimated, maximum, minimum, and truncated mean of the j -th mm for the improved ARIMA model, conventional ARIMA model, individual ARIMA for each month data series, AR(24), and AR(36), respectively, indicating that the improved ARIMA presented in this paper performs the best with the smallest errors.Compared with the conventional ARIMA model, the improved ARIMA model increases the prediction accuracy by 24%.The conventional ARIMA model predicts accurately for March, June, August, ad November but mismatches the other months' precipitation.It predicts more accurately for October precipitation than the improved ARIMA model.The 12 individual ARIMA models for each month data series performs similarly to the conventional ARIMA model.The overall performance of AR(24) model does not show difference from that of AR(36) model; neither models perform as good as the improved ARIMA model or the conventional ARIMA model.However, the AR models give a better prediction for September precipitation of 2001 than the other two models.

Table 2 . Autocorrelation of the residuals of the conventional seasonal ARIMA model
*: Autocorrelations of residue for lag 1 through lag 48, 6 lags per row from Column 5 through 10.

Table 4 . Estimated parameters for linear regression models
See Eq. (10) for definition.