Artificial neural networks and multiple linear regression model using principal components to estimate rainfall over South America

Several studies have been devoted to dynamic and statistical downscaling for analysis of both climate variability and climate change. This paper introduces an application of artificial neural networks (ANNs) and multiple linear regression (MLR) by principal components to estimate rainfall in South America. This method is proposed for downscaling monthly precipitation time series over South America for three regions: the Amazon; northeastern Brazil; and the La Plata Basin, which is one of the regions of the planet that will be most affected by the climate change projected for the end of the 21st century. The downscaling models were developed and validated using CMIP5 model output and observed monthly precipitation. We used general circulation model (GCM) experiments for the 20th century (RCP historical; 1970–1999) and two scenarios (RCP 2.6 and 8.5; 2070–2100). The model test results indicate that the ANNs significantly outperform the MLR downscaling of monthly precipitation variability.


Introduction
The forecasting of meteorological phenomena is a complex task. The mathematical, statistical, and dynamic methods developed in recent decades help address the problem, but there is still a need to investigate new techniques to improve the results. One of these techniques is statistical downscaling, which involves the reduction of the model's spatial scale. Downscaling techniques can be divided into two broad categories: dynamic and statistical. Dynamic techniques focus on numerical models with more detailed resolution, while statistical (or empirical) techniques use transfer functions between scales. Currently, numerical weather prediction (NWP) models can forecast various meteorological variables with acceptable accuracy (Ramírez et al., 2006).
Specifically, rainfall is of great interest, both for its climatic and meteorological relevance and for its direct effect on agricultural output, hydropower generation and other important economic factors. However, it is one of the most difficult variables to forecast because of its inherent spatial and temporal variability (Wilson and Vallée, 2002; Antolik, 2000). For this reason, the temporal and spatial scales involved are not yet resolved satisfactorily by the available numerical models (Olson et al., 1995). Ramos (2000), studying artificial neural networks (ANN) and multiple linear regression (MLR), found that the neural network method performed better than linear regression, although both showed good performance for monthly and seasonal rainfall. Ramírez et al. (2005), using observed daily rainfall in the São Paulo region, found that ANN outperformed MLR, which showed a high bias for days without rain. Ramírez et al. (2006) analyzed daily rainfall in Southeastern Brazil and concluded that the ANN method tended to predict moderate rainfall with greater accuracy during austral summer compared to Eta model forecasts. Mendes and Marengo (2010) reported that the daily rainfall in the Amazon Basin is better represented by ANN than by autocorrelation models.
In this context, the aim of this study is to conduct a statistical downscaling to estimate rainfall over South America, based on some models used in the fifth report of the IPCC (Intergovernmental Panel on Climate Change), by applying artificial neural networks and multiple linear regression using principal components.

Data
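We used monthly precipitation simulations for the austral summer (December-January-February) and winter (June-July-August) generated by ten models (Table 1) from the CMIP5 project (Coupled Model Intercomparison Project 5th Phase), obtained from the Earth System Grid Federation (ESGF) of the German Climate Computing Center (http://ipcc-ar5.dkrz.de) and the Program for Climate Model Diagnosis and Intercomparison (http://pcmdi3.llnl.gov). All model simulations for the 20th century were compared with the precipitation data of the CRU TS 3.0 (Mitchell and Jones, 2005), produced by the Climatic Research Unit (CRU), University of East Anglia (UEA). These data cover the period from 1901 to 2005 and have a spatial resolution of 0.5° × 0.5°. We used climate simulations for the 20th century (historical) in the 1970-1999 period, and projections for the 21st century (Representative Concentration Pathways, RCP 2.6 and 8.5) for the period 2070-2099, as defined by Moss et al. (2010).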
Our focus on South America is because it is one of the planet's regions that will be most affected by the climate change projected for the end of the 21st century (Marengo et al., 2010). According to Magrin et al. (2014), significant trends in precipitation and temperature have been observed in South America (SA). In addition, changes in climate variability and in extreme events have severely affected the region. The three sub-regions evaluated in South America were defined according to the precipitation regime: the Amazon (AMZ), Northeastern Brazil (NEB), and the La Plata Basin (LPB) (Fig. 1).

Artificial Neural Networks (ANN)
An ANN is a system inspired by the operation of biological neurons, with the purpose of learning a certain system. The construction of an ANN is achieved by providing a stimulus to the neuronal model, calculating the output and adjusting the weights until the desired output is achieved. An input is submitted to the ANN along with a desired target, a defined response for the output (when this is the case, the training is regarded as supervised). An error field is built based on the difference between the desired response and the output of the system. The error information is used as feedback for the system, which adjusts its parameters in a systematic way; in other words, the error backpropagation algorithm is used to train the network. According to Alsmadi et al. (2009), the backpropagation architecture is the most popular, effective, and easy-to-learn model for complex, multilayered networks. This network is used more than all others combined. This algorithm has a first phase with a functional propagation signal (feedforward) and a second phase with the backpropagation of the error (backpropagation).
In the first phase, the functional signal based on the inputs propagates through the network until generating an output, with the weights of synapses remaining fixed. In the second phase, the output is compared with a target, producing an error signal. The error signal propagates from the output to the input and the weights are adjusted in such a way as to minimize the error. The process is repeated until the performance is acceptable. As such, the performance of the ANN is strongly dependent on the data source.
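The two phases described above can be illustrated with a minimal NumPy sketch of a network with one hidden layer trained by batch gradient descent. This is not the configuration used in this study: the function name `train_ann`, the tanh activation, the number of hidden units, the learning rate and the mean squared error loss are all assumptions made for illustration.

```python
import numpy as np

def train_ann(X, y, n_hidden=4, lr=0.05, n_iter=600, seed=0):
    """Train a one-hidden-layer network by error backpropagation.

    Hypothetical settings (tanh hidden units, linear output, MSE loss);
    the paper does not specify these details.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    target = y.reshape(-1, 1)
    errors = []
    for _ in range(n_iter):
        # Phase 1 (feedforward): propagate the inputs with fixed weights.
        h = np.tanh(X @ W1 + b1)
        out = h @ W2 + b2
        # Phase 2 (backpropagation): compare the output with the target ...
        err = out - target
        errors.append(float(np.mean(err ** 2)))
        # ... and propagate the error signal from the output to the input.
        grad_out = 2.0 * err / len(X)
        grad_W2 = h.T @ grad_out
        grad_b2 = grad_out.sum(axis=0)
        grad_h = (grad_out @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
        grad_W1 = X.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        # Adjust the weights so as to reduce the error.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), errors
```

The returned `errors` list holds the training error at each iteration, which is the kind of curve one would inspect to judge when the performance has become acceptable.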
A first part of the data is used for training, the second is used for cross-validation, and the third part is used for testing. The architecture of the ANN used in the present study can be found in Fig. 2. It consists of an input layer, a hidden layer and an output layer. The number of intermediate units was obtained through trial and error. During the training, the performance of the ANN is also assessed on the validation set.
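A three-way partition of this kind might be sketched as follows. The fractions (60 % training, 20 % validation, 20 % testing) and the chronological ordering are assumptions for illustration only; the proportions actually used are not stated here, and `split_three_way` is a hypothetical helper name.

```python
import numpy as np

def split_three_way(X, y, frac_train=0.6, frac_val=0.2):
    """Split samples, in their given order, into training, validation and test sets.

    The default fractions are illustrative assumptions, not values from the paper.
    """
    n = len(X)
    n_train = round(n * frac_train)
    n_val = round(n * frac_val)
    i1, i2 = n_train, n_train + n_val
    train = (X[:i1], y[:i1])
    val = (X[i1:i2], y[i1:i2])
    test = (X[i2:], y[i2:])  # whatever remains is held out for testing
    return train, val, test
```

With 30 seasons of three months each (90 samples) and these assumed fractions, the three sets would contain 54, 18 and 18 samples, respectively.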
The structure of the ANN used here involves training of 11 predictors (10 outputs of the models plus the observation data) as input to the network, and the best network performance is selected. We therefore expect that the ANN will be able to provide more reliable values (through the error analysis between the simulated values) than when using only climate models.

Multiple linear regression using principal components
Multiple linear regression (MLR) is a statistical technique that consists of finding a linear relationship between a dependent (observed) variable and more than one independent variable (outputs of the GCMs). A multiple regression model can be represented by the following equation:

$$Y_i = a + b_1 X_1 + b_2 X_2 + \dots + b_m X_m + C$$

where $Y_i$ is the dependent variable, $X_1, X_2, \dots, X_m$ are the independent variables, $a$ is the intercept, $b_1, b_2, \dots, b_m$ are the multiple regression coefficients, to be estimated by the least squares method (Wilks, 1995), and $C$ is the error term.
In spite of their obvious success in many applications, MLRs present multicollinearity when employed with climatic variables. In this regard, the parameter estimation errors can be incorrectly interpreted (Leahy, 2000). To resolve this problem, we used principal components (PCs). This method seeks to reduce the number of variables through orthogonal transformations, and to remove the multicollinearity of the independent variables. The PCs of the explanatory variables are therefore a new set of variables with the same information as the original variables, but uncorrelated.
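A regression on principal components can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the exact procedure of this study: the predictors are standardized before the decomposition, the PCs are obtained by singular value decomposition, and the number of retained components `n_pc` is chosen by the user. The function name `pc_regression` is hypothetical.

```python
import numpy as np

def pc_regression(X, y, n_pc):
    """Least squares regression of y on the first n_pc principal components of X."""
    # Standardize each predictor (assumption: PCs of the correlation matrix).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the orthogonal directions (loadings) of the PCs.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)   # proportion of variance per PC
    scores = Z @ Vt[:n_pc].T              # uncorrelated PC time series
    # Fit intercept a and coefficients b_1..b_n_pc by least squares.
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    return coef, explained, scores, fitted
```

The cumulative sum of `explained` gives the accumulated proportion of variance, which is one common basis for deciding how many components to retain. Because the score columns are mutually uncorrelated, the multicollinearity among the original predictors no longer affects the coefficient estimates.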

Validation of the ANNs
After using the precipitation simulations for the period 1970-1999 with the ANNs, we obtained a final error after a number of iterations, which ranged from 1 to 600 (Fig. 3). One of the difficulties of using ANNs consists of identifying the best stopping point for training (Haykin, 2001), because the training error starts out with a maximum value, decreases rapidly and then levels off, indicating there is no more error to correct. In the summer, the network became stable more rapidly, indicating that the GCMs employed converge to the same pattern of precipitation.
With respect to winter, the networks remained unstable for a longer time before finding the minimum error. The NEB region should be highlighted, which required the largest number of iterations, around 600. This is possibly related to the greater variability of rainfall in this season (Fig. 3).
According to Villanueva (2011), it is assumed that the three sets (training, validation and testing) contain independent samples, and that they are capable of representing the problem being addressed. One should therefore expect that good performance on the validation set will imply good performance on the testing set. In this study, the validation values were closest to the test values in summer.
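One common way to choose a stopping point from a validation error curve is a patience rule, sketched below. This is an assumed criterion shown only for illustration; the stopping rule actually applied to the networks is not described here, and the names `early_stopping_epoch`, `patience` and `min_delta` are hypothetical.

```python
def early_stopping_epoch(val_errors, patience=20, min_delta=0.0):
    """Return (epoch, error) of the best validation error under a patience rule.

    Training is considered finished once the validation error has not improved
    by more than min_delta for `patience` consecutive iterations.
    """
    best_err = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # error has levelled off: stop scanning
    return best_epoch, best_err
```

A small `patience` stops early on a curve that levels off quickly, while a larger value tolerates a longer unstable phase before the minimum is accepted.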

Validation of the MLR by PC
To validate the MLR, the following assumptions need to be met: (i) the residuals must have random distribution around mean zero (homoscedasticity); (ii) the residuals should have a normal distribution; and (iii) variance must be homogeneous (da Silva and Silva, 2014). Figures 4 and 5 show that the residuals versus adjusted values meet the assumption of homoscedasticity. With respect to the Q-Q plot, the quantiles of the residuals versus the normal distribution indicate that all regions present normality in the residuals, given that the closer the residuals are to the line, the closer they are to having a normal distribution. The employed data therefore fit the MLR by the PC model.
Based on the PC analysis (Table 2), one can see that in summer for the AMZ region, the accumulated proportion explains around 77 % in NEB and 80 % in PC6, while in winter, the PC1 of the AMZ explained 71 % and PC3 explained 72 % in NEB, thus representing the greatest variability of precipitation in these regions. In general, one can observe that a smaller number of climate models were required in winter to capture the variance of precipitation in these regions. Similar behavior of PCs in both seasons stands out in the LPB region, which may be due to the failure of GCMs to capture the variance of precipitation in this region.
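The Q-Q comparison of residual quantiles with the normal distribution can be sketched as follows. The plotting positions `(i - 0.5) / n` and the use of the correlation between the two sets of quantiles as a summary number are assumptions made for this illustration; the figures in this study rely on visual inspection, and the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return theoretical normal quantiles and standardized sorted residuals."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n           # assumed plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    z = (r - r.mean()) / r.std(ddof=1)
    return theo, z

def qq_correlation(residuals):
    """Correlation between the Q-Q points; values near 1 lie close to the line."""
    theo, z = qq_points(residuals)
    return float(np.corrcoef(theo, z)[0, 1])
```

Plotting `z` against `theo` gives the Q-Q plot itself; the closer the points are to the 1:1 line, the closer the residuals are to a normal distribution.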
Tables 3 and 4 show the Pearson's correlation coefficients, at a significance level of 5 %, between the ANNs and the observed data, and between the MLR by PC and the observed data, respectively. One can see that in both downscaling methods used, the highest correlations occur in winter in all regions under study, indicating that the models are better able to represent the variability of precipitation during this season. Ramírez et al. (2006) performed statistical downscaling for the precipitation forecast for the Southeast of Brazil, using ANNs and MLR with the Eta model. The results suggested that the precipitation forecasts using ANNs performed better in winter than in summer, since the synoptic forcing is more pronounced and the deep convective activity is less common. One can also observe that in the regions NEB (ANN × Obs) and LPB (MLR × Obs), the correlations of 38 and 20 %, respectively, were not statistically significant. The lowest correlation occurred in the LPB region. Seth et al. (2010) stated that the mean of the set of models reveals weaker moisture transport east of the Andes, which may be one of the factors that induce underestimation of precipitation in this region.
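As an illustration of how a correlation and its significance can be evaluated, the sketch below computes Pearson's coefficient and a two-sided p value. Note that the p value here comes from a permutation test, which is a substitute chosen to keep the example free of external dependencies; it is not claimed to be the test used for Tables 3 and 4. The function names and the number of permutations are assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p value for the null hypothesis of no correlation."""
    rng = np.random.default_rng(seed)
    r_obs = abs(pearson_r(x, y))
    count = 0
    for _ in range(n_perm):
        # Shuffling y breaks any association with x.
        if abs(pearson_r(x, rng.permutation(y))) >= r_obs:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A correlation would be regarded as significant at the 5 % level when the returned p value is below 0.05.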

Downscaling scenarios
Table 5 presents the results of the monthly precipitation simulation for the end of this century (2071-2100) based on the ten GCMs described previously in the RCP scenarios 8.5 and 2.6, in relation to the reference period 1971-1999 (observation) for the two downscaling methods.
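The change relative to the reference period can be expressed both in absolute terms and as a percentage. The sketch below shows one straightforward way to compute both; the function name `precipitation_change` and the numbers in the usage note are hypothetical and are not taken from Table 5.

```python
def precipitation_change(future_mean, reference_mean):
    """Return the change in mean monthly precipitation.

    Both inputs are in mm per month; the result is (absolute change in
    mm per month, relative change in percent of the reference mean).
    """
    abs_change = future_mean - reference_mean
    pct_change = 100.0 * abs_change / reference_mean
    return abs_change, pct_change
```

For example, with made-up values of 150 mm per month for the future period and 120 mm per month for the reference period, the function returns an increase of 30 mm per month, or 25 %.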
In both scenarios, and employing both ANNs and MLR, an increase of precipitation in the summer and a decrease in the winter can be observed.These results corroborate the findings of Mendes and Marengo (2010), who used ANNs and autocorrelation to study changes in monthly precipitation for the Amazon Basin in scenarios A2, A1B and B1, derived from five models of the CMIP3, used in the IPCC AR4.The authors found an increase in precipitation in the summer months and a reduction in winter.
In the NEB region (Table 5), an increase of precipitation in summer of around 30 % was observed. With respect to winter, one can see a reduction of 40 % in the higher forcing scenario (RCP 8.5) and of 10 % in the lower forcing scenario (RCP 2.6). The IPCC AR5 presented CMIP5 precipitation projections for the end of the century (2081-2100) showing increased precipitation from October to March over the southern part of Southeast Brazil and the La Plata Basin. From April to September, the CMIP5 ensemble projects precipitation increases over the La Plata Basin and northwestern SA near the coast (Stocker et al., 2013). According to Magrin et al. (2014), at seasonal scales, rainfall reductions during winter and spring in southern Amazonia may indicate a late onset of the rainy season in those regions and a longer dry season. The changes are more intense for the late 21st century and for RCP 8.5 when compared to scenario RCP 2.6, as can be seen in Table 5.

Conclusions
This paper investigated the applicability of artificial neural networks and multiple linear regression analysis by principal components as temporal downscaling methods for the generation of monthly precipitation over South America (for current years and future scenarios). Both the ANN and MLR methods provided good fit with the observed data. This indicates that ANNs are a viable alternative for the modeling of precipitation in time series. ANNs can be compared with the statistical model, and this indicates that the networks are a potentially competitive tool.
The future scenarios used (RCP 2.6, lower climate forcing, and RCP 8.5, higher climate forcing) indicate an increase in precipitation in summer and a reduction in precipitation during winter according to both the methods used.
In general, the results showed that the use of ANNs produced more accurate results than MLR by PC, which can be attributed to the fact that ANNs perform tasks that a linear program is unable to do. In addition, one of the advantages of ANNs is their capacity for temporal processing, and thus their ability to incorporate not only concurrent, but also several predictive values, as inputs without any additional effort.


Table 1. List of models from the CMIP5 dataset used in this study.

Table 3. p value and Pearson's correlation coefficient at the 5 % significance level between the ANNs and observed data from the CRU in all regions under study.

Table 4. p value and Pearson's correlation coefficient at the 5 % significance level between the MLR by PCs and observed data from the CRU in all regions under study.

Table 5. Change in monthly precipitation in terms of an increase or decrease by the end of this century (2071-2100) in the scenarios RCP 8.5 and 2.6, in relation to the reference period 1971-1999 (observation), in mm month⁻¹ and percentage.