Predicting sea surface temperatures with coupled reservoir computers

Abstract. Sea surface temperature (SST) is a key factor in understanding the greater climate of the Earth, and it is also an important variable when making weather predictions. Methods of machine learning have become ever more present and important in data-driven science and engineering, including in important areas for Earth science. Here, we propose an efficient framework that allows us to make global SST forecasts using a coupled reservoir computer method that we have specialized to this domain, allowing for template regions that accommodate irregular coastlines. Reservoir computing is an especially good method for forecasting spatiotemporally complex dynamical systems, as it is a machine learning method that, despite many randomly selected weights, is highly accurate and easy to train. Our approach provides the benefit of a simple and computationally efficient model that is able to predict SSTs across the entire Earth's oceans. The results are demonstrated to generally follow the actual dynamics of the system over a forecasting period of several weeks.


Introduction
As most of Earth's surface is covered by water, global sea surface temperatures (SSTs) are an important parameter in understanding the greater climate of the Earth. The SST is an important variable in the study of marine ecology (Gomez et al., 2020; Novi et al., 2021), in weather prediction (Dado and Takahashi, 2017) and in projecting future climate scenarios (Pastor, 2021). However, the task of predicting changes in the SST is quite difficult due to large variations in heat flux, radiation and diurnal wind near the surface of the sea (Patil et al., 2016).
Given the importance of the SST to the broader Earth weather and climate system, there is significant interest in forecasting this spatiotemporally complex process. Methods for predicting changes in the SST can be divided into two different categories: numerical methods and data-driven methods. Numerical methods are based on the underlying knowledge of the governing physics behind the system and the simulation thereof. These are widely used to predict the SST over a large area (Stockdale et al., 2006; Krishnamurti et al., 2006). Data-driven methods, in contrast, encompass statistical and machine learning approaches, and they are widely used to predict the SST, often with little to no knowledge regarding the relevant physical principles behind the dynamics of the system, thereby reducing the complexity of the model. Several statistical methods that have been used include Markov models (Xue and Leetmaa, 2000; Johnson et al., 2000), linear regression (Kug et al., 2004) and empirical canonical correlation analysis (Collins et al., 2004). Meanwhile, machine learning methods have included techniques such as support vector machines (SVMs; Lins et al., 2013), long short-term memory neural networks (LSTMs; Zhang et al., 2017; Kim et al., 2020; Xiao et al., 2019a) and memory graph convolutional networks (MGCNs; Zhang et al., 2021). In this paper, we utilize coupled reservoir computers (RCs), thereby taking advantage of the reduced complexity of data-driven methods while still being able to predict temperatures globally due to the minimal training required by each RC. We have adapted the RC concept for spatiotemporal processes to allow for coupled local templates that accommodate the peculiarities associated with varying coastlines.
Reservoir computers have been shown to be excellent at predicting complex dynamical systems (Ghosh et al., 2021; Pandey and Schumacher, 2020), despite the relative simplicity of the approach. They have even been shown to be proficient in the prediction of spatiotemporally complex systems, such as the Kuramoto-Sivashinsky partial differential equation (Vlachas et al., 2020), and in cell segmentation (Hadaeghi et al., 2021). In our case, the use of coupled reservoir computers is introduced due to the large number of points on the map, which makes it computationally challenging to utilize a single large reservoir computer. The reservoirs are coupled together by making each reservoir a function of points outside its forecast domain, effectively creating overlap.

Background
Reservoir computing is a type of recurrent neural network where the only layer trained is the output layer; this is done with a simple linear method. Compared with traditional recurrent neural networks, reservoir computers utilize randomly generated input and middle weights, which, in effect, reduces training time significantly. The reservoir computer is stated as follows (Jaeger and Haas, 2004): The inputs X_i (of total length dx) represent the raw data which describe a system; in our case, these would be the temperatures at points in the sea. These are fed into the reservoir via the input matrix W_in, whose weights are determined by sampling a uniform distribution U(−σ, σ). The reservoir dimension N describes how many nodes there are within the reservoir. Therefore, in order to transform the inputs into the space of the reservoir, the dimensions of W_in will be N × dx. The reservoir state r, which evolves according to Eq. (1), carries information about the current and previous states of the system:

r_{i+1} = q(A r_i + W_in X_i + b). (1)

During training, the reservoir states are horizontally concatenated as time evolves to form the matrix R = [r_1 | r_2 | … | r_{t_train}], where t_train is the number of data points used to train the model and thus determines the computational complexity of the matrix operations stated in Eq. (2). The dimensions of R at the end of the training phase will then be N × t_train.
The reservoir matrix A, which contains the middle weights, is a sparse matrix with a set density d, with nonzero values that are sampled from a uniform distribution U(−β, β). The matrix A will have dimensions of N × N, corresponding to the N reservoir nodes. The spectral radius ρ of A is an important metaparameter in the formation of the RC (Jiang and Lai, 2019), and it can be adjusted by scaling A (which essentially just involves changing β). The activation function q(s) is usually picked to be a nonlinear function such as the sigmoid function or the hyperbolic tangent function. Often, a bias term b is included when one desires to shift the activation function by a set amount. After the training dataset is cycled through and the matrix R is completed, the output matrix W_out is then found via the ridge regression

W_out = Y R^T (R R^T + λ I)^{−1}, (2)

which utilizes the identity matrix I scaled by a regularization parameter λ to prevent overfitting; here, Y is the matrix of horizontally concatenated target outputs. As the output matrix transforms reservoir states to a desired output Y_{i+1} (of length px), the dimension of W_out will be px × N. The trained model can then be used to forecast autonomously by inserting the newly predicted values Y_{i+1} back into the reservoir on the next iteration as X_i.
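The training procedure above can be sketched in a few lines. The following is a minimal, self-contained illustration in Python/NumPy (the paper's implementation is in MATLAB); the toy data, sizes and metaparameter values here are our own illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

dx, N = 3, 200                       # input length and reservoir dimension (toy values)
sigma, beta, lam = 0.1, 1.0, 1e-6    # illustrative metaparameters

W_in = rng.uniform(-sigma, sigma, (N, dx))       # random input weights
A = rng.uniform(-beta, beta, (N, N))
A *= rng.random((N, N)) < 0.05                   # sparsify (density d = 0.05)
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # set spectral radius rho = 0.9

def step(r, x):
    # Reservoir update of Eq. (1), with q = tanh and no bias term
    return np.tanh(A @ r + W_in @ x)

# Toy data: three coupled sinusoids standing in for SSTs at three points
t = np.arange(0, 60, 0.02)
X = np.stack([np.sin(t), np.sin(1.3 * t), np.cos(0.7 * t)], axis=1)

# Drive the reservoir and collect states column by column into R
r = np.zeros(N)
R = np.empty((N, len(X) - 1))
for i in range(len(X) - 1):
    r = step(r, X[i])
    R[:, i] = r
Y = X[1:].T                                      # one-step-ahead targets

# Ridge regression of Eq. (2): W_out = Y R^T (R R^T + lam I)^{-1}
W_out = Y @ R.T @ np.linalg.inv(R @ R.T + lam * np.eye(N))
rmse = np.sqrt(np.mean((W_out @ R - Y) ** 2))
```

The one-step training fit (`rmse`) is typically very small for such smooth signals, reflecting how cheap the linear readout training is compared with backpropagation through time.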

Coupled reservoir computers
As we would like to predict the SST across the entire globe, it is computationally challenging to use a single reservoir computer to forecast for the entire map due to the number of points on the map. From the dataset, there will be a total of n · m points on the map. If we were to use a single reservoir computer, the size of the input dataset X_i would be around 0.71 · n · m (as 71 % of the Earth's surface is water) for each time step. It will be seen in Sect. 4 that the values of n and m that are ultimately used to train our model are 120 and 240, respectively; hence, the total number of points that are not on land across the map is about 20 448. Thus, even if the reservoir dimension N is only 15:1 with respect to the size of the input dataset (a relatively small ratio; e.g., the value of N commonly used for the 3D Lorenz system is 1000; Bollt, 2021), the size of the resulting matrix A would be 306 720 × 306 720, which, assuming each element requires 8 B, would require over 750 GB to store. Therefore, in this application, it is better to follow the methodology laid out by Pathak et al. (2018) to model the evolution of the SST with the use of smaller reservoir computers that each cover a small domain of the map. This approach is advantageous because the input and middle weights for a reservoir computer are chosen randomly, thereby allowing us to reuse the same input matrices W_in and the same reservoir matrix A between the individual RCs. Now, each individual RC learns the local behavior of the points within its forecasting domain (defined as a pack) while still being connected to the global system via coupling with its neighbors.
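The storage estimate above can be checked with quick arithmetic (assuming 8 B per double-precision element, as in the text):

```python
# Back-of-the-envelope check of the single-reservoir memory cost quoted above.
n, m = 120, 240
ocean_points = round(0.71 * n * m)   # ~20 448 non-land points on the map
N = 15 * ocean_points                # reservoir dimension at a 15:1 ratio
bytes_A = N**2 * 8                   # dense N x N middle-weight matrix, 8 B/element
gb_A = bytes_A / 1e9                 # ~752.6 GB, i.e., "over 750 GB"
```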
A pack is essentially a collection of contiguous indices on the greater map that will be assigned to a given RC. In other words, a pack is an RC model of the dynamics, for a local region of the globe, that is designed to accommodate any local peculiarities of the land-water interface and that also couples into other neighboring packs. This reservoir computer will then be solely responsible for predicting the temperatures within its pack as time evolves. For simplicity's sake, in this study, points within a pack are grouped in the shape of rectangles, where the number of rows of points within the pack is defined as n_pack and the number of columns as m_pack. Therefore, each RC will be responsible for forecasting a maximum of n_pack · m_pack spatial points on the map. As there are n · m total points on the map, the number of RCs needed will then be P_f = (n · m) / (n_pack · m_pack). An illustration of the various packs on a sample map is provided in Fig. 1.

Figure 1. The various packs on a sample map. The points within a pack that are blue are the eligible points in the ocean that the pack's designated reservoir computer will attempt to forecast. Points in green represent land, which are not eligible points and whose indices are ignored in the formation of the pack.
No points containing land are assigned to an RC; therefore, some packs will have more or fewer points than others, depending on these points on land. It should also be noted that no point on the map can be in more than one pack, as packs do not overlap. The SSTs of the points within the jth pack on the ith day are stacked in the vector X_{pack,i}^{j}, which is of length px. For the various reservoirs to interact with one another, coupling is introduced by including the SSTs of the neighbors surrounding a pack. The neighbors of a pack are the non-land points that either directly touch or lie on the corner of a point within the pack. As many points within a pack share similar neighbors, only unique neighbors are kept, and their SSTs on the ith day are compiled into the vector X_{neighbor,i}^{j}. The neighbors for a given pack on the sample map are illustrated below in Fig. 2.
The vectors X_{pack,i}^{j} and X_{neighbor,i}^{j} are combined to form the vector X_i^{j} (of length dx), which contains all of the SSTs for the pack and its neighbors on the ith day and is ultimately the input for the jth reservoir computer.

Figure 2. Illustration of the jth pack with its valid neighbors denoted by a yellow highlight. In this example, px is equal to 32 (the number of pack points in the ocean) and dx is equal to 54 (px plus the number of neighbors). Here, n_pack and m_pack are both equal to 6.
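The pack-and-neighbor bookkeeping can be sketched as follows. This is our own illustrative implementation in Python/NumPy (the function and variable names are not from the paper), demonstrated on a small all-ocean toy map; neighbors include diagonally adjacent (corner) points, and land points are excluded from both sets.

```python
import numpy as np

def pack_and_neighbors(ocean, r0, c0, n_pack, m_pack):
    """Return index lists of a pack's eligible points and its unique neighbors.

    ocean: boolean grid, True where the point is in the ocean.
    (r0, c0): top-left corner of the rectangular pack; n_pack x m_pack its size.
    """
    in_pack = np.zeros_like(ocean, dtype=bool)
    in_pack[r0:r0 + n_pack, c0:c0 + m_pack] = True
    pack_idx = np.argwhere(in_pack & ocean)          # eligible (ocean) pack points

    # Dilate the pack by one cell in all 8 directions to find the bordering ring
    ring = np.zeros_like(in_pack)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            ring |= np.roll(np.roll(in_pack, dr, 0), dc, 1)
    neigh_idx = np.argwhere(ring & ~in_pack & ocean)  # unique non-land neighbors
    return pack_idx, neigh_idx

# 10 x 10 all-ocean toy map with a 4 x 4 pack in the interior
ocean = np.ones((10, 10), dtype=bool)
pack_idx, neigh_idx = pack_and_neighbors(ocean, 3, 3, 4, 4)
```

For this interior 4 × 4 pack, px = 16 and the surrounding ring contributes 20 unique neighbors, so dx = 36; land cells, when present in `ocean`, simply drop out of both lists.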
As we would like the model to predict the SSTs within the pack on the next day, Y_{pack,i+1}^{j}, the reservoir computer for the jth pack can now be written as

r_{i+1}^{j} = q(A r_i^{j} + W_in X_i^{j} + b),  Y_{pack,i+1}^{j} = W_out^{j} r_{i+1}^{j}.

Each reservoir computer is trained over the entire training dataset from i = 1 to t_train d. Each pack contains a distinct output matrix W_out^{j}, such that the reservoir states are matched with the values of the SST within the pack on the next day. In order to save computer memory, it is advisable to create an array of input matrices with values of dx from 1 to (n_pack + 2) · (m_pack + 2) and then assign these to RCs with a matching value of dx, denoted by W_in^{dx}. One middle-weight matrix A is shared between all of the RCs, as the reservoir dimension N is set to be fixed between all of the reservoirs. The architecture of a single reservoir computer is described in Fig. 3.

Dataset

The years 2003 to 2020 of the dataset were selected to form the training and validation dataset. The data are given in an equirectangular format, which is used throughout the modeling process for simplification purposes, even though this consequently leads to a more refined mesh near the poles. The time series data were split into a training and a validation dataset, consisting of 6533 and 42 d, respectively. We choose not to normalize the data, as the data are univariate and the reservoir state is effectively scale-free, with scale being reintroduced by the trained output matrix.
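The weight-sharing scheme can be sketched as follows (Python/NumPy; all sizes are illustrative, not the paper's configuration): one shared middle-weight matrix A for every pack, plus a table of input matrices indexed by the input length dx.

```python
import numpy as np

# Illustrative sketch of sharing random weights across packs.
rng = np.random.default_rng(2)
N, sigma = 100, 1e-4                 # toy reservoir dimension and input scale
n_pack = m_pack = 6
max_dx = (n_pack + 2) * (m_pack + 2)  # pack rectangle plus its one-cell ring

A = rng.uniform(-1, 1, (N, N))       # one middle-weight matrix shared by all RCs
# One input matrix per possible input length dx; each pack looks up its own
W_in = {dx: rng.uniform(-sigma, sigma, (N, dx)) for dx in range(1, max_dx + 1)}

# A pack with px ocean points and some neighbors picks the matching W_in.
# Using the Fig. 2 example: px = 32 pack points plus 22 neighbors gives dx = 54.
px, n_neighbors = 32, 22
W = W_in[px + n_neighbors]
```

Only the per-pack output matrices W_out^j need to be stored individually, which is what makes the memory footprint manageable.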
In order to reduce the number of spatial points within the dataset, the data were discretized such that the SST was now on a global 1.5° grid. This was performed by grouping the original data points into 6 × 6 blocks and then taking the average over each group. If a grouping contained a point on land, this value was ignored in the computation of the average of the group. Therefore, the dataset for a given day went from a 720 × 1440 to a 120 × 240 dataset; hence, n = 120 and m = 240.
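A minimal sketch of this block averaging, assuming land cells are marked as NaN in the gridded data (Python/NumPy; the paper's implementation is in MATLAB):

```python
import numpy as np

def coarsen(sst, f=6):
    """Average f x f blocks of a (n, m) grid, ignoring NaN (land) cells."""
    n, m = sst.shape
    # Row-major reshape groups each f x f block into axes 1 and 3
    blocks = sst.reshape(n // f, f, m // f, f)
    return np.nanmean(blocks, axis=(1, 3))

fine = np.arange(720 * 1440, dtype=float).reshape(720, 1440)
fine[0, 0] = np.nan          # mark one cell as land for illustration
coarse = coarsen(fine)       # 720 x 1440 -> 120 x 240
```

A block that is entirely land would remain NaN in the coarse grid and would simply not be assigned to any RC.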

Forecasting
To subsequently forecast the global SST with the trained model, two different prediction types are performed. To begin testing the short-term accuracy of the model, daily predictions are performed over the course of 6 weeks. The actual values of the SST are fed as inputs into the reservoir during this time period, and the predicted values for the next day are read out. This type of forecast has a real-world application in the form of filling in SST datasets when there is cloud cover or data corruption (Case et al., 2008), as the model can be used to estimate data for the missing days. Then, to test the long-term accuracy of the model, the model is allowed to run autonomously over the same 6 weeks. Now, the model is still predicting SST each day, but it only has access to its own previous prediction. This form of the forecast would be more applicable to weather prediction, as it could be coupled with an atmospheric model to help predict near-future weather patterns.
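The two prediction modes can be contrasted schematically. The sketch below uses a toy untrained reservoir (all sizes, weights and the stand-in "actual SST" series are illustrative, not the trained model from this paper); the point is only the difference in data flow between the two loops.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dx = 50, 4                                    # toy sizes
A = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)
W_in = rng.uniform(-0.1, 0.1, (N, dx))
W_out = 0.01 * rng.standard_normal((dx, N))      # stand-in for a trained readout

def step(r, x):
    return np.tanh(A @ r + W_in @ x)

X_true = rng.standard_normal((42, dx))           # stand-in actual values, 42 d

# Mode 1: daily forecasts -- the actual values are fed in each day,
# and only the next-day prediction is read out.
r = np.zeros(N)
daily = []
for x in X_true:
    r = step(r, x)
    daily.append(W_out @ r)

# Mode 2: autonomous 6-week forecast -- the model's own previous
# prediction is fed back in place of the actual values.
r = np.zeros(N)
x = X_true[0]
autonomous = []
for _ in range(42):
    r = step(r, x)
    x = W_out @ r
    autonomous.append(x)
```

Mode 1 corresponds to gap-filling applications (one-day-ahead errors only), whereas mode 2 lets prediction errors compound over the full 42 d window.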
For both prediction types, the reservoir states are all cleared to zero prior to forecasting and then run over t_warmup days prior to the validation time frame, thereby providing an initial condition for the model to begin from. Via cross validation, several metaparameters (σ, N and t_warmup) were optimized, and the values that were found to perform the best are described in Table 1. It was noticed that the results did not improve significantly for a value of t_warmup greater than 35 d. Results did consistently improve with an increasing reservoir dimension N, leading us to choose the value N = 1000. One of the more sensitive metaparameters was the value of σ, which was found to provide optimal results when it had a magnitude of the order of 10^−4. In the spirit of simplicity, we choose not to rigorously optimize the remaining metaparameters; rather, we choose them based on heuristics. The model was implemented using the MATLAB R2019b platform on a personal laptop, and the training time was slightly over 40 min. It is likely, given the embarrassingly parallelizable nature of this task, that an implementation that leverages GPU-style computation could speed this stage up considerably. To observe the effect of the random input and middle weights on the model, 15 different models are created that all have the same metaparameters as described in Table 1. Even though the model predicts the SST across the entire Earth, several time series at arbitrarily chosen points are included in each section to locally validate the forecast; their coordinates can be found below in Table 2. In this local time series analysis, the forecasts from the 15 different models are averaged, and ±1 standard deviation in the predicted values is represented above and below the average value using a shaded outline.
To determine the quality of the forecast over the entire ocean, the mean absolute error (MAE), the root-mean-square error (RMSE) and the maximum error in the forecast for a given day across the entire map are found. To find the MAE in the forecast on a given day, a weighted average is performed on the error across the map, where e_{p,i} denotes the absolute error at point p on the map on the ith day. We perform this weighting because the mesh is more refined near the poles than near the Equator; hence, the area attributable to a given point p is not necessarily the same as for other points. It should also be noted that we treat areas near the poles consisting of time-dependent sea ice simply as points within the ocean. The actual area encompassed by a given point was found by simply using MATLAB's built-in "areaquad" function. The MAE in the forecast across the map on the ith day is then given by Eq. (8), where k is the number of points on the map that lie in the ocean (k ≈ 0.71 · n · m).
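The area-weighted metrics can be sketched as follows. Here each grid cell is weighted by the cosine of its latitude, a common stand-in for the per-cell areas that the paper obtains from MATLAB's areaquad function; the grid, latitudes and error field below are illustrative, not the paper's data.

```python
import numpy as np

def weighted_errors(pred, true, lat_deg, ocean):
    """Area-weighted MAE and RMSE plus the maximum error over ocean points."""
    # cos(latitude) approximates the relative cell area on an equirectangular grid
    w = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(pred)
    w = np.where(ocean, w, 0.0)          # land points carry zero weight
    e = np.abs(pred - true)
    mae = np.sum(w * e) / np.sum(w)
    rmse = np.sqrt(np.sum(w * e**2) / np.sum(w))
    max_err = np.max(np.where(ocean, e, 0.0))
    return mae, rmse, max_err

n, m = 120, 240
lat = np.linspace(89.25, -89.25, n)      # illustrative cell-center latitudes
ocean = np.ones((n, m), dtype=bool)      # all-ocean toy mask
true = np.zeros((n, m))
pred = np.full((n, m), 0.2)              # uniform 0.2 K error everywhere
mae, rmse, max_err = weighted_errors(pred, true, lat, ocean)
```

With a spatially uniform error, all three metrics reduce to the same value, which is a convenient sanity check on the weighting.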
Similarly, the RMSE on the ith day is then given by Eq. (9).
Finally, the maximum error is simply the largest error in the forecast across all points on the ith day. These error values are found for each of the 15 models every day in the forecasting period; subsequently, the average values and standard deviations between models are found to observe the effect of randomness.

Daily forecasts
Daily forecasting operates by continually inserting the actual SST of the previous day X_i into the reservoirs and then reading out the model prediction of the SST for the next day Y_{i+1}, subsequently repeating this procedure over the course of the validation time frame. The time series for the forecasted SSTs at the 10 different points are provided below in Fig. 4 and are also matched with the actual SSTs each day in Fig. 5. From Fig. 4, it is apparent that the forecasted SSTs closely follow the actual SSTs over the validation time frame for almost all locations. The model is seen to pick up on the changes in SST at locations where there is not a clear trend and the change in SST is seemingly chaotic, such as locations h, i and j, although there are some inaccuracies with the forecast at location g. There is also very little deviation between models, indicating that the effect of randomness on the model is practically negligible with regard to forecasting the daily SST. The correlation coefficients for the chosen sites are all 0.88 or greater, and the value of m is close to 1 (with minimal standard error) for most locations, indicating a strong relationship between the model's forecast and the actual values. To quantify how the model performs over the entire globe, the MAE, RMSE and maximum error for the daily forecasts are described below in Fig. 6. From Fig. 6, for each daily prediction, we see that the average MAE between models typically fluctuates between 0.13 and 0.15 K, the RMSE fluctuates between 0.18 and 0.24 K, and the maximum error fluctuates between 1 and 8 K.

Table 3. Regression statistics for the daily forecasts. Note that the slope m would ideally be equal to 1, the intercept b would be equal to 0 and the correlation coefficient r would be 1. Standard errors are provided for both m and b.
By calculating the mean of the model averages over the course of the 6-week forecasting period, we find that the model has an average MAE of 0.136 K, an average RMSE of 0.197 K and an average maximum error of 3.308 K for a 1 d prediction horizon. For context, for a 1 d prediction horizon, Xiao et al. (2019b) reported an RMSE of 0.35 K for their forecast over the East China Sea with the use of a deep learning model constructed from ConvLSTM (convolutional long short-term memory) networks, and Shi et al. (2022) reported an RMSE of 0.241 K for their cyclic evolutionary network model forecasting over the South China Sea. It should also be noted that our error values typically do not decrease over the forecasting period, indicating that the chosen warm-up time of 35 d is sufficient, as there would be a decline in the error over time if the reservoir were gradually benefiting from more provided information.

The 6-week forecast
Meanwhile, for the 6-week forecast, the forecasted SSTs Y i+1 are input back into the reservoir on the next day, thereby taking the place of the actual SST X i+1 . This effectively allows the model to run autonomously over the validation time frame for a total of 42 d. The 10 time series for the forecasted SSTs are provided below in Fig. 7.
From Fig. 7, it is apparent that the model typically predicts the general change in the SST for locations a, b, c, g, h and j. We find it especially impressive that the model is able to predict the cooling of the SST found at g and j and the warming at h. Meanwhile, the results for locations d, e and f are very poor given that the change in SST is fairly linear in the time leading up to the forecasting period and during it. The results at location i are also poor, as the model is unable to anticipate the rise in SST that began around the commencement of the forecasting period. These forecasts also depict how the intrinsic randomness of the model does play a slight role in the model output as time evolves, but it is an interesting feature that the standard deviation between models appears to reach a limit after several days of forecasting, as seen in Fig. 7d, e, g, h, i and j, and the standard deviation between models even decreases, as seen in Fig. 7c. To quantify how the model performs globally, we now refer to Fig. 8, which depicts the error in the SST forecast (averaged over the 15 models), as well as the MAE, RMSE and maximum error in the forecast described by Fig. 9.

Figure 6. The evolution of the error for the 1 d forecasts. The mean absolute error in the forecast over the entire ocean is depicted in panel (a), the root-mean-square error is shown in panel (b) and the maximum error is presented in panel (c). The model average is represented using a dark blue line, and ±1 standard deviation is represented by the shaded outline above and below the average.
From Fig. 8, it is observed that the model generally performs best near the Equator, with some slight difficulty predicting the general warming of the ocean in the Southern Hemisphere and cooling in the Northern Hemisphere during this time frame. The average MAE and RMSE rise to 0.32 and 0.45 K, respectively, on the seventh day of forecasting, subsequently increasing to respective values of 0.73 and 0.99 K by the end of 4 weeks. For context, for a prediction horizon of 1 week, Xiao et al. (2019b) reported an RMSE of 0.85 K and Shi et al. (2022) reported an RMSE of 0.687 K for their respective models mentioned previously. Meanwhile, for a prediction horizon of 4 weeks, Yang et al. (2018) reported an RMSE of 0.726 K for their CFCC-LSTM (fully connected long short-term memory and convolutional network) model forecasting over the Bohai Sea and an RMSE of 1.070 K for the Chinese coastal waters. Similar to the daily forecasts, it is also apparent that there is little deviation in the RMSE and MAE across the 15 models, once again indicating that the models typically perform similarly regardless of the random weights that they are constructed from.

Conclusions
With the use of coupled reservoir computers, specifically a collection of packs that represent local regions and are designed to accommodate land-water interface variations, we were able to produce excellent forecasts of the spatiotemporally complex dynamics of the global SST over several weeks. The relative simplicity of the network architecture and the minimal training time is striking relative to other machine learning concepts. Even though our model is intended to describe the dynamics of the entire ocean, it is still able to predict the SST at specific locations. In the future, it is of interest to explore the use of next-generation reservoir computers (NG-RCs) in the task of predicting SST, as NG-RCs provide the added benefit of fewer metaparameters to tune compared with a traditional RC (Jaeger and Haas, 2004; Bollt, 2021; Gauthier et al., 2021). It is also of interest to input other variables into the reservoir besides the SST, such as the surrounding air temperature (Jahanbakht et al., 2021), in order to observe if the results can be further improved.
Data availability. The SST dataset, titled "GHRSST Level 4 MUR 0.25 deg Global Foundation Sea Surface Temperature Analysis (v4.2)", used in this study is publicly available online via the Physical Oceanography Distributed Active Archive Center (PO.DAAC, 2019).
Author contributions. EB conceived the idea for the study, and BW was responsible for the creation of the model. BW and EB both contributed to the analyses and the creation of the paper.