Overall, this paper illustrates important points about inadequacies in an
existing data-driven approach (reservoir computing) to modeling complex chaotic systems with intermittencies and couplings to unobserved (multi-scale) processes.
I agree that this work ought to be published, and I have the following recommendations to make
the work clearer and more impactful:
1. Improved notation
2. Simpler plots to accompany the detailed ones
3. A more careful study of the effects of regularization
### Detailed comments ###
Comment on ESNs and memory:
I suggest you distinguish between ESNs and other RNN approaches.
See work by Pantelis Vlachas https://arxiv.org/abs/1910.05266
which suggests that although ESNs have the capability for memory, they often struggle to represent it when compared to fully-trained RNNs.
Note that your references to Shi and Han 2007 and Li et al 2012 all use ESNs with delay-embedding representations.
This essentially defeats the purpose of ESNs, as they are supposed to "learn" such a delay representation internally.
A caveat is that those papers also treat the Mackey-Glass DDE, but this is a very simplistic type of memory that an ESN can easily store.
From the work by Vlachas, others, and my own experiments, I have deep suspicions about whether ESNs can learn memory meaningfully at all.
So, I do wish that all the reported experiments could be repeated with a simple ANN, GP regression, Random Feature Map, or other data-driven function approximator that does NOT have the dynamical structure of RNNs/ESNs.
These other approaches I list are generally much easier to train and tune than ESNs and have a more clear interpretation.
In particular, the authors might be interested in Random Feature Map approaches (see the Gottwald and Reich 2020 treatment of RF-based regression, https://arxiv.org/abs/2007.07383),
as an RFM is essentially an ESN without the recurrence (and is a universal approximator for C^1(R^n, R^n)); a minimal sketch is included below.
I am not requesting an entirely new study with new methods, but a discussion of why ESNs were chosen ought to consider the perspective I outlined above.
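To make the suggested comparison concrete, here is a minimal sketch of a random feature map regressor in the spirit of Gottwald and Reich 2020; the feature dimension D, the ridge parameter lam, and the uniform sampling of the internal weights are my illustrative choices, not values tied to the paper.

```python
import numpy as np

def fit_rfm(U, Y, D=300, lam=1e-6, seed=0):
    """Random feature map regression: lift inputs with a fixed random
    tanh layer, then fit only a linear readout by ridge regression.
    U: (T, n) inputs, Y: (T, m) targets; D and lam are illustrative."""
    rng = np.random.default_rng(seed)
    n = U.shape[1]
    W_in = rng.uniform(-1.0, 1.0, size=(D, n))  # fixed, never trained
    b = rng.uniform(-1.0, 1.0, size=D)
    Phi = np.tanh(U @ W_in.T + b)               # (T, D) random features
    W_out = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)
    return W_in, b, W_out

def step_rfm(u, W_in, b, W_out):
    """One-step map u -> y; iterate for autonomous forecasts.
    Unlike an ESN, there is no recurrent reservoir state."""
    return np.tanh(W_in @ u + b) @ W_out
```

Training reduces to a single linear solve with no recurrence, which is why I find these models considerably easier to tune and interpret than ESNs.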
Equations for ESN:
the "t<T" notation is somewhat unclear...do these indicate a matrix? Perhaps capital letters would be better?
Notes on regularization (line 96):
In what settings were the values of lambda investigated?
Why limit to 10^-8? Perhaps 10^-9 is better?
More importantly, I am quite surprised that there was no effect of lambda on performance, given that many results show a "sweet spot" for the network size: larger networks performing worse is a classic sign of overfitting that can be addressed with increased regularization.
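As a concrete (and purely illustrative) way to check this, one could sweep lambda against a held-out validation set and see whether the error curve is flat or U-shaped; the ranges and variable names below are my own, not the paper's.

```python
import numpy as np

def ridge_readout(Phi, Y, lam):
    """Tikhonov-regularized least squares for the linear readout."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)

def sweep_lambda(Phi_tr, Y_tr, Phi_val, Y_val, lams=10.0 ** np.arange(-12, 1.0)):
    """Validation RMSE per lambda: a flat curve would support the claim
    that lambda has no effect; a clear minimum would indicate that larger
    reservoirs are overfitting and need stronger regularization."""
    errs = {}
    for lam in lams:
        W = ridge_readout(Phi_tr, Y_tr, lam)
        errs[lam] = float(np.sqrt(np.mean((Phi_val @ W - Y_val) ** 2)))
    return errs
```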
Statistical distributional test:
I found this setup rather difficult to read, and the ultimate choice to use Monte Carlo approximations unclear and potentially statistically unsound.
-To begin, it would help to define zeta as a function mapping between two spaces. Where does u live? Where does x live? How does this relate to y?
What is x(t)? This is the first time it appears in the paper (line 108).
Is R_U in R^1? In R^N? A Banach space?
What is meant by the "marginal distribution of the forecast sample"? Marginal over what?
-The Monte Carlo approach also confused me---please clarify this a bit more.
Where is the randomness in the samples coming from? In time? Across initial conditions?
Also, what is the goal of the MC process? Is it to estimate a baseline \Sigma for f vs \hat{f}?
-Throughout the paper, neither Sigma nor phi takes on a physical meaning for me; it seems they are simply used to compare methods based on their ability to reproduce statistical quantities.
The statistical validity of chi-squared does not seem to be important in the paper.
Due to this, I would highly recommend using the Kullback-Leibler divergence instead of the chi-squared metric; it measures the information loss between probability measures and seems better suited to the job in the paper.
However, I understand this may be significant extra work, so perhaps just make a comment that KL could be another option (a minimal sketch of such an estimate is given below)?
Still, I think this section would be much clearer if it were built around the KL divergence.
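For reference, a minimal histogram-based KL estimate between the true and forecast samples could look like the following; the bin count and the small regularizer eps are arbitrary choices of mine.

```python
import numpy as np

def kl_divergence(samples_true, samples_model, bins=50, eps=1e-12):
    """Estimate D_KL(p_true || p_model) on shared histogram bins: the
    information lost when the model's invariant density is used in place
    of the true one (zero iff the binned densities coincide)."""
    lo = min(samples_true.min(), samples_model.min())
    hi = max(samples_true.max(), samples_model.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_true, bins=edges)
    q, _ = np.histogram(samples_model, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```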
-Also, in this section you should give the examples of zeta that are used---1st coordinate and sum of all coordinates. This will help the reader anticipate what is to come.
APE:
-The formulation of s is interesting, as it can equivalently be written as \Delta t times the time average of the derivative \dot{u}.
Would it make sense to then have more stringent demands on divergence when the timestep shrinks?
In the dt -> 0 limit, s -> 0, which seems odd.
Please justify this choice with respect to the chosen sample rate.
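To spell out the concern (this is my reconstruction of s from the text, not the paper's exact equation): if s is the mean absolute increment over one sampling step, then

```latex
s \;=\; \frac{1}{K}\sum_{k=1}^{K} \bigl|\, u(t_k + \Delta t) - u(t_k) \,\bigr|
  \;\approx\; \Delta t \cdot \frac{1}{K}\sum_{k=1}^{K} \bigl|\, \dot{u}(t_k) \,\bigr|
  \;\longrightarrow\; 0 \quad \text{as } \Delta t \to 0,
```

so the divergence threshold tightens as the sampling rate increases, independent of the dynamics themselves.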
-Also, tau_s is not defined (line 148), or at least I could not find the definition.
Moving average filter:
-Perhaps note that the average can be left-, right-, or center-aligned; I suppose you chose the right side of the window (a causal filter) since you want to keep the system Markovian. A minimal sketch of the two variants is below.
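```python
import numpy as np

def moving_average(u, w=5, causal=True):
    """Right-aligned (causal) or centered moving average of a 1-D signal;
    the window length w is arbitrary here. The causal version uses only
    past samples, which is what keeps the filtered state usable in a
    Markovian / online setting."""
    kernel = np.ones(w) / w
    if causal:
        # pad on the left so y[t] depends only on u[t-w+1 .. t]
        padded = np.concatenate([np.full(w - 1, u[0]), u])
        return np.convolve(padded, kernel, mode="valid")
    return np.convolve(u, kernel, mode="same")  # centered window
```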
2.3 Testing ESN
When reading steps 1-8 of the algorithm, I became confused by y(t) vs x(t). Please make explicit what is data/truth and what comes out of the ESN: is the ESN trained on and predicting the full state, or an observable?
Step 5---is the ESN forecast always produced directly after a training trajectory?
This seems like a rather limiting case---it would be better to re-initialize the trained ESN on a new trajectory...how do we know each ESN hasn't overfit to a small region of state space?
Step 7: Why does u depend on times beyond the training window? This "t>T" notation is a bit confusing to me and needs to be defined explicitly.
Step 11: Please define v^f explicitly. Also, why does this equation hold? I see how (10) is true because x^f is defined in terms of x.
But in (11), v^f is defined by y^f (I assume), which is an output of the ESN. Is (11) only approximate? Please clarify.
L63
Line 210: are they i.i.d.?
Line 220: green -> black
Fig 2: Where do the many trajectories come from? Different trained ESNs (on the same or different training sets)?
Fig 3b: For large noise, why do smaller networks work better? Is this an overfitting problem that can be fixed with regularization?
Same question for 3c...if not overfitting, perhaps larger networks need longer time to initialize?
Fig 3 overall: It is hard to concretely connect phi (and Sigma) to the quality of the estimated invariant statistic.
It would help to plot a blue/green/yellow KDE against the true KDE so that we can see what phi is really discriminating between (a plotting sketch is given below).
Finally, I feel these plots could be much simpler if you collapsed over (or fixed) N; this is only possible if you can pick a large enough N and if regularization can prevent the "sweet spot" issues w.r.t. N.
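Something as simple as the following would make that comparison visible (gaussian_kde from scipy; the variable names are placeholders of mine, not the paper's):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_kdes(truth_samples, forecast_samples_by_label):
    """Overlay the KDE of the true long trajectory with the KDEs of the
    ESN forecasts, e.g. one curve per noise level or network size."""
    grid = np.linspace(truth_samples.min(), truth_samples.max(), 400)
    plt.plot(grid, gaussian_kde(truth_samples)(grid), "k-", lw=2, label="truth")
    for label, samples in forecast_samples_by_label.items():
        plt.plot(grid, gaussian_kde(samples)(grid), label=label)
    plt.xlabel("observable")
    plt.ylabel("density")
    plt.legend()
```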
PM map
Great idea to study this system. L63 is very common and often "too easy".
Line 246: This is not a deterministic map, correct? Also, the notation \xi_t would be more consistent than \xi(t).
Fig 4: Why is tau_s better for smaller networks? Again, I hypothesize this is an issue of regularization.
Also, again, this figure feels like information overload to me; if possible, it would be easier to understand when fixed/collapsed over N (with the full version becoming a supplementary figure).
Comment: Is the intermittency you observe driven more by the noise or the deterministic PM map itself?
I wonder if the RC performs poorly due to the randomness of the intermittency (driven by \xi_t), rather than the intermittency itself.
Can the ESN handle the PM map with epsilon = 0, i.e. the purely deterministic map (sketched below)? Why or why not?
This seems like a very simple problem that already breaks the data-driven method.
Further comment on what is wrong or what needs to be studied would be very valuable here.
Also, is the log here the natural or the common (base-10) log?
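For reference, a standard Pomeau-Manneville-type iteration with optional additive noise is sketched below; the exponent gamma, the noise model, and the mod-1 convention are my assumptions and may differ from the form used in the paper.

```python
import numpy as np

def pm_step(x, gamma=0.5, eps=0.0, rng=None):
    """One step of a Pomeau-Manneville-type intermittent map,
        x_{t+1} = x_t + x_t**(1 + gamma)   (mod 1),
    with optional additive noise of amplitude eps; eps = 0 recovers
    the purely deterministic map asked about above."""
    x_next = (x + x ** (1.0 + gamma)) % 1.0
    if eps > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        x_next = (x_next + eps * rng.standard_normal()) % 1.0
    return x_next
```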
L96
In this problem, we finally have unobserved scales (and, hence, memory)---this brings up the challenge of memory I mention earlier and the work of Vlachas et al.
I would be very curious whether you'd see different results with a vanilla ANN; alternatively, I wonder if LSTM (which seems to have more hope of retaining memory) would perform differently.
Comments on this would be welcome.
Lines 304-305: Please clarify this statement.
Fig 6: Excellent Figure!!!
Fig 7: My main complaints arise again here:
1) The dependence on N feels secondary to the main point of this plot, which compares performance for different c, h and X vs X,Y.
So, again, perhaps fix an N and show a box plot?
2) Point 1 can only be done safely if the N-dependence is simpler, which, again, I hypothesize can be addressed via regularization.
If not, please show this. Currently the results are confusing, as they show many performance metrics worsening for larger networks; this is in line with an overfitting hypothesis.
3) Showing the actual KDEs will bring much more light to the differences in Sigma.
Also, are APE and Sigma normalized with respect to dimensionality? Their definitions do not suggest so, and thus I wonder whether the X vs X,Y results are fair comparisons. Please double-check and clarify in the text (one possible normalization is sketched below).
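To make the fairness concern concrete, one possibility (my suggestion, not the paper's definition) is to report errors per observed coordinate,

```latex
\mathrm{APE}_{\mathrm{norm}} \;=\; \frac{1}{d}\sum_{i=1}^{d} \mathrm{APE}(u_i),
\qquad
\Sigma_{\mathrm{norm}} \;=\; \frac{1}{d}\,\Sigma,
```

where d is the number of observed coordinates (d = dim X, or dim X + dim Y) and the second expression assumes Sigma is accumulated over coordinates; without some such normalization, the X vs X,Y comparison mixes dimensionality into the score.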
Fig 8: This figure is illuminating, as it shows the explicit dynamics. The evaluation metrics are excellent for high-throughput comparisons, but their meanings can get lost.
How should we interpret Fig 8?
Is this better/worse than other methods? Is this a pathologically good/bad example? Or is it average?
For contrast, Vlachas et al. seems to have much more realistic looking ESN fits.
NCEP
Fig 9: How can Sigma blow up w.r.t. N while tau_s and eta stay stable?
Is there an issue with long-time stability of the ESNs that is not shown? Vlachas et al. reported on this issue.
## other ##
Line 73: should be "Finally, we"
Line 284: "the the"
--Matthew Levine |