Inverting Rayleigh surface wave velocities for crustal thickness in eastern Tibet and the western Yangtze craton based on deep learning neural networks

Crustal thickness is an important factor affecting lithospheric structure and deep geodynamics. In this paper, we apply a deep learning neural network, the stacked sparse auto-encoder, to obtain crustal thickness for eastern Tibet and the western Yangtze craton. First, taking phase and group velocities of Rayleigh surface waves simultaneously as inputs and theoretical crustal thickness as outputs, we construct twelve deep neural networks trained on 70,000 and tested on 30,000 theoretical models. We then invert the observed phase and group velocities with these twelve neural networks. Based on test errors and misfits with other crustal thickness models, we select the best-performing network to give the crustal thickness of the study area. In contrast to other ways of determining crustal thickness, such as seismic reflection and receiver functions, we adopt a new approach to the inversion of earth model parameters. We find that data-driven deep learning neural networks, with their highly nonlinear mapping ability, can be widely used in geophysical inversion, and our result agrees well with high-resolution crustal thickness models. Compared with other methods, our experimental results reveal more details: there is a northward-dipping Moho gradient zone in the Qiangtang block, and relatively shallow crust with a northwest-southeast orientation in the Songpan-Ganzi block. The crust around Xi'an and the Ordos basin is thin, about 35 km. Crustal thickness changes sharply in the Sichuan-Yunnan block, from 60 km in the northwest to 35 km in the southeast. We conclude that the deep learning neural network is a promising, efficient and reliable tool for geophysical inversion.


Conversion from group velocity to phase velocity
The authors calculate group velocity from a published phase velocity map using the standard formula (their equation 4). However, including both phase and group velocity will only add new information if the phase and group velocity are measured independently (as is commonly the case). Therefore it is misleading to include the calculated group velocity in this paper. The group velocity data should be removed from the study or replaced by group velocity data measured independently. (Generally phase velocity is more sensitive to deeper structure so it is easier to infer deep structure from phase velocity measurements.)
In the revision we do not adopt the calculated group velocity, and we retrain our neural network on phase velocity only.

Benefit of deep neural network versus shallow neural network
A deep neural network is one with more than one hidden layer, whereas a shallow neural network has just one hidden layer. The additional complication of using a deep neural network is justified if the mapping has a hierarchical structure. For example, in image processing, it is common to move from the more elementary aspects of the input data (e.g. the values of the individual pixels) to intermediate parts (such as the distribution of edges) and finally to the most abstract aspects (such as the subject of the image). While it is undoubtedly true that the Earth has a hierarchical structure, ranging from individual grains to entire continents, the authors do not demonstrate that the dispersion data contain sufficiently complicated information to justify a deep neural network. The paper does not currently demonstrate that a deep neural network offers any improvement over a shallow neural network, such as that used by Meier et al. (2007). A comparison should be given.
We retrain our neural network and find that networks with more hidden layers reach lower test errors than a shallow neural network does, as shown in Table 1.

Non-unique solutions
The authors focus on the non-linearity of the inverse problem, but they do not mention that it is also non-unique. Conventional optimisation of a neural network can lead to meaningless outputs for a non-unique mapping, as shown in figure 3b of Meier et al. (2007). Ideally, the method should be changed to solve for a probability distribution, for example using histogram or median networks (Devilee et al., 1999) or a mixture density network (Meier et al., 2007). Otherwise, the authors should attempt to quantify the range of non-uniqueness, or at least mention it in their discussion.
We have not considered the uncertainty of crustal thickness, which should be revealed in a probabilistic manner by a deep mixture density network in our future work.

Unattributed quotations
Some explanatory sections are taken verbatim from other work, for example the paragraph beginning at 3.2:19 is identical to the second paragraph of section 3 of de Wit et al. (2014). These sections should be attributed, and either paraphrased or written in quotation marks.
In the revision we place in quotation marks the paragraph that is identical to de Wit et al. (2014).

Meaning of 'data-driven'
It is misleading to say that the method is 'data-driven' (e.g. lines 1:9-11). The inversion is model-driven; it is trained using a large number of synthetic data which are generated using a known forward mapping (in this case, the calculation of dispersion by normal mode summation). The role of the neural network is to approximate the inverse relation apparent in the synthetic dataset. The description 'data-driven' is appropriate when the forward mapping is not known (or not used). An example would be speech processing, where the meaning of a word cannot be calculated from its audio waveform.
Our manuscript addresses an inverse problem, so 'data-driven' in the manuscript means that we have no knowledge of the inverse relationship, although the forward mapping is known. That is, we have no model describing how to infer crustal thickness from phase velocity. We therefore consider this a data-driven problem.

Lateral resolution of crust thickness
Figures 7 and 8 show a comparison of the crust thickness model in this study with the crust thickness model in Xie et al. (2013). Although the two models are based on the same data, the result in this study appears to resolve much finer features. The authors should explain how this higher resolution is achieved and whether it is justified.
In the discussion we note that our result resolves much finer features than other models, and that these finer features are consistent with Wang et al. (2010), who estimated crustal thickness by the H-k stacking method from broadband teleseismic data. We think this higher resolution is achieved because the deep sSAE works very well in learning useful high-level features that better represent the raw input data.

Corrections to the writing
There are some errors in the writing, but I have not listed them in detail, in the expectation that the body of the text will change.
We have checked the English sentence by sentence and uploaded the revised manuscript.

For referee 2:

Comments and Suggestions Response
1) The paper has not given the reasons for selecting eastern Tibet and the western Yangtze craton, or which problems this study solves.
We add the reasons for selecting eastern Tibet and the western Yangtze craton in the revised paper, page 1, lines 29 to 36.
2) What is the theory behind using the sSAE to invert the phase and group velocities of Rayleigh waves for crustal thickness? What are the details of obtaining the dispersion data and phase velocities, and of combining them for the sSAE inversion?
(1) The theory behind using the sSAE to invert the phase and group velocities of Rayleigh waves for crustal thickness is to find the relationship between the two variables by machine learning. A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. First, taking theoretical phase velocities with random noise as inputs and theoretical crustal thickness as outputs, we train the deep learning neural networks. Then we feed the observed phase velocities into the trained neural network and obtain an output, which is taken as an estimate of the real crustal thickness.
(in which c is phase velocity; n is radial order; l is angular order; ω is eigenfrequency; α is the radius of the earth.) The observed dispersion is taken from Xie et al. (2013) and comes from ambient noise: Rayleigh wave phase speed measurements are obtained from vertical-vertical (Z-Z) cross-correlations of vertical-component ambient noise.
3) How to understand the inverted results for eastern Tibet and the western Yangtze craton? The geological background needs to be added.
The comparison crustal thickness model CRUST2.0 adopted in this paper is based on refraction and reflection seismics as well as receiver function studies; the comparison result is shown in the middle of Figure 8.
5) How to understand Table 1?
Table 1 lists the test errors of 11 different neural networks and, for each network, the comparison between our model and other models.

6) What is the difference between the results by sSAE and by other methods? Not just the similarity.
Compared with other crustal thickness models, our result reveals more details, which are discussed in the paper on page 10, lines 9 to 20. We also add the differences to the abstract, lines 19 to 24.

Introduction
Eastern Tibet and the western Yangtze craton form one of the key areas for understanding the collision process between the Indian and Eurasian plates, and an important area for understanding the contact relationship between the Qinghai-Tibet Plateau and the Yangtze craton. The region has always been a focus of earth science research because of its strong seismic activity, the different nature of the two blocks, and especially its special topography: the altitude changes abruptly from about 5000 meters in eastern Tibet to about 500 meters in the western Yangtze craton. Many studies focus on the crust and upper mantle structure of this region, and crustal thickness in particular has been the subject of heated debate. The discontinuity between crust and mantle, called the Moho discontinuity, varies greatly over small length scales and is an important factor in geodynamics, including crustal evolution and tectonic activity, as well as in correcting gravity for crustal effects, seismic tomography and geothermal modeling. Many studies aim to obtain the depth of the Moho discontinuity, i.e. the crustal thickness, from various data and with different methods.
Crustal thickness can be inverted from many types of data. For instance, deep seismic sounding profiles have been inverted for the crustal thickness of the Chinese continent (Zeng et al., 1995), satellite gravity data for global crustal and lithospheric thickness (Fang et al., 1999), Bouguer gravity and topography data for the crustal thickness of China and its surrounding areas (Huang et al., 2008; Guo et al., 2012), and receiver functions for the crustal thickness and Poisson's ratio of the Chinese continent (Chen et al., 2010; Zhu et al., 2012; Xu et al., 2007). In particular, the recent crustal model CRUST1.0 at 1°×1° (Laske et al., 2013; Stolk et al., 2013) is based on refraction and reflection seismology as well as receiver function studies. Besides the data mentioned above, crustal thickness has significant effects on fundamental-mode surface waves (Meier et al., 2007; Grad et al., 2009). The dispersion of surface waves provides a powerful tool for studying the structure of the crust and upper mantle (Legendre et al., 2015). So far, phase and group velocity measurements of fundamental-mode surface waves have most commonly been used to constrain shear-velocity structure in the crust and upper mantle on a global scale (Zhou et al., 2006; Shapiro & Ritzwoller, 2002) or on a regional scale (Zhu et al., 2002; Zhang et al., 2011; Yi et al., 2008), and the newly developed ambient noise surface wave tomography has also been used to constrain shear-velocity structure (Sun et al., 2010; Yao et al., 2006; Zheng et al., 2008; Zhou et al., 2012). Only a few works invert fundamental-mode surface wave data for crustal thickness and present a global or regional crustal thickness model (Devilee et al., 1999; Meier et al., 2007; Das & Nolet, 2001; Lebedev et al., 2013). Although group and phase velocities are measured with different methods over different periods and probe different depths, phase velocity is more sensitive to deeper structure, so it is easier to infer deep structure from phase velocity measurements. We therefore take phase velocity as the input to infer crustal thickness.
There are several inverse methods for obtaining crustal thickness, and they can be broadly classified into two classes: (1) model-driven methods and (2) data-driven methods. Model-driven methods mainly use the physical relation between the earth parameter space and the data space to calculate the inverse function. Most model-driven methods treat crustal thickness inversion as a linear problem and, most importantly, their results depend heavily on the initial earth model. In contrast, a fully non-linear data-driven method, the neural network, has been put forward to obtain crustal thickness (Devilee et al., 1999; Meier et al., 2007). Neural networks, with their highly nonlinear mapping ability, are widely used in data-driven geophysical inversion, which applies actual seismic data, logging data and their attributes to predict earth parameters. Compared with model-driven inversion, data-driven inversion maps and predicts an arbitrary nonlinear relationship quickly and accurately without considering the physical relations between earth model parameters and the data space. Neural networks can be very useful in situations where the forward relation is known but the inverse mapping is unknown or difficult to establish by more conventional analytical or numerical methods (de Wit et al., 2013). The target of neural network inversion is thus to find the mapping from a set of training data. Neural networks have been widely used in different geophysical applications, well summarized by van der Baan & Jutten (2000), such as electrical impedance tomography (Lampinen & Vehtari, 2001) and seismic processing, including trace editing, travel time picking, horizon tracking and velocity analysis. Devilee et al. (1999) were the first to use a neural network to invert surface wave velocities for Eurasian crustal thickness in a fully non-linear and probabilistic manner. Meier et al. (2007) further developed the methods of Devilee et al. (1999) and inverted surface wave data for crustal thickness on a global 2° × 2° grid using a neural network.
Because many factors affect phase and group velocity, inverting phase and group velocity for discontinuities within the earth is a non-linear inverse problem (Meier et al., 2007). Because of the strongly non-linear relation between crustal thickness and surface wave dispersion, the problem cannot be treated as a linear inverse problem, as Montagner & Jobert (1988) stated. Although a shallow neural network, with fewer hidden layers, can represent a nonlinear inverse function, it may not learn or approximate the true inverse function well when that function is too complicated. In contrast, a deep learning neural network can overcome this defect, since it has powerful representation abilities and can discover intricate structure in large data sets by using the back-propagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer (LeCun et al., 2015).
In this paper, considering the advantages and characteristics of deep learning neural networks, we introduce a new, fast, data-driven inverse method, the deep stacked Sparse Auto-encoder (sSAE) neural network, to solve nonlinear geophysical inverse problems. We focus on deep learning neural networks for the non-linear inverse problem and apply them to retrieve the crustal thickness of eastern Tibet and the western Yangtze craton from the newest high-resolution phase and group velocity maps. Based on normal mode theory, we compute phase and group velocities for sampled radially symmetric earth models to generate 100,000 theoretical models. We take theoretical phase and group velocities of Rayleigh surface waves with random noise as inputs, to enhance the robustness of the neural networks, and the corresponding theoretical crustal thickness as outputs. We construct twelve deep neural networks trained on 70,000 and tested on 30,000 synthetic models. We then invert the observed phase and group velocities with these twelve neural networks. Based on test errors and misfits with other crustal thickness models, we select the best-performing network to give the crustal thickness of the study area.
To the best of our knowledge, we are the first to introduce deep learning neural networks to learn and invert crustal thickness, and our result reveals that crustal thickness is strongly nonlinear with respect to phase and group velocity. The merits of our method are as follows. First, since deep learning neural networks can represent complex functions, it is possible to learn the inverse function for crustal thickness precisely. Second, inverse mapping based on a neural network is highly efficient, because new observations can be inverted instantaneously once well-trained deep learning neural networks with multiple hidden layers are constructed. Moreover, our deep learning neural networks are trained on a large number of synthetic models. Last, our results show that changing the number of neurons in each layer has little influence on test errors once the number of hidden layers reaches six, where test errors are about 24.5e-6; this indicates that deep learning neural networks with a suitable number of layers are robust to the network structure. In what follows, we first give a short introduction to deep learning neural networks.

Deep Learning Neural Networks
In geophysics the true inverse function between data space and model space is usually very complicated. Traditional linear inverse methods, which treat the true inverse function as linear, can resolve linear problems. However, they depend on the physical relationships between the two parameter spaces and on initial earth models. Neural networks have their origins in attempts to find mathematical representations of information processing in biological systems (Bishop, 1995). The deeper an Artificial Neural Network (ANN) is, the greater its capability to learn complex, non-linear underlying relationships without any a priori knowledge of the model (Bengio, 2009). Shallow neural networks have gained popularity in geophysics over the last decade and have been applied successfully to a variety of problems such as well-log and seismic data interpretation and geophysical inversion. Although a shallow neural network can represent a nonlinear inverse function, it can only learn a relatively simple one. In contrast, many research results indicate that deep learning neural networks have powerful representation ability and can use large geophysical data sets to learn and approximate a complicated inverse function well (LeCun et al., 2015; Bengio et al., 2006; Liu et al., 2015).
Based on the analysis above, we design a deep learning neural network to obtain crustal thickness for eastern Tibet and the western Yangtze craton. Compared with shallow neural networks, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction, and hence to learn complex functions. The essence of deep learning is to build an artificial neural network with a deep structure that simulates how the human brain analyzes and interprets data such as images, speech and text. However, many research results suggest that gradient-based training of a deep neural network gets stuck in apparent local minima, which leads to poor results in practice (Bengio, 2009). Fortunately, the greedy layer-wise training algorithm proposed by Hinton et al. (2006) effectively overcomes the optimization difficulty of deep networks. The training of a deep neural network is divided into two steps. First, unsupervised learning is employed to pre-train the parameters of each layer, with the output of the previous layer as input, which gives the initial parameter values. After that, a gradient-based method is used to fine-tune the parameters of the whole network with respect to a supervised learning criterion, as usual. The advantage of unsupervised pre-training at each layer is that it helps guide the parameters of that layer towards better regions of parameter space (Bengio, 2009). There are multiple types of deep learning neural network, such as convolutional neural networks, deep belief nets and stacked Sparse Auto-encoders (sSAE).
sSAE works very well in learning useful high-level features that better represent the raw input data. Since the sSAE learning algorithm can automatically learn even better feature representations than hand-engineered ones, sSAE is widely used in domains such as computer vision, audio processing and natural language processing (Hinton, 2006; Deng et al., 2013). Similarly, we need to extract a representation of earth features from the dispersion of surface waves.
Here we briefly introduce the Sparse Auto-encoder; a detailed description of the network training method is given by Liu et al. (2015).
An sSAE is built by stacking sparse auto-encoders to extract abstract features. A typical Sparse Auto-Encoder (SAE) can be seen as a neural network with three layers, as shown in Figure 1: one input layer, one hidden layer and one output layer. The input vector and the output vector are denoted by $v$ and $\hat{v}$, respectively. The matrix $W$ is associated with the connection between the input layer and the hidden layer. Similarly, the matrix $\hat{W}$ connects the hidden layer to the output layer. The vectors $b$ and $\hat{b}$ are the bias vectors associated with the units in the hidden layer and the output layer, respectively. The SAE is trained to encode the input vector $v$ into some representation so that the input can be reconstructed from that representation. Let $f(x)$ denote the activation function; the activation vector of the hidden layer is then calculated (with an encoder) as:
$$z = f(Wv + b) \qquad (1)$$
where $z$ is the encoding result, a representation of the input $v$. The representation $z$, or code, is then mapped back (with a decoder) into a reconstruction $\hat{v}$ of the same shape as $v$. The mapping happens through a similar transformation, e.g.:
$$\hat{v} = f(\hat{W}z + \hat{b}) \qquad (2)$$
Figure 1. An auto-encoder with one hidden layer (Liu et al., 2015).
SAE is an unsupervised learning algorithm that sets the target values equal to the inputs and constrains the outputs of the hidden layer to be near zero, so that most hidden units are inactive. The cost function is expressed as:
$$J_{\mathrm{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{S_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (3)$$
Here $J(W, b)$ is the cost function without the sparsity constraint, $\beta$ controls the weight of the sparsity penalty term, $S_2$ is the number of neurons in the hidden layer, and the index $j$ sums over the hidden units in our network. $\hat{\rho}_j$ is the average activation of hidden unit $j$, $\rho$ is a sparsity parameter, typically a small value close to zero, and $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between Bernoulli variables with means $\rho$ and $\hat{\rho}_j$.
Further, a stacked Sparse Auto-Encoder (sSAE) is a neural network consisting of multiple layers of SAEs, stacked to form a deep neural network by feeding the representation found by the SAE on the layer below as input to the current layer. Using unsupervised pre-training, each layer is trained as an SAE by minimizing the error in reconstructing its input, which is the output code of the previous layer. After all layers are pre-trained, we add a logistic regression layer on top of the network and then train the entire network by minimizing the prediction error, as we would train a traditional neural network. For example, an sSAE with two hidden layers is shown in Figure 2. This sSAE is composed of two SAEs. The first SAE consists of the input layer and the first hidden layer, and the representation or code of the input $v$ is $h_1 = f(W_1 v + b_1)$. The second SAE comprises the two hidden layers, and the code of $h_1$ is $h_2 = f(W_2 h_1 + b_2)$. A decoder layer is added to each SAE as shown in Figure 1, and we can then employ unsupervised pre-training to train each SAE by expression (1). In this way the matrices $W_1$, $W_2$ and the bias vectors $b_1$ and $b_2$ are initialized. We then apply supervised fine-tuning to train the entire network. Since our aim is to calculate crustal thickness, which is a regression problem, we attach a layer fully connected to the last layer of the encoder part (the matrix $W_s$). After that, we train this network as a traditional neural network.
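As an illustration of the encoder, decoder and sparsity penalty described in this section, the following is a minimal pure-Python sketch, not the implementation used in this study. It assumes a sigmoid activation, a squared reconstruction error for J(W, b) without weight decay, and illustrative default values rho = 0.05 and beta = 3.0; all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def encode(W, b, v):
    """One layer f(W v + b) with a sigmoid f (assumed activation)."""
    return [sigmoid(a) for a in affine(W, b, v)]

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_cost(W, b, W_dec, b_dec, batch, rho=0.05, beta=3.0):
    """Mean reconstruction error plus beta times the summed KL penalty."""
    codes = [encode(W, b, v) for v in batch]
    recons = [encode(W_dec, b_dec, z) for z in codes]  # decoder has the same form
    recon_err = sum(0.5 * sum((r - x) ** 2 for r, x in zip(rec, v))
                    for rec, v in zip(recons, batch)) / len(batch)
    # Average activation of each hidden unit over the batch.
    rho_hat = [sum(z[j] for z in codes) / len(batch) for j in range(len(W))]
    penalty = sum(kl_bernoulli(rho, r) for r in rho_hat)
    return recon_err + beta * penalty, rho_hat

def stacked_code(layers, v):
    """Feed v through a stack of encoders: h1 = f(W1 v + b1), h2 = f(W2 h1 + b2), ..."""
    for W, b in layers:
        v = encode(W, b, v)
    return v

# Tiny example: 3 inputs, 2 hidden units, 2 training vectors.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b = [0.0, 0.0]
W_dec = [[0.2, 0.1], [-0.3, 0.2], [0.1, 0.0]]
b_dec = [0.0, 0.0, 0.0]
batch = [[0.2, 0.5, 0.9], [0.7, 0.1, 0.3]]
cost, rho_hat = sparse_cost(W, b, W_dec, b_dec, batch)
```

In greedy layer-wise pre-training, a cost of this kind would be minimized for one layer at a time, with the codes of the trained layer becoming the inputs of the next.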

Inverting surface wave data for crustal thickness
As Meier et al. (2007) demonstrated, the neural network approach to solving inverse problems is best summarized by three major steps, as shown in Figure 3: (1) Forward problem. We randomly sample the model space and solve the forward problem for all visited models based on seismic normal mode theory. (2) Designing a neural network structure. Taking phase and group velocities with random noise as inputs and theoretical crustal thickness as outputs, we train the deep learning neural networks and select an optimized one. (3) Inverse problem. Based on the trained networks, we invert the observed phase and group velocities for crustal thickness.
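The data-generation part of steps (1) and (2) can be sketched as follows. This is only an illustration under stated assumptions: `toy_forward` is a hypothetical stand-in for the normal-mode solver (it is not Mineos and has no physical validity), and the uniform prior range, noise level and period list are placeholder values rather than those used in this study.

```python
import random

def toy_forward(thickness_km, periods):
    """Hypothetical stand-in for the forward solver (NOT Mineos):
    velocity rises with period and falls with crustal thickness."""
    return [3.0 + 1.0 * T / (T + thickness_km) for T in periods]

def make_dataset(n_models, periods, prior=(20.0, 80.0), noise_std=0.02, seed=0):
    """Draw models from a uniform prior and return (noisy dispersion, thickness) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_models):
        h = rng.uniform(prior[0], prior[1])            # sample the model space
        clean = toy_forward(h, periods)                # solve the forward problem
        noisy = [c + rng.gauss(0.0, noise_std) for c in clean]  # add random noise
        data.append((noisy, h))
    return data

def split(data, train_fraction=0.7):
    """Split into training and test sets, e.g. 70,000 / 30,000 of 100,000 models."""
    n_train = int(round(train_fraction * len(data)))
    return data[:n_train], data[n_train:]

periods = [30.0, 40.0, 60.0, 80.0, 100.0]   # seconds, placeholder values
train_set, held_out = split(make_dataset(1000, periods))
```

Adding the noise at generation time means the network never sees a perfectly clean dispersion curve, which is the mechanism by which noise contamination is expected to improve robustness.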
In what follows we show how to train an sSAE deep learning neural network to model surface wave dispersion based on synthetic seismograms, and then invert dispersion curves with the trained networks. Finally we compare our crustal model with other crustal thickness models and discuss the geodynamic consequences implied by our model.
2014), which is based on the Preliminary Reference Earth Model (PREM; Dziewonski and Anderson, 1981) and is parameterized on a discrete set of 185 grid points used by the Mineos package (Masters et al., 2014). In addition, the models we obtain show no correlations between physical parameters such as velocity, density and attenuation profiles. Following the model parameterization described above, we generate 100,000 synthetic models based on the 1-D reference model PREM, randomly drawn from the prior model distribution; the prior ranges for the various parameters in our model are given in tables A.2-A.4 of de Wit et al. (2014). We use the Mineos package to compute phase and group velocities of fundamental-mode Rayleigh waves for all 100,000 synthetic 1-D earth models. As for the observational data used in the inversion stage below, phase velocities are more sensitive to deep structure than group velocities. We take the Rayleigh wave phase velocities from ambient noise (Xie et al., 2013), shown in Figure 4 averaged from 10 to 35 mHz, as inputs for our neural networks.
As for the observational data used in the inversion stage below, group and phase velocities carry the same information, although group velocities are more sensitive to shallow structure. Since a larger part of the signal is affected by the crustal structure, combining the two types of data constrains crustal thickness better in the presence of noise (Devilee et al., 1999). The two are related by
$$U = \frac{C}{1 + \dfrac{T}{C}\dfrac{dC}{dT}} \qquad (4)$$
where $U$ denotes group velocity, $C$ denotes phase velocity and $T$ is period. Based on the Rayleigh wave phase velocity from ambient noise (Xie et al., 2013), shown in Figure 4 averaged from 10 to 35 mHz, we compute the corresponding group velocity according to (4), shown in Figure 5 averaged from 10 to 30 mHz.
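For reference, the standard relation between group and phase velocity, U = C / (1 + (T/C) dC/dT), can be evaluated numerically from a sampled phase velocity curve. The sketch below assumes this standard form and approximates dC/dT by finite differences (central in the interior, one-sided at the ends); the function name and sample values are illustrative only.

```python
def group_velocity(periods, phase_vel):
    """Group velocity U = C / (1 + (T / C) * dC/dT) at each sampled period.

    periods and phase_vel are equal-length lists with at least two samples;
    dC/dT is estimated by finite differences between neighbouring samples.
    """
    n = len(periods)
    result = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        dC_dT = (phase_vel[hi] - phase_vel[lo]) / (periods[hi] - periods[lo])
        result.append(phase_vel[i] / (1.0 + periods[i] / phase_vel[i] * dC_dT))
    return result

# Illustrative curve: phase velocity (km/s) increasing with period (s).
T = [30.0, 40.0, 50.0, 60.0]
C = [3.5 + 0.01 * t for t in T]
U = group_velocity(T, C)   # each U is smaller than the corresponding C
```

Because U is a deterministic function of C and its derivative, group velocities computed this way add no independent measurement information to the phase velocities they are derived from.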

Training the sSAE deep learning neural network
Given a set of corresponding input-output examples, artificial neural networks can approximate an arbitrary non-linear function and thus solve a non-linear inverse problem. These examples are presented to a network in a so-called training process, during which the free parameters of the network are modified to approximate the function of interest (de Wit et al., 2014). Adopting the sSAE deep learning neural network described in section 2 above, we pre-train the neural network, taking theoretical group and phase velocities of Rayleigh waves with random noise as inputs, to obtain the initial weights and biases. We then take theoretical group and phase velocities of Rayleigh waves with random noise as inputs and crustal thickness as outputs to fine-tune the network, as in a traditional neural network.
Finding a satisfactory neural network structure is difficult because neural network training is sensitive to the random initialization of the network parameters. Therefore, as de Wit et al. (2014) pointed out, it is common practice to train several neural networks with different initializations and subsequently choose the network that performs best on a given synthetic test data set; that network is then used to draw inferences from the observed data. After many trials, we find that a ratio of training data to test data of 3:1 is reasonable (Figure 5). The final test errors depend not only on the neural network structure, determined by the number of input neurons, hidden layers and neurons in the middle layers, but also on optional parameters such as the number of training epochs and the batch size. In addition, the type of activation function, the learning rate, the zero masked fraction and the non-sparsity penalty can affect the final test errors. We give twelve cases and their corresponding test errors in Table 1.
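The practice of training several networks from different random initializations and keeping the one with the lowest test error can be sketched as below. To stay short, the example swaps in a one-parameter-pair linear model trained by stochastic gradient descent in place of an sSAE; the model, learning rate and function names are assumptions made purely for illustration.

```python
import random

def train_linear(train_set, seed, epochs=50, lr=0.1):
    """Fit y = w * x + c by stochastic gradient descent from a random start."""
    rng = random.Random(seed)
    w, c = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # random initialization
    for _ in range(epochs):
        for x, y in train_set:
            err = (w * x + c) - y
            w -= lr * err * x
            c -= lr * err
    return w, c

def mean_squared_error(model, data):
    """Mean squared prediction error of a (w, c) model on (x, y) pairs."""
    w, c = model
    return sum(((w * x + c) - y) ** 2 for x, y in data) / len(data)

def select_best(train_set, held_out, seeds):
    """Train one model per seed and return the one with the lowest test error."""
    models = [train_linear(train_set, s) for s in seeds]
    errors = [mean_squared_error(m, held_out) for m in models]
    best = min(range(len(models)), key=errors.__getitem__)
    return models[best], errors

# Synthetic example: y = 2x + 1, alternating points for training and testing.
points = [(i / 20.0, 2.0 * i / 20.0 + 1.0) for i in range(21)]
best_model, all_errors = select_best(points[::2], points[1::2], seeds=[0, 1, 2])
```

Selecting on test error alone can favour a network that happens to fit the synthetic test set well, which is one reason to also compare against independent crustal thickness models.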

Inverting crustal thickness
Based on all twelve of our neural networks, we invert Rayleigh phase velocities (10~35.0 mHz) and group velocities (10~30.0 mHz) to obtain twelve crustal thickness models for eastern Tibet and the western Yangtze craton. Considering not only the test errors of the sSAE networks but also the misfits and correlation coefficients of our twelve models with crustal thickness models from other studies, we select the network structure marked ※ in Table 1. The best-fit crustal thickness model from sSAE is shown in Figure 6. We compare our model with the crustal thickness model from receiver functions (Zhu et al., 2012) and with two global crustal thickness models: CRUST2.0 from Bassin et al. (2000), based on refraction and reflection seismics as well as receiver function studies, and the CUB2 model from Shapiro & Ritzwoller (2002) (Figure 7), who inverted a similar data set for crustal thickness using a Monte Carlo approach in the same region. The correlation coefficients and scatter plots of our model versus ZJS, CRUST2.0 and CUB2 (Figure 8) indicate overall agreement between the models. However, the agreement of our model with CUB2 and CRUST2.0 is better than with ZJS, since model ZJS attained from Zhu et al. (2012)
In this article, we fixed the following three parameters in every case: A, type of activation function (sigma); B, learning rate (1); C, zero masked fraction (0.5). Varied parameters: D, non-sparsity penalty, which is zero except for layer 1 in every sSAE structure; E, number of epochs; F, size of batch. G, RMS misfit of our result with the other model; H, correlation coefficient of our result with the other model. ※, selected sSAE neural network structure.
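The two comparison measures reported in Table 1, the RMS misfit (G) and the correlation coefficient (H) between two crustal thickness models, can be computed as in the sketch below. It assumes both models are sampled on the same grid points and flattened into equal-length lists, and that the correlation is the Pearson coefficient; the sample values are invented for illustration.

```python
import math

def rms_misfit(a, b):
    """Root-mean-square difference between two models on the same grid (km)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def correlation(a, b):
    """Pearson correlation coefficient between two models on the same grid."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Invented thickness values (km) at three grid points.
ours = [40.0, 50.0, 60.0]
other = [42.0, 48.0, 60.0]
misfit = rms_misfit(ours, other)
corr = correlation(ours, other)
```

The two measures are complementary: a constant offset between models raises the RMS misfit but leaves the correlation coefficient unchanged.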

Discussion
On the one hand, our results show that deep learning neural networks can invert for crustal thickness effectively owing to their capability to represent complex inverse functions: a deep neural network can offer an improvement over a shallow neural network, as shown in Table 1. The test error of a deep learning neural network is influenced by the number of hidden layers, with more hidden layers giving smaller test errors: in Table 1, when the number of hidden layers increases from three to six, the test error decreases from 2.6e-4 to 6.0e-6. In addition, deep learning neural networks are robust. Once the number of hidden layers reaches six, changing the number of neurons in each layer has little influence on the test error, which is about 5.5e-6.
We also find that different training parameters affect the training results differently. Table 1 shows that the size of batch is more important than the number of epochs: when the size of batch decreases from 1e4 to 1e3, the test error decreases from 2.6e-4 to 7.9e-5, whereas when the number of epochs increases from 10 to 100, the corresponding test errors change only a little. For the neural network structure marked ※ in Table 1, the misfits of our model with models CUB2, CRUST2.0 and ZJS are relatively low (6.75, 6.70 and 8.0), and the corresponding correlation coefficients are relatively high (0.8, 0.82 and 0.69, respectively); however, its test error of 7.22e-6 is not the minimum. This tells us that the test error may not be the only criterion for determining which neural network is best, because a small test error may result from overfitting.
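One way to read the batch size and epoch settings is to count how many mini-batch weight updates each setting implies. The sketch below is only that arithmetic, assuming the 70,000 training models stated in the abstract and one update per mini-batch; the function name is hypothetical, and the count alone is not claimed to explain the test errors in Table 1.

```python
import math


def gradient_updates(n_train, batch_size, epochs):
    """Number of mini-batch weight updates in a full training run."""
    return math.ceil(n_train / batch_size) * epochs


N_TRAIN = 70_000  # number of training models stated in the abstract

# (size of batch, number of epochs) pairs discussed in the text
for batch_size, epochs in [(10_000, 10), (1_000, 10), (10_000, 100)]:
    print(batch_size, epochs, gradient_updates(N_TRAIN, batch_size, epochs))
```

Under these assumptions the three settings give 70, 700 and 700 updates. Since the text reports a clear effect for the size of batch but little effect for the number of epochs, the update count is evidently not the only factor, and each update with a smaller batch also uses a noisier gradient estimate.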
Compared with the work of Meier et al. (2007), we add random noise to the synthetic phase velocities used as inputs during training in order to enhance the robustness of the neural networks. However, we have not yet considered the uncertainty of crustal thickness, which should be estimated in a probabilistic manner with a deep mixture density network in our future work.
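Adding random noise to synthetic phase velocities before training can be sketched as follows. This is a minimal illustration that assumes zero-mean Gaussian noise scaled to a fixed fraction of each velocity value; the noise distribution, the 1% level, the function name and the sample dispersion curve are all assumptions, since the text does not specify them.

```python
import random


def add_noise(curve, rel_sigma, rng):
    """Return a noisy copy of a dispersion curve.

    Each sample gets zero-mean Gaussian noise whose standard deviation
    is rel_sigma times the sample value.
    """
    return [v + rng.gauss(0.0, rel_sigma * v) for v in curve]


# Hypothetical phase velocities (km/s) at 10.0, 12.5, ..., 35.0 mHz (11 samples).
curve = [4.00, 3.95, 3.90, 3.86, 3.82, 3.78, 3.75, 3.72, 3.70, 3.68, 3.66]

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
noisy_copies = [add_noise(curve, 0.01, rng) for _ in range(3)]
```

Each noisy copy keeps the crustal thickness label of the clean curve, so the network sees several slightly different inputs for the same target.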
On the other hand, our result provides the crustal thickness of the study region and its geodynamic implications. We find relatively good agreement of our result (Fig. 6) with CUB2 (Fig. 7) and CRUST2.0 (Fig. 8). All three models indicate that the crust is thicker to the west of the Longmen mountains than to the east. Moreover, our result reveals more details. The crustal thickness of the eastern Tibetan Plateau is complex and varies strongly. The average crustal thickness is above about 60 km and reaches about 70-75 km at the Qiangtang block, under which there is a northward-dipping Moho gradient zone. The crust at the Songpan-Ganzi block is relatively thin and is characterized by a decrease in the northwest-southeast orientation. Model CUB2 indicates that the crustal thickness of the Sichuan basin is about 40 km and relatively smooth; our model, however, reveals some variation in this region: the crust is thin around Chengdu, especially to the northeast of Chengdu; the crust under the Qinling-Dabie fold belt is about 50 km thick; and the crustal thickness to the northeast of the Sichuan basin is about 45~48 km. Furthermore, the crust around Xi'an and the Ordos basin is thin, at about 35 km. By contrast, the crustal thickness changes sharply in the Sichuan-Yunnan block, from 60 km in the northwest to 35 km in the southeast. All of these details are consistent with Wang et al. (2010), who estimated crustal thickness with the H-k stacking method from broadband teleseismic data recorded at 132 seismic stations in the Longmen mountains and adjacent regions (26°~35°N, 98°~109°E).
From a geological viewpoint, eastern Tibet and the western Yangtze craton have a very complex structure and tectonics, where several tectonic blocks, including the Yangtze Platform, the Songpan-Ganzi Fold System, the Qiangtang Block, and the Indochina Block, are interacting with each other. The region is a site of important processes associated with the India-Asia collision and the abutment against the stable Yangtze Platform, including strong compressional deformation with crustal shortening and thickening, and east-west crustal extension. The plateau surface has been elevated to 4-5 km, and the Tibetan crust has doubled in thickness since the collision [Chen et al., 1996; Flesch et al., 2005; Wang, 2010]. Strong earthquakes often occur on the active faults inside and on the edge of the plateau, which are the most active seismic areas within the mainland. After analysing the distribution of epicenters during 1970-2015, we find that great earthquakes in Sichuan and Yunnan have occurred in the brittle upper crust of the Longmen mountain fault zone, where crustal thickness changes sharply by about 10 km and where the great Ms 8.0 Wenchuan earthquake in 2008 and the Ms 7.0 Lushan earthquake in 2013 occurred. These events are attributed to the Songpan-Ganzi Fold System and the Qiangtang Block obliquely colliding with the Yangtze Platform. A possible reason is that the main fault cuts the Moho discontinuity, where material exchange between crust and mantle and accumulating stress induce frequent earthquakes.
On the other hand, our results show that deep learning neural networks can invert for crustal thickness effectively because of their capability to represent complex inverse functions. The test error appears to depend on the number of layers, with more layers giving smaller test errors: in Table 1, when the number of layers increases from three to six, the test error decreases from 1.7e-4 to 2.5e-6. In addition, when the size of batch decreases from 1e4 to 1e3, the test error decreases from 1.7e-4 to 2.5e-5. Also, when the number of epochs increases from 10 to 100, the corresponding test error decreases from 2.0e-5 to 8.4e-6.
Deep learning neural networks are robust. Once the number of layers reaches six, changing the number of neurons in each layer has little influence on the test error, which is about 2.5e-6.
For the neural network structure marked ※ in Table 1, the misfits of our model with models CUB2, CRUST2.0 and ZJS are relatively low (6.62, 6.70 and 6.63), and the corresponding correlation coefficients are relatively high (0.78, 0.80 and 0.69, respectively); however, its test error of 8.4e-6 is not the minimum. This tells us that the test error may not be the only criterion for determining which neural network is best, because a small test error may result from overfitting.

Conclusion and remarks
Using the sSAE deep learning network, we present a crustal thickness map of eastern Tibet and the western Yangtze craton (Fig. 6). The data sets consist of phase velocities of Rayleigh waves from Xie et al. (2013) at the discrete frequencies 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0, 32.5 and 35.0 mHz, and derived group velocities of Rayleigh waves at the discrete frequencies 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5 and 30.0 mHz. We conclude that:
(1) For all our simulations we use sSAE with different neural network structures, which are determined by many factors, for instance the number of hidden layers and neurons in the neural networks, and optional parameters such as the number of epochs, the size of batch, the type of activation function, and the values of the learning rate and non-sparsity penalty. We find that the number of hidden units is not a crucial parameter, since neural networks with different numbers of hidden units give similar results, whereas the size of batch is an important factor for the results.
(2) The twelve networks produced different inversion results. When the test error reaches a certain value, the misfits are high and the correlation coefficients are low, which we think may be caused by overfitting. This means that the networks fit the training data set well, but their generalization ability does not increase. In our future work, we will focus on how to resolve this problem when using sSAE.
(3) We present a crustal thickness model for eastern Tibet and the western Yangtze craton and compare it with current knowledge about crustal structure as represented by ZJS, CRUST2.0 and CUB2. The overall agreement with these three models is very good, and agreement is generally better with CUB2 and CRUST2.0, attained from relatively dense stations with rich data coverage and higher resolution.
(4) The results are obtained using a neural network approach called sSAE, which is widely and successfully used in pattern recognition. Because geophysical inversion is so complex, we should analyse and enhance the neural network before applying it to such complicated problems.
Geological background about eastern Tibet and the western Yangtze craton is added in the revised paper on page 10, from line 23 to line 36.
4) What are the merits of sSAE over other methods in fact? For instance, a deep seismic sounding profile is direct evidence of crustal thickness; what happens when the two kinds of results are mapped together? Not the digital number listed in the table.
the model parametrization methodology outlined in de Wit et al. (

Figure 3. Crustal thickness inversion based on the sSAE neural network, composed of two parts: Forward Problem and Inverse Problem.

Figure 4. Averaged phase velocity of the western Yangtze craton (Xie et al., 2013) from 10 to 35 mHz. The black lines show structure lines. The blue lines show boundaries of sedimentary basins. The red dots show seismic events in this region from 1975 to 2015, and the dot size indicates magnitude from Ms 6.0 to Ms 8.0. The yellow and purple stars mark the Wenchuan and Lushan earthquakes, respectively. The same symbols are used in Figure 4, Figure 6 and Figure 7.

Figure 5. Averaged group velocity of the western Yangtze craton (Xie et al., 2013) from 10 to 30 mHz.

Figure 5. The relationship between the proportion of training data sets to test data sets and the test errors.

Figure 6. Crustal thickness of the western Yangtze craton from this paper.

Figure 7. Crustal thickness of model CUB2 from Shapiro & Ritzwoller (2002), which has relatively sparse stations with poor data coverage and lower resolution.