Interactive comment on “ Non-parametric Bayesian mixture of sparse regressions with application towards feature selection for statistical downscaling ”

The manuscript reports a Bayesian mixture of sparse regression models for clustering using the Dirichlet process mixture model for feature selection. The purpose is to be able to find interpretable clusters that can allow for climate model downscaling based on key features. This is a potentially useful approach to the discovery of unique features in highly complex datasets for use in statistical downscaling.


Introduction
Climate change is one of most challenging problems facing humankind.Its impacts are expected to influence policy decisions on critical infrastructures, management of natural resources, humanitarian aid, and emergency preparedness along with numerous regional-scale human economic and social activities.Therefore, it is imperative to accurately assess the impacts of climate change at regional scale in order to inform stakeholders for appropriate decision making related to mitigation policies.Global climate models (GCMs) are the most credible tools at present for future climate projection that accounts for the effects of greenhouse gas emissions under different socio-economic scenarios.Although GCMs perform reasonably well in projecting climate variables at a larger spatial scale ( > 10 4 km 2 ), they perform poorly for regional-scale climate projections.Such poor performance of the GCMs, coupled with the importance of regional climate projections for impact studies, has led to development of limited area models (LAMs) or regional climate models (RCMs), where finer spatial grids over a limited spatial area are embedded within a coarser GCM grid.This method is also known as dynamic downscaling.However, these models are complex, computationally expensive and require rerunning for each new region.Moreover, regional models inherit the basic gaps in understanding of climate physics that limit the performance of GCMs.A couple of recently published studies (Kumar et al., 2014;Knutti and Sedláček, 2013) rigorously compared the projections of the latest generation of climate models (CMIP5) with the previous generation (CMIP3) but found no significant improvement in the majority of statistical performance metrics even with higher spatial resolutions and addition of new physical processes in the computational model.Uncertainties in sub-grid-scale cloud-microphysics and ocean eddy processes and poor understanding of the effect of carbon cycle and other biogeochemical processes on climate systems still limit the ability of the physics-based climate models to reliably project future climate (Bader et al., 2008), especially at regional scale.
A complementary approach for regional projection is statistical downscaling that uses statistical models to learn empirical statistical relationships between large-scale GCM features (predictors) and regional-scale climate variable(s) (predictands) to be projected.The statistical approaches of downscaling can be categorized into three broad classes -weather typing, weather generators, and the transfer function approaches (Wilby et al., 2004).Weather typing approaches have originally been developed for weather forecasting and generally involve classifying days into similar clusters or weather states based on their synoptic similarity.Typically, weather patterns are clustered based on their similarity with nearest neighbors while the statistical models they use vary in their definition of similarity measures.On the other hand, weather generators replicate the statistical properties of the daily predictand variable by using a stochastic model, such as Markov processes (Greene et al., 2011), that uses wet-dry and dry-wet transition probabilities as input for training while conditioning its parameters on large-scale predictors.
In this paper, however, we are interested in transfer function based regression models that learn a linear or nonlinear mapping between large scale predictors and regional scale predictand variables.Regression models are conceptually the simplest of the three classes since they provide a direct mapping between the predictor and predictand values.However, the success of the regression models depends on the accurate choice of predictors.Sparse regressions based on constrained L1-norm (Tibshirani, 1994) of the coefficients became popular due to their ability to simultaneously select covariates and fit parsimonious linear models that are more generalizable and easily interpretable.Although sparse regression models have been applied widely in many disciplines, their application to climate, and especially to statistical downscaling, has remained very limited.In a recent paper (Ebtehaj et al., 2012), sparse regularization has been shown to be effective for downscaling rainfall fields for weather forecasting, whereas sparse variable selection has been used for statistical downscaling of climate variables (Phatak et al., 2011) in a separate paper.To our knowledge, there is no other published work on use of sparse regularization for statistical downscaling.
However, large complex climate data sets often exhibit dynamic behavior (Kannan and Ghosh, 2010) which may not be modeled well by a single regression model.Here we propose a nonparametric model for mixture of sparse regressions that can accommodate multiple sparse linear relationships inherent in the data set.Nonparametric models are more flexible than the finite mixture models (Bishop and Svenskn, 2002) since they assume no prior knowledge about the number of distinct components in the data.We used a Dirichlet process mixture (DPM) (Antoniak, 1974) with stick-breaking construction (Ishwaran and James, 2001) to accommodate an unknown number of sparse regression models in the data.DPM start by assuming infinite components in the data but ends up discovering a finite number of components supported by the data.We used the Bayesian version of sparse regression (Park and Casella, 2008) to smoothly integrate the sparse regression model with the DPM, which is a nonparametric Bayesian approach where each component is represented by a set of distribution parameters specific to the corresponding component.
Although the number of different components may not be known, prior knowledge often exists about whether a pair of observations belong to the same component.For example, it is reasonable to assume that two observations close in time from the same location may exhibit similar behavior.We allow soft "must link" constraints between pairs of data-points that encourage the pair to belong to the same mixture component.Such constraints are incorporated in our Bayesian model with the help of a Markov random field (MRF) prior over the cluster indicator variables (Ross and Dy, 2013;Basu et al., 2006).
Variational Bayesian (VB) inference has been shown to be much faster than stochastic alternatives for nonparametric Bayesian models (Blei and Jordan, 2006).The major contribution of this paper is to develop a fully Bayesian formulation for nonparametric mixture of a sparse regression model and designing an efficient variational inference algorithm to obtain posterior distributions over the regression coefficients of potentially multiple regression components as well as the component membership probabilities of each data-point.
We have extensively demonstrated the performance of our algorithm on synthetic data.We have also applied our method to the feature selection problem for statistical downscaling of annual average rainfall over two regions on the west coast of the USA.Preliminary results from the application of our algorithm to select features for regression based statistical downscaling show that our method may lead to improved prediction and discovery of new insights.

Background
In this section, we provide brief descriptions of the methods in the context they were used to build our model.

Bayesian sparse regression
Let us assume that we are given a data set D = {x n , y n : n = 1, . . .N } that has been generated from a linear model identified by sparse coefficients vector β.In a non-Bayesian setting, sparsity is enforced by a constraint on the L1-norm of the coefficients which is given by where ∼ N (0, τ −1 ).However, in a Bayesian setting, the sparsity can be imposed by a Laplace prior (also known as double exponential distribution) on β which is given by (Park and Casella, 2008): However, due to the analytical intractability of the Laplace prior, it is often represented in the following scale-mixture (of Gaussians) form using an additional random variable α.
For a fully hierarchical Bayesian setting, Gamma prior is imposed on parameter τ as well as on individual penalty parameters γ j .So the joint distribution over all the parameters can be given by (3)

Markov random fields
An MRF is represented by an undirected graphical model in which the nodes represent variables or groups of variables and the edges indicate dependence relationships.An important property of MRFs is that a collection of variables is conditionally independent of all others in the field given the variables in their Markov blanket.The Hammersley-Clifford theorem states that the distribution, p(Z), over the variables in an MRF factorizes according to where Z is a normalization constant called the partition function, C is the set of all cliques in the MRF, z c is the set of variables in clique c, and H c is the energy function over clique c (Geman and Geman, 1984).A clique is a set of nodes in a graph that are fully connected.The smallest clique in a graph is an edge.The energy function captures the desired configuration of local variables.Partition function Z normalizes the probability measure and it is computed by summing the exponentiated energy functions of all possible configurations.

Dirichlet process mixture (DPM)
The Dirichlet process (DP) was first introduced in statistics literature as a measure on measures (Ferguson, 1973).It is parameterized by a base measure, G 0 , and a positive scaling parameter λ: G| {G 0 , λ} ∼ DP (G 0 , λ) . (5) The notion of a DPM arises if we treat the kth draw from G as a parameter of the distribution over some observation (Antoniak, 1974) representing a particular mixture component.DPMs can be interpreted as mixture models with an infinite number of mixture components in the sense that data exhibit a finite number of components but previously unseen components represented by new data can still be accommodated.More recently, a variational inference algorithm for DPMs was introduced (Blei and Jordan, 2006) using the stick-breaking construction (Sethuraman, 1994) which uses two infinite collections of random variables For a mixture of sparse regression models, if the parameters for each components are given by η k , the subsequent data generation process for such a mixture model can be described in the following steps using a stick-breaking construction: 4. For each data-point n: We can truncate the construction process at k = K by enforcing v K−1 = 0 which forces all θ k for k > K to be zero (see step 3).The resulting construction is called a truncated DP (TDP), which can be shown to approximate the true DP quite well given K is large relative to the number of the datapoints (Ishwaran and James, 2001).

Methodology
Now, let us assume that we are given a data set D = {x n , y n : n = 1, . . .N } which has been generated from a mixture of K different sparse models identified by sparse coefficients β (1) , β (2) , . . ., β (K) .Let us also assume that the number of components K is unknown.We use a Bayesian formulation of the sparse regression model for each component β (k) , with k = 1, 2, . . .K. Let us first state the Bayesian version of the kth sparse model.The linear regression model of the kth component can be represented by the following Gaussian distribution.

Mixture of sparse regressions
We introduce K-dimensional latent indicator variables {z n : n = 1, . . .N } to represent the component membership of each data-point {x n , y n }.If the data-point belongs to the kth component, then z nk will be 1 and all other elements of z n will be 0. We further denote Z = [z 1 z 2 . . .z n ].We can now rewrite Eq. ( 8) in terms of z n as For this mixture of sparse regressions model, each component has a separate parameter set {β (k) , τ k }.Moreover, after adding the parameters related to the scale-mixture representation of the Laplace prior on β (k) (refer to Sect.2.1), the set of parameters is finally given by η k = {β (k) , τ k , α k , γ k }.The prior distribution G 0 from which these parameters can be drawn jointly is given in Eq. ( 3).We can now use the stickbreaking construction described in Sect.2.3 to formulate our mixture model.The overall generative process is then: The graphical model that represents the dependence relationships between all the parameters involved in this current mixture model is shown in Fig. 1.The shaded circles denote observed variables; the unshaded circles denote unobserved variables.We have used a Gamma prior on λ having a hyper-parameter m 0 .We have omitted the hyper-parameters a 0 , b 0 , c 0 , d 0 , and m 0 from the list of conditioning variables in the left side to avoid clutter.The individual distributions in Eq. ( 10) are given below.

Accommodating "must link" constraints
Prior knowledge about must link constraints between pairs of data-points can be enforced via an MRF prior on the indicator variables z n , where each data-point is considered a node and each constraint between a pair of data-points is regarded as an edge between the respective nodes.We denote the collection of edges by C and the MRF prior is given by Eq. ( 4).We define the energy function as: Here ML means must link.This prior encourages similar values of indicator variables z i and z j if they happen to share a "must link" edge.Since the MRF prior is assigned only on the indicator variables Z, it only alters Eq. ( 11b) and the new prior on Z is given by

Variational inference
Let us consider all the unknown parameters in our model as latent variables and denote all the latent variables by }, λ}.Moreover, from now on, we will ignore feature variables X from the list of conditioning variables as they are observed.Using Jensen's inequality, we can find a lower bound of the log-marginal ln p(y) which is given as ln p(y) > q(H) ln p(y, H) q(H) dH ( 14) for any arbitrary distribution q(H).The variational inference is performed by restricting q(H) within a parametric family so that the maximization of the lower bound given in Eq. ( 14) is tractable.We consider only those q(H) that factorize over some disjoint groups of the component random variables of H in the following way: We can now maximize the lower bound given in Eq. ( 14) with respect to each component q j (h j ) in Eq. ( 15) and obtain the parametric form of q j (h j ) given by where the expectation is taken with respect to all the other factors {q i } for i = j .It can be shown that the q(H) obtained this way is the closest approximation of the actual posterior p(H|y) in terms of KL-divergence out of all possible alternatives of the form given by Eq. ( 15).Therefore this is a deterministic but approximate posterior inference method, unlike stochastic inference methods such as MCMC, which samples from the actual posterior.However, variational inference is much faster and approximates the true posterior reasonably well for practical purposes.
Once we apply Eq. ( 16) to the joint distribution described in Eqs. ( 10) and ( 11), we can get the update equations for the approximate posterior distributions for each of the latent variables involved.
1. Distribution of z: with 2. Distribution of {β (k) }: with Here diag( α (k) ) corresponds to the LASSO (Tibshirani, 1994) shrinkage.The moments are given by1 3. Distribution of τ : where The relevant moments are with Relevant moments are given by ln with where InvGaussian(α (k) j ; g k j , h k j ) denotes inverse Gaussian distribution with mean g k j and shape parameter h k j having the following density function.
The relevant moments are given by 6. Distribution of {γ (k) }: with and the relevant moment is γ where Relevant moment is λ = u w .The first part of the variational posterior of q Z (Z) in Eq. ( 17) arises from the MRF prior and contributes towards enforcing "must link" constraints.Note that V in Eq. ( 17) is a set of sets and V is a component set of connected nodes within V. Basically, V denotes the set of connected components within the constraint graph described in Sect.3.2.Therefore the partition function Z V needs to be computed only for the connected components, not for the entire graph.Computing Z V becomes tractable if the connected components are small (i.e., the constraint set is sparse).
In order to automatically generate a sparse constraints set, we first implemented all the constraints in the form of edges and then used a graph partitioning algorithm (Hespanha, 2004) to partition the constraint graph in such a way that none of the partitions are left with more than a predefined number of nodes.At the time of inference we used a "backtracking" algorithm (Tarjan, 1972) to find the strongly connected components within the graph.To compute the expectation E[z], we first computed the multinomial probabilities ρ nk and then did an MRF update on each connected component by computing the probabilities of each possible state combination and summing the probability-weighted state matrices.The partition function is computed by summing the exponentiated sum of energy function of each state matrix.Note that isolated nodes (not part of any connected components) will not need their ρ nk updated.
The parameters of each of the distributions has dependency on moments of one or more of the other variables.We therefore find a locally optimum solution via an iterative process that starts with random initial values of the relevant moments and stops when the indicator variables Z stop changing.Note that once the approximate solution is reached, we can compute the marginal distributions over coefficients β (k) p which is a Gaussian with mean µ (k) p and variance (k) pp for each k.We can thereby perform a t test to determine whether the corresponding feature has a non-zero coefficient.

Computational considerations
One computational bottleneck of the proposed VB algorithm is the inversion of the D × D matrix in Eq. ( 21).If D < N , then faster matrix inversion can be achieved by first applying a Cholesky decomposition and then inverting the resulting upper triangular matrix.However, if D > N , we can first apply a fast (approximate) singular value decomposition on (k)−1 and then use Woodbury matrix inversion identity so that we now have to invert a N × N matrix instead.
We have truncated the infinite DP at K = 20 for most of our experiments.The speed of the algorithm can be further improved by parallelizing the updates for each of K components, which is straightforward as they are updated independent of each other.Another major computational challenge was the MRF updates.Apart from controlling the maximum size of the connected components, we parallelized the MRF updates over each subgraph by making the state generation independent of the previous state.

Experiments
We have evaluated our method on both synthetic and climate data sets.Typical values used for the hyper-parameters were a 0 = b 0 = c 0 = d 0 = 0.01 and λ = 1.Selecting these values within a reasonable range does not affect the results significantly.We made sure that the cardinality of the largest connected component in the constraints graph never exceeds 8.

Synthetic data set
We compared the performance of both constrained and unconstrained versions of our method with the non-parametric mixture of linear regression (NPMLR) model without any regularization.We set up three experiments: (1) to test whether or not our algorithm can learn the correct number of clusters; (2) to evaluate the effect of constraints; and (3) to check the sensitivity of our approach to noise.For all our experiments involving synthetic data, we used N = 1000 data-points and D = 30 features.In our first set of experiments we tested our method for K = 2 . . . 5 actual clusters.Each column of the N × D input matrix X is generated from a uniform distribution.For each value of K, we partitioned the input matrix X in K equal parts X 1 . . .X K .Then for each partition X k (k = 1 . . .K), we generate sparse coefficients β k by randomly selecting 10 out of 30 components to be non-zero.We assign a value of 5 k (where k is the index of the cluster, k = 1, . . ., K) to the non-zero components within the kth cluster so that two clusters are distinctly identifiable in case the indices of non-zero components of the clusters are the same.We then generate the output y k for the kth cluster using the linear regression model of Eq. ( 1).The fixed noise variance τ −1 k for the first experiment was generated by randomly choosing a number between 0 and 0.1 to introduce diversity.A final data set was obtained by merging {X k , y k } for all k = 1 . . .K. The process is repeated 30 times and mean and variance of the evaluation metrics were reported in the form of error bars for each value of K in Fig. 2. For all these experiments, the total number of constraints was kept at 20 per cluster while the size of the largest subgraph was kept below 7.
The second experiment was performed to evaluate the effect of number of "must link" constraints on the performance of the constrained version of the algorithm.Here, the actual number of clusters was fixed at K = 3 along with the base noise variance (0.1) and the number of constraints per cluster was varied from 0 to 30 incremented by 5, although the actual number of constraints may be less since we removed some constraints to achieve sparsity in the constraint graph.The result is reported in Fig. 3.
In our third experiment, we evaluated the effect of noise on the performance of our algorithm.Again, we kept the number of clusters fixed at K = 3 and the number of constraints fixed at 20 per cluster (for the constrained version).We varied the base noise level in each cluster from 0 to 0.5 and added a randomly generated value between 0 and 0.1 with the base noise level for each cluster to maintain diversity among the clusters.Average and variance of 30 repetitions are reported in Fig. 4.

Evaluation metrics
We measured two aspects of the performance of our algorithm.First,we measured whether it can cluster the datapoints correctly.We put a data-point into one of the possible 20 components (since we truncated the infinite DP at K = 20 for all experiments) depending on the value of the row E[Z] n (a vector) in the N × 20 matrix E[Z] estimated by the variational inference algorithm.The estimated cluster membership ĉn (a scalar) is given by ĉn = argmax k E[Z] nk .We retain all the valid components out of 20 possible, which have at least one member initially.Then we run an update algorithm to merge very small clusters with the closest larger  ones.Note that the estimated cluster indices (a value between 1 and 20) may not correspond directly to the actual cluster indices (a value between 1 to actual value of K) since the variational inference algorithm is not aware of the actual order of the cluster indices (e.g., actual cluster index 1 may correspond to estimated cluster index 9).So we use a metric called normalized mutual information (NMI) that evaluates the match between estimated cluster memberships ĉ and actual ones c without needing direct correspondence.NMI is given by NMI , where H (•) is the entropy.Higher NMI values mean that the clustering results are more similar to ground-truth.The metric reaches its maximum value of one when there is perfect agreement.
A second metric is used to evaluate the quality of the sparse regression model estimated within each discovered cluster.Here we are only interested in finding whether our algorithm picks the non-zero coefficients correctly.We use F score to measure the match between actual and estimated non-zero coefficients within each cluster.F score for the kth component is given by , where P k is the precision and R k is the recall of the estimated coefficients for the kth component.We reported the average of F k values over all components discovered by our algorithm.Unlike the previous metric, here we need to know the direct correspondence between the cluster indices so that we can match the actual and estimated coefficient vectors.We developed an algorithm to find such a correspondence based on bipartite matching.

Discussion of results
We can see the performance of all three algorithms are comparable in terms of identifying the clusters correctly, although the NMI value of NPMLR degrades significantly for K = 5.However, as desired, our method outperforms NPMLR in terms of correctly retrieving the sparse structure of regression coefficients within each cluster.There is a general downward trend of performance for all algorithms with increasing number of actual components in the data.This is an inherent problem with the DPM models as it tends to attach each new data-point to the largest current component, thereby favoring models with fewer components.Also, as the number of actual components grows, the probability of two components being similar increases.
The increased flexibility of non-parametric methods comes at a cost of hitting local optima being more likely and finding solutions that are not interpretable.Adding more constraints may decrease this probability but at the same time restricts the variational method from finding solutions leading to a larger lower bound, especially in the presence of more components in the data.Therefore increasing the number of constraints may result in more interpretable solutions, but not improved accuracy.It is also encouraging to see that our method is relatively robust to added noise, a major challenge with the real data sets, especially in terms of correctly identifying the sparse structure.

Feature selection for downscaling rainfall
A grand challenge in climate science relevant for adaptation and policy remains our inability to provide credible stakeholder-relevant "statistical downscaling", or to develop statistical techniques for more accurate, precise and interpretable high-resolution projections with lower-resolution climate model data (Benestad et al., 2008).Regression models of statistical downscaling (Benestad et al., 2008;Ghosh, 2010) work by first selecting a set of climate variables that have information about the target variable, and then fitting a regression model to predict the target variable at higher resolution.In this application, selecting the right set of predictors is as important as building a prediction model since even a good prediction with a model that is physically not interpretable is less desirable as it may not generalize well.We focus on the feature selection problem for statistical downscaling of annual average rainfall.The use of annual averages reduces the amount of noise in the observed rainfall data, which enables us to examine the robustness of our methods with less ambiguity.
Existence of multiple states or patterns is acknowledged in regression-based statistical downscaling literature for rainfall (e.g., Kannan and Ghosh, 2010) where parametric methods such as k-means were used to find distinct clusters.Here we used our model to simultaneously find clusters, if any, and select features for the purpose of statistical downscaling of station-observed annual average rainfall over two climatologically homogeneous regions over the continental US.Since rainfall follows a log-normal distribution (Kedem and Chiu, 1987), the target variable we used is logarithm of annual average rainfall.In Fig. 6, we show the distribution of average rainfall over all sites in western US before and after taking the logarithm.
Potential features used can fall in one of two broad categories -local atmospheric variables and large-scale climate indices.Local covariates originate from each station and exhibit both spatial and temporal variability.Annual and seasonal averages of maximum temperature fall in this category along with sea level pressure (SLP), and convective available potential energy (CAPE).A dependence on any of these variables roughly indicates dominance of local convective rainfall in the region.Daily rainfall station data were obtained from US Historical Climatology Network (USHCN) (Easterling et al., 1996).All other features are described in Table 1.Table 1.Potential features used for statistical downscaling of rainfall.
Atmospheric (Easterling et al., 1996;Mesinger et al., 2006 1.A dependence on any of these variables roughly indicates rainfall due to large-scale circulation.In addition to these covariates, we have used elevation as a potential feature which falls under none of the above categories.This is the only feature that represents the geography of the region.
We could use the covariates between 1979 and 2011 as SLP and CAPE is available only for that period.Also, if more than 50 % of the daily observations in a year are found to be missing for any covariate at a specific location, we simply discarded all covariates for that year and for that specific location.We averaged monthly climate indices and daily local variables over a year.Finally the annual/seasonal average time-series of predictors for each station were merged for a homogeneous region under consideration.West (CA,

Results and discussion
We applied spatial "must-link" constraints among pairs of data-points from the same location.Ideally, if there are n points in a cluster, we will be required to put n 2 constraints to cover all pairs of data-points.To reduce complexity, initially we kept only those constraints that connect datapoints from consecutive years.However, this reduced set of constraints proved to be too restrictive and all data-points tended to merge into a single cluster.So, we kept removing the constraints in an intuitive manner until more than one cluster emerged for a region.We found more than one cluster for all regions except the southern region.We stopped removing constraints until new clusters stopped emerging for a region.Here we show only the clusters in the western and northwestern regions, since the majority of stations were mostly split into obtained clusters in these regions.In other regions, almost all stations had mixed membership.We assign a station to a cluster if more than 80 % of its data-points belong to that cluster.
A quick look at the histogram of target variable (right panel in Fig. 6) also supports the possibility of two distinct rainfall modes in the region.As mentioned earlier, we obtained one sparse linear model for each of the discovered components within a region.Since a non-zero coefficient in the sparse model implies dependence on the corresponding covariate, we can obtain interesting insights about the dependence of average rainfall on various atmospheric and climate indices from the coefficients of the individual sparse models within each cluster.Interestingly, in the northwest region there is only a single member station in the first component that exhibits dependence on the local temperature variables and SLP, whereas the larger cluster shows dependence on a larger number of climate indices.In the western region, the first cluster shows dependence on local temperature variables and the second cluster shows more dependence on largescale variables.Both clusters show dependence on elevation.While dependence on large-scale indices is not surprising for both these coastal regions due to the known effect of westerlies, dependence of smaller clusters (especially in the northwest) on local variables may hint at the existence of some regional small-scale atmospheric mechanisms.While spatially coherent clusters are more likely to occur in nature, geographical features such as mountains and lakes and even man-made structures such as large dams and reservoirs may abruptly disturb the spatial smoothness of clusters, since their presence may alter the climate pattern of the nearby areas with respect to the surrounding regions.However, before we can build statistical downscaling models, more rigorous statistical and physical analysis is required based on these preliminary insights obtained using our method.The clusters discovered here, and the corresponding covariates, can be utilized to develop individual non-linear prediction models per cluster.
DPMs automatically find the number of clusters K and adapt to varying values of K.However, DPMs prevent the model from "learning" an unnecessarily large value of K if a smaller K is sufficient to describe the model, thus managing complexity.Based on the results of experiments on the synthetic data set shown in Fig. 2, we found that the performance of the method degrades as the number of components K grows larger.We believe it is reasonable to expect that there will only be a limited number of distinct relationships between average rainfall and their covariates when we apply our method at the regional scale.However, even in situations where a large number of relationships exist within a particular region, our method may not be able to identify all of the distinct methods, but it can nevertheless be expected to outperform the use of a single model.The single model will attempt to learn a relationship that is the average of all distinct relations, while our approach will still attempt D. Das et al.: Non-parametric Bayesian mixture of sparse regressions to distinguish among major categories of relationships even though some of them may be lumped together.

Conclusions
In this paper, we propose a nonparametric Bayesian mixture of sparse regression models for simultaneous clustering and discovery of covariates within each cluster using a DP mixture model.Moreover, our model can accommodate prior knowledge about "must link" constraints between the pair of data-points using a Markov Random Field prior on the cluster membership variables.Our major contribution is to develop an efficient and scalable variational inference algorithm for inference on the fully Bayesian model.We applied our method to both synthetic and real climate data and successfully discovered multiple underlying behaviors in the data.Preliminary results of applying our method to feature selection for statistical downscaling of rainfall show promise towards finding new climate insights with appropriate caveats.Going forward, we would like to incorporate priors for diversity among the clusters in order to discourage merging of close but dissimilar clusters.We intend to extend our model for predictive analysis and build a full-scale statistical downscaling method using the features selected by the current model.

Figure 1 .
Figure 1.Graphical representation of the complete Bayesian hierarchical model.

D.
Das et al.: Non-parametric Bayesian mixture of sparse regressions with

Figure 2 .
Figure 2. Left panel: ability of nonparametric unregularized and sparse regressions (unconstrained and constrained) to correctly identify clusters in presence of increased number of actual components in the data.Right panel: ability of nonparametric unregularized and sparse regressions (unconstrained and constrained) to correctly retrieve the sparse structure within each cluster.

Figure 3 .
Figure 3. Performance of the constrained version of the algorithm (in terms of NMI (more the better)) with number of "must link" constraints.

Figure 4 .
Figure 4. Left panel: ability of nonparametric unregularized and sparse regressions (unconstrained and constrained) to correctly identify clusters (indicated by NMI) with increasing noise.Right panel: ability of nonparametric unregularized and sparse regressions (unconstrained and constrained) to correctly retrieve the sparse structure within each cluster (indicated by average F score).
Figure  5shows the climatologically homogeneous regions over the US.

Figure 5 .
Figure 5. Map showing climatologically homogeneous regions over continental US.

Figure 6 .
Figure 6.Left panel: distribution of average rainfall over all sites in the western US.Right panel: distribution of average rainfall after transformation.

Figure 7 .
Figure 7. Left panel: location of stations and their cluster membership in the western region.Right panel: location of stations and their cluster membership in the northwestern region. )