The purpose of a sensitivity analysis (SA) is to quantify the influence of different parameters on the model outputs of interest. The SA process works as follows: (1) identify the adjustable parameters and their feasible ranges, (2) generate samples in the high-dimensional parameter space and run the model with those samples, and (3) choose one or more appropriate SA methods and objective functions to quantify the sensitivity of the output variables to the input parameters. The adjustable parameters and their feasible ranges are presented in the “Results and discussion” section. Below, we provide a brief description of the sampling method, SA methods, and tools used in this study.
Sampling method
In previous research (Gong et al. 2016a), we found that the good lattice points (GLP) method can generate more uniform samples than other methods, such as the widely used Monte Carlo and Latin hypercube methods; therefore, we chose this sampling method in this study. The GLP method, also known as the Korobov lattice rule (Hlawka 1962; Korobov 1959a; Korobov 1959b; Korobov 1960), is a number theory-based quasi-Monte Carlo (QMC) method. The GLP design is generated by the following equations:
$$ \left\{\begin{array}{l} q_{ki} = k h_i\ (\operatorname{mod}\ n) \\ x_{ki} = (2 q_{ki} - 1)/(2n) \end{array}\right. \quad k = 1, \cdots, n;\ i = 1, \cdots, s $$
(1)
where n represents the number of samples, s represents the number of dimensions, xki represents the coordinate of the kth sample point in the ith dimension, qki represents an internal variable, and hi represents an element of the generating vector. The coordinate xki is restricted to the range [0, 1], and the greatest common divisor of hi and n must be 1. The vector (n : h1, ⋯, hs) is called the generating vector. If the point set Pn = {xk = (xk1, ⋯, xks), k = 1, ⋯, n} is more uniform than the point sets produced by any other generating vector, then Pn is selected as the GLP set. With the uniformly scattered samples generated by the GLP method, we can cover the parameter space with fewer samples and thus reduce the computational cost of the sensitivity analysis.
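As a concrete illustration of Eq. (1), the following minimal sketch generates a GLP design for a given generating vector. The vector used here is only an example whose components are coprime with n; the search for the most uniform generating vector is omitted.

```python
# Minimal sketch of Eq. (1): good lattice points from a known generating vector.
import numpy as np

def glp_design(n, h):
    """Return an n-by-s array of GLP samples in [0, 1]^s for generating vector h."""
    h = np.asarray(h)
    k = np.arange(1, n + 1).reshape(-1, 1)   # k = 1, ..., n
    q = np.mod(k * h, n)                     # q_ki = k * h_i (mod n)
    q[q == 0] = n                            # map residue 0 to n so that q_ki is in {1, ..., n}
    return (2 * q - 1) / (2 * n)             # x_ki = (2 q_ki - 1) / (2 n)

# Example: 21 samples in 4 dimensions; every h_i is coprime with n = 21.
samples = glp_design(21, [1, 5, 13, 17])
```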
Sensitivity analysis methods
This study employed two qualitative SA methods to perform the parameter screening: multivariate adaptive regression splines (MARS) and random forests (RF). Moreover, to validate the parameter screening results obtained by the qualitative methods, the sparse polynomial chaos expansion (PCE)-based Sobol’ method (SPC) was applied to compute the total effects of the parameters.
Multivariate adaptive regression splines
The MARS method (Friedman 1991) is a generalization of stepwise linear regression and is suitable for high-dimensional problems. The two piecewise linear basis functions (x − t)+ and (t − x)+ used in MARS are called a reflected pair, where the constant t is called the knot. The aim is to form reflected pairs for each parameter Xj with knots at each observed value xij of that input. Therefore, the collection of basis functions is
$$ \mathrm{C} = \left\{ (X_j - t)_+, (t - X_j)_+ \right\}, \quad t \in \{x_{1j}, x_{2j}, \cdots, x_{Nj}\},\ j = 1, 2, \cdots, p $$
(2)
where N represents the number of samples, p represents the total number of adjustable parameters, and Xj represents the jth adjustable parameter.
The MARS method includes a forward procedure and a backward procedure. First, a forward stepwise linear regression is built using functions from the set C and their products. Thus, the model has the form
$$ f(X)={\beta}_0+\sum \limits_{m=1}^M{\beta}_m{h}_m(X) $$
(3)
where β0 represents the intercept, βm represents the slope, and f(X) corresponds to the predicted value of the observable variable (i.e., the output variable), such as temperature or precipitation. Both β0 and βm are regression coefficients, and their values are estimated by minimizing the residual sum of squares. Each hm(X) is a function from the set C or a product of two or more such functions, and M represents the number of functions. Equation (3) is a regression model that can predict the value of the observable variable yi given the parameter values X.
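To make Eqs. (2) and (3) concrete, the following minimal sketch evaluates a small MARS-style model built from reflected pairs; the knots and coefficients are illustrative and are not values fitted in this study.

```python
# Illustration of Eqs. (2)-(3): a MARS model is a weighted sum of hinge
# (reflected-pair) basis functions and their products. Knots and coefficients
# are illustrative only.
import numpy as np

def hinge_pos(x, t):
    """(x - t)_+ : positive part of (x - t)."""
    return np.maximum(x - t, 0.0)

def hinge_neg(x, t):
    """(t - x)_+ : positive part of (t - x)."""
    return np.maximum(t - x, 0.0)

def mars_model(X):
    """f(X) = beta_0 + sum_m beta_m h_m(X) with three illustrative terms."""
    beta0 = 1.0
    return (beta0
            + 2.0 * hinge_pos(X[:, 0], 0.3)                               # h_1 = (X_1 - 0.3)_+
            - 1.5 * hinge_neg(X[:, 0], 0.3)                               # h_2 = (0.3 - X_1)_+
            + 0.8 * hinge_pos(X[:, 0], 0.3) * hinge_pos(X[:, 1], 0.5))    # product term

X = np.random.default_rng(0).uniform(0.0, 1.0, size=(5, 2))
print(mars_model(X))
```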
This model typically overfits the data; therefore, a backward deletion procedure should be applied. At each stage, the term whose removal causes the smallest increase in the residual sum of squares is deleted from the model, producing an estimated best model \( \hat{f}_{\lambda} \) for each size (number of terms) λ. The MARS procedure uses generalized cross-validation (GCV) to estimate the optimal value of λ:
$$ \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{n} \left( y_i - \hat{f}_{\lambda}(x_i) \right)^2}{\left( 1 - M(\lambda)/n \right)^2} $$
(4)
where n represents the number of observations, yi represents the ith observation, \( \hat{f}_{\lambda}(x_i) \) represents the estimated value of yi, and M(λ) represents the number of effective parameters in the model.
The importance of the removed variable is measured by the increase in GCV values between the pruned model and overfitted model (Steinberg et al. 1999). The greater the increase in GCV is, the more important the removed variable.
The MARS method can also be used as a surrogate model. Shahsavani et al. (2010) showed that using a MARS surrogate model to replace the original dynamic model can provide acceptable estimates of the total sensitivity indices at a much lower cost.
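As a concrete illustration with a toy model (not the actual adjustable parameters of this study), the following sketch performs MARS-based screening with the py-earth package listed in the “Tools” section; setting feature_importance_type='gcv' requests the GCV-based importance ranking described above.

```python
# Sketch of MARS-based parameter screening with py-earth (toy data).
import numpy as np
from pyearth import Earth

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 5))     # 500 samples of 5 adjustable parameters
y = 3.0 * X[:, 0] + np.sin(2.0 * np.pi * X[:, 1]) + 0.1 * rng.normal(size=500)

# GCV-based importance: each input is scored by the increase in GCV (Eq. 4)
# when the terms involving it are pruned.
mars = Earth(max_degree=2, feature_importance_type='gcv')
mars.fit(X, y)

for name, score in zip(['p1', 'p2', 'p3', 'p4', 'p5'], mars.feature_importances_):
    print(f'{name}: {score:.3f}')
```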
Random forest
The random forest (RF) is a very efficient and increasingly popular machine-learning algorithm for both classification and regression problems that was introduced by Breiman (2001). RFs are a substantial modification of bagging (Breiman 1996): they construct multiple trees (i.e., a forest) using bootstrap sampling, and the decisions of the trees are averaged. The main difference between RFs and bagging is that an RF searches a randomized subset of the input variables to determine the split at each node, which is why the method is called a “random” forest. The basic principle of RFs is that a group of weak learners can come together to form a strong learner.
The random forest algorithm is as follows:
1. Extract a bootstrap sample Z* of size N from the training data.

2. For each bootstrap sample, grow a random forest tree by recursively repeating the following steps (a) to (c) for each terminal node of the tree until the minimum node size nmin is reached.

   (a) Randomly select m variables from the total p variables, where m ≪ p.

   (b) Among the m variables, pick the best variable/split point.

   (c) Split the node into two daughter nodes using the best split.

3. Predict new data by aggregating the predictions of all the trees in the forest.
Compared with other classification and regression techniques, RFs have unique advantages: they resist overfitting because of the law of large numbers, and they can be used to identify important factors. The importance of a variable can be measured by the total number of splits in which it is used: the more splits a variable is involved in, the more sensitive the variable is.
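As a concrete illustration with a toy model (not the configuration used in this study), the following sketch builds a random forest with scikit-learn's RandomForestRegressor, whose hyperparameters mirror the algorithm above (number of trees, m variables tried per split, minimum node size); its impurity-based feature_importances_ attribute is used here as a stand-in for the split-based importance measure just described.

```python
# Sketch of RF-based parameter screening with scikit-learn (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 5))     # 500 samples of 5 adjustable parameters
y = 3.0 * X[:, 0] + np.sin(2.0 * np.pi * X[:, 1]) + 0.1 * rng.normal(size=500)

rf = RandomForestRegressor(
    n_estimators=500,     # number of bootstrap trees in the forest
    max_features=2,       # m << p variables considered at each split
    min_samples_leaf=5,   # minimum node size n_min
    random_state=0,
)
rf.fit(X, y)
print(rf.feature_importances_)   # larger values indicate more sensitive parameters
```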
Sparse PCE-based Sobol’ method
The Sobol’ method (Sobol’ 1993) is a quantitative SA method based on the principle of variance decomposition and can be applied to nonlinear, nonmonotonic mathematical models. Its core idea is to decompose the total variance of the objective function into contributions from individual parameters and from interactions between parameters. Tang et al. (2006) compared PEST, RSA, and ANOVA with the Sobol’ method and concluded that the Sobol’ method is more robust than, and superior to, the other methods in both single-objective and multi-objective sensitivity analyses.
Suppose the problem can be written as y = f(X) = f(X1, ⋯, Xk), where y represents the objective function of the model output (e.g., the root mean square error between the simulated values and the default values), and X = (X1, ⋯, Xk) is the vector of the k model factors (e.g., parameters) that control the behavior of the model. Without loss of generality, each parameter Xi is assumed to vary within the range [0, 1]. Our purpose is to explore how much of the total variance D(y) in y can be explained by the variability of the factors in X. The Sobol’ method computes this by decomposing the function f(X) into terms of increasing dimensionality, where each successive dimension corresponds to interactions among a larger number of parameters.
$$ f\left({X}_1,{X}_2,\cdots {X}_k\right)={f}_0+\sum \limits_{i=1}^k{f}_i\left({X}_i\right)+\sum \limits_{1\le i<j\le k}{f}_{ij}\left({X}_i,{X}_j\right)+\cdots +{f}_{1,2,\cdots, k}\left({X}_1,\cdots, {X}_k\right) $$
(5)
where f0 is a constant equal to the expected value of f(X), fi(Xi) is a function of the ith parameter, fij(Xi, Xj) is a function of the ith and jth parameters, etc. The integral of each decomposed function (also called a summand) fi(Xi), fij(Xi, Xj), ⋯, f1, 2, ⋯, k(X1, ⋯, Xk) over any of its own variables is equal to zero:
$$ \int_0^1 f_{i_1, i_2, \cdots, i_s}\left(X_{i_1}, \cdots, X_{i_s}\right) dX_{i_k} = 0, \quad 1 \le k \le s $$
(6)
All the summands can be computed recursively as follows:
$$ \begin{array}{l} f_0 = \int_0^1 \cdots \int_0^1 f(\boldsymbol{X})\, d\boldsymbol{X}, \\ f_i(X_i) = \int_0^1 \cdots \int_0^1 f(\boldsymbol{X})\, d\boldsymbol{X}_{\sim i} - f_0, \\ f_{ij}(X_i, X_j) = \int_0^1 \cdots \int_0^1 f(\boldsymbol{X})\, d\boldsymbol{X}_{\sim ij} - f_0 - f_i(X_i) - f_j(X_j), \end{array} $$
(7)
The notation ∼ indicates that the corresponding parameters are excluded, e.g., X∼i = (X1, …, Xi − 1, Xi + 1, …, Xk).
The total variance of the function f(X) is defined as:
$$ D(Y)={\int}_0^1\cdots {\int}_0^1{f}^2\left(\boldsymbol{X}\right)d\boldsymbol{X}-{f}_0^2 $$
(8)
The contribution of a generic term \( f_{i_1, \cdots, i_s} \) (1 ≤ i1 < ⋯ < is ≤ k) to the total variance can be written as
$$ D_{i_1, \cdots, i_s} = \int_0^1 \cdots \int_0^1 f_{i_1, \cdots, i_s}^2\left(X_{i_1}, \cdots, X_{i_s}\right) dX_{i_1} \cdots dX_{i_s} $$
(9)
where \( D_{i_1, \cdots, i_s} \) denotes the partial variance corresponding to (i1, ⋯, is), and the integer s is called the order or the dimension of the index. On this basis, the total variance of the output variable can be decomposed into the sum of all partial variances:
$$ D(y) = \sum_{i=1}^{k} D_i + \sum_{1 \le i < j \le k} D_{ij} + \cdots + D_{1, 2, \cdots, k} $$
(10)
where Di represents the contribution of factor Xi to D(y) and Dij represents the contribution of the interaction between factors Xi and Xj. Similarly, D1, 2, ⋯k represents the contribution by the interaction of k factors. The Sobol’ sensitivity index of s factors is defined as
$$ {S}_{i_1,\cdots, {i}_s}=\frac{D_{i_1,\cdots {i}_s}}{D(Y)},1\le {i}_1<\cdots <{i}_s\le k $$
(11)
and the sum of all Sobol’ sensitivity indices equals 1:
$$ 1 = \sum_{i=1}^{k} S_i + \sum_{1 \le i < j \le k} S_{ij} + \cdots + S_{1, 2, \cdots, k} $$
(12)
In the Sobol’ method, Si = Di/D(Y) is the main effect (i.e., the first-order effect) of the ith variable, and Sij = Dij/D(Y) is the interaction effect (i.e., the second-order effect) of the ith and jth variables. STi = 1 − D∼i/D(Y) represents the total sensitivity of the ith variable, where D∼i represents the variance of all terms that do not involve the ith variable.
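For illustration (this toy function is not part of the study), consider f(X1, X2) = X1 + X1X2 with X1 and X2 independent and uniform on [0, 1]. The decomposition of Eq. (7) gives

$$ f_0 = \tfrac{3}{4}, \quad f_1(X_1) = \tfrac{3}{2} X_1 - \tfrac{3}{4}, \quad f_2(X_2) = \tfrac{1}{2} X_2 - \tfrac{1}{4}, \quad f_{12}(X_1, X_2) = \left( X_1 - \tfrac{1}{2} \right) \left( X_2 - \tfrac{1}{2} \right), $$

so that D1 = 27/144, D2 = 3/144, D12 = 1/144, and D(y) = 31/144. The Sobol’ indices are therefore S1 = 27/31, S2 = 3/31, and S12 = 1/31, and the total effects are ST1 = S1 + S12 = 28/31 and ST2 = S2 + S12 = 4/31.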
The total effect reflects a parameter’s overall contribution to the total variance. The total effect of a factor Xi is the sum of its first-order (main) effect and all higher-order effects involving Xi, including two-factor interaction effects and all higher-order interaction effects.
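As a concrete illustration with a toy function (not the dynamic model of this study), the following sketch estimates the main and total effects with the SALib implementation listed in the “Tools” section, using Saltelli sampling; the sample size and parameter names are arbitrary.

```python
# Sketch of Monte Carlo (Saltelli) estimation of Sobol' main and total effects.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    'num_vars': 3,
    'names': ['p1', 'p2', 'p3'],
    'bounds': [[0.0, 1.0]] * 3,
}

X = saltelli.sample(problem, 1024)            # N * (2k + 2) model evaluations
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.2 * X[:, 0] * X[:, 2]

res = sobol.analyze(problem, y)
print(res['S1'])   # main effects S_i
print(res['ST'])   # total effects S_Ti
```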
Traditionally, the Sobol’ indices are evaluated with Monte Carlo sampling, and the number of samples (equal to the number of model simulations) required for an accurate estimate is usually very large, which makes the Sobol’ method too costly for computationally expensive models. Using a surrogate model to replace the expensive dynamic model in the Sobol’ method can significantly reduce the computational cost. Sudret (2008) proposed a post-processing of polynomial chaos expansions (PCE) that obtains the Sobol’ sensitivity indices directly from the PCE coefficients. Compared with Monte Carlo sampling, this PCE-based way of computing the Sobol’ indices costs fewer computational resources and can obtain more accurate results.
Here is a brief introduction to the sparse PCE-based Sobol’ method. Consider a random vector with independent components X ∈ ℝk and a computational model Y = f(X) with finite variance; the polynomial chaos expansion of f(X) is defined as:
$$ Y=f\left(\boldsymbol{X}\right)=\sum \limits_{\boldsymbol{\alpha} \in {\mathbb{N}}^k}{\lambda}_{\boldsymbol{\alpha}}{\varPsi}_{\boldsymbol{\alpha}}\left(\boldsymbol{X}\right) $$
(13)
where the Ψα(X) are multivariate polynomials orthonormal with respect to the distribution of X, α ∈ ℕk is a vector of indices that identifies the components of the multivariate polynomial Ψα, and the corresponding λα ∈ ℝ are the coefficients of each orthonormal polynomial. Equation (13) is usually referred to as the polynomial chaos expansion (PCE) of Y. In realistic applications, a truncated polynomial chaos expansion is usually introduced to retain only a finite number of PCE terms:
$$ f\left(\boldsymbol{X}\right)\approx {f}^{PC}\left(\boldsymbol{X}\right)=\sum \limits_{\boldsymbol{\alpha} \in \mathcal{A}}{\lambda}_{\boldsymbol{\alpha}}{\varPsi}_{\boldsymbol{\alpha}}\left(\boldsymbol{X}\right) $$
(14)
In this equation, \( \mathcal{A} \) is the set of selected multi-indices of the multivariate polynomials; when all multi-indices up to a given degree are retained, the expansion is called the full polynomial chaos. The key to constructing a sparse polynomial chaos (SPC) is to determine the coefficient λα of each term. In this paper, we used the orthogonal matching pursuit (OMP) algorithm originally proposed by Pati et al. (1993) to determine the polynomial coefficients.
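A minimal sketch of constructing a sparse PCE with OMP is given below. It assumes independent inputs uniform on [−1, 1] (so the orthonormal basis consists of normalized Legendre polynomials), a total-degree truncation of the multi-index set, and uses scikit-learn's OrthogonalMatchingPursuit as the OMP solver; the toy model and the number of retained terms are illustrative.

```python
# Sketch of a sparse PCE fitted by orthogonal matching pursuit (toy problem).
import itertools
import numpy as np
from numpy.polynomial.legendre import legval
from sklearn.linear_model import OrthogonalMatchingPursuit

def legendre_orthonormal(x, degree):
    """Orthonormal Legendre polynomial of the given degree for the uniform density on [-1, 1]."""
    c = np.zeros(degree + 1)
    c[degree] = 1.0
    return legval(x, c) * np.sqrt(2 * degree + 1)

def pce_design_matrix(X, alphas):
    """Evaluate each multivariate basis polynomial Psi_alpha at every sample."""
    Psi = np.ones((X.shape[0], len(alphas)))
    for j, alpha in enumerate(alphas):
        for i, d in enumerate(alpha):
            if d > 0:
                Psi[:, j] *= legendre_orthonormal(X[:, i], d)
    return Psi

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))     # 200 samples of 3 parameters
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.2 * X[:, 0] * X[:, 2]

# Total-degree truncation of the multi-index set A in Eq. (14), up to degree 3.
max_degree = 3
alphas = [a for a in itertools.product(range(max_degree + 1), repeat=3)
          if sum(a) <= max_degree]

Psi = pce_design_matrix(X, alphas)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10, fit_intercept=False)
omp.fit(Psi, y)
coeffs = omp.coef_                            # the lambda_alpha in Eqs. (13)-(14)
```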
The Sobol’ indices can be computed from the polynomial coefficients λα directly as follows:
$$ \begin{array}{l} E\left(f(\boldsymbol{X})\right) \approx \lambda_0, \\ V\left(f(\boldsymbol{X})\right) \approx \sum_{\boldsymbol{\alpha} \in \mathcal{A},\ \boldsymbol{\alpha} \ne \boldsymbol{0}} \lambda_{\boldsymbol{\alpha}}^2, \\ S_i \approx \frac{1}{V\left(f(\boldsymbol{X})\right)} \sum_{\boldsymbol{\alpha} \in \mathcal{A}_{S_i}} \lambda_{\boldsymbol{\alpha}}^2 \quad \mathrm{with}\ \mathcal{A}_{S_i} = \left\{ \boldsymbol{\alpha} : \alpha_i > 0,\ \alpha_k = 0\ \mathrm{for}\ k \ne i \right\}, \\ S_{Ti} \approx \frac{1}{V\left(f(\boldsymbol{X})\right)} \sum_{\boldsymbol{\alpha} \in \mathcal{A}_{S_{Ti}}} \lambda_{\boldsymbol{\alpha}}^2 \quad \mathrm{with}\ \mathcal{A}_{S_{Ti}} = \left\{ \boldsymbol{\alpha} : \alpha_i > 0 \right\}, \end{array} $$
(15)
where \( \mathcal{A}_{S_i} \) is the set of multi-indices that involve only the ith factor, \( \mathcal{A}_{S_{Ti}} \) is the set of multi-indices that involve the ith factor and possibly others, and E(f(X)) and V(f(X)) are the mean and variance of f(X), respectively. The confidence intervals of the Sobol’ sensitivity indices can be estimated with the bootstrap method (Efron 1979).
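A minimal sketch of the post-processing in Eq. (15) is given below; the multi-index set and coefficients are assumed to come from a fitted sparse PCE (e.g., the OMP sketch above), and the values in the usage example are purely illustrative.

```python
# Sketch of Eq. (15): Sobol' indices computed directly from PCE coefficients.
import numpy as np

def pce_sobol_indices(alphas, coeffs):
    """Return (S, ST): main and total effects for each of the k input factors."""
    alphas = np.asarray(alphas)
    coeffs = np.asarray(coeffs)
    nonconst = alphas.sum(axis=1) > 0
    total_variance = np.sum(coeffs[nonconst] ** 2)        # V(f(X)) in Eq. (15)
    k = alphas.shape[1]
    S, ST = np.zeros(k), np.zeros(k)
    for i in range(k):
        others = np.delete(alphas, i, axis=1).sum(axis=1)
        only_i = (alphas[:, i] > 0) & (others == 0)       # set A_{S_i}
        has_i = alphas[:, i] > 0                          # set A_{S_Ti}
        S[i] = np.sum(coeffs[only_i] ** 2) / total_variance
        ST[i] = np.sum(coeffs[has_i] ** 2) / total_variance
    return S, ST

# Usage with an illustrative two-factor index set and coefficients.
alphas = [(0, 0), (1, 0), (0, 1), (1, 1)]
coeffs = [1.0, 0.8, 0.3, 0.1]
S, ST = pce_sobol_indices(alphas, coeffs)
```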
Tools
The Uncertainty Quantification Python Laboratory (UQ-PyL) (Wang et al. 2016) is a flexible software platform designed to quantify uncertainties in large, complex dynamical models. UQ-PyL integrates different kinds of UQ methods, including experimental design, statistical analysis, sensitivity analysis, surrogate modeling, and parameter optimization. In this study, we used the unreleased development version of UQ-PyL for the experimental design and sensitivity analysis. The MARS algorithm is available from the open-source package py-earth (https://github.com/scikit-learn-contrib/py-earth), the Sobol’ method is implemented in SALib (https://github.com/SALib/SALib), and the sparse PCE-based Sobol’ method we used comes from UQLab (https://www.uqlab.com/).