Abstract
The transverse mixing coefficient (TMC) is one of the most influential parameters in the two-dimensional simulation of water pollution, and increasing the accuracy of its estimation improves the modeling process. In the present study, a genetic algorithm (GA)-based support vector machine (SVM) was used to estimate the TMC in streams. The SVM has three principal parameters that must be tuned during the estimation procedure; the GA optimizes these three parameters automatically. The accuracy of the SVM and GA-SVM algorithms, along with previous models, was assessed for TMC estimation using a wide range of hydraulic and geometric data from field and laboratory experiments. According to the statistical analysis, the performance of these models in both straight and meandering streams was more accurate than that of the regression-based models. Sensitivity analysis showed that the accuracy of the GA-SVM algorithm in TMC estimation is significantly correlated with the number of input parameters: eliminating uncorrelated parameters and reducing the number of inputs reduces the complexity of the problem and improves TMC estimation by GA-SVM.
HIGHLIGHTS
Genetic algorithm (GA)-based support vector machine (SVM) was used to estimate TMC in streams.
Sensitivity analysis showed that the accuracy of the GA-SVM algorithm in TMC estimation is significantly correlated with the number of input parameters.
INTRODUCTION
Increasing the accuracy of modeling the release of pollution into streams improves the ability to control stream quality and thereby reduces environmental damage. The capability to estimate the transport of pollutants in streams and waterways has therefore always been a considerable issue in many industrial and environmental projects (Abderrezzak et al. 2015). After being discharged into a river, contaminants and effluents mix with the river water and are transported downstream (Seo & Cheong 1998). The effluent spreads vertically, transversely, and longitudinally by advective and dispersive transport processes. In a shallow stream, after the contamination is rapidly mixed throughout the depth, transport occurs in the longitudinal and transverse directions (Ahmad et al. 2011). A full cross-sectional mix is not achieved unless the pollutant travels long distances that are generally beyond the length of practical interest (Beltaos 1980). The length required for full cross-sectional mixing of contaminants is approximately 20 and 200 times the top width for a rough and a smooth flow, respectively (Fischer 1967). Transverse mixing plays an important role in determining the effect of contaminants under steady-state conditions, and it has an important effect in water quality management, especially in the case of point-source discharges or tributary inflows (Rutherford 1994; Boxall & Guymer 2003). According to Figure 1, the effluent mixing process in rivers is considered in three stages: (1) mixing near the discharge point due to initial momentum and flow buoyancy (between zones A and B); (2) transverse mixing due to turbulence (secondary turbulence transfer) and secondary flows (between zones B and C); and (3) dispersion due to longitudinal shear flow (after zone C) (Fischer et al. 1979).
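The governing equation itself did not survive extraction in this text; a standard depth-averaged, steady-state form of the transverse mixing equation (e.g. Rutherford 1994), which the original presumably resembles, is:

```latex
U \frac{\partial C}{\partial x} = \frac{\partial}{\partial z}\left( \varepsilon_z \frac{\partial C}{\partial z} \right)
```

where C is the depth-averaged concentration, U the longitudinal velocity, x and z the longitudinal and transverse coordinates, and εz the transverse mixing coefficient.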
The governing transverse mixing equation has been used in many studies (Krishnappan & Lau 1977; Lau & Krishnappan 1981; Demetracopoulos 1994; Ahmad 2008; Aghababaei et al. 2017; Huai et al. 2018; Zahiri & Nezaratian 2020). Owing to the complexity of the transverse mixing mechanism, further investigation of the role of the effective parameters is required (Aghababaei et al. 2017). Thus, predicting the transverse mixing coefficient (TMC) for known flow conditions in a stream, in order to account for the pollutant concentration at any location downstream of the injection site, is genuinely essential (Azamathulla & Ahmad 2012). Generally, there are three approaches to predicting the TMC in stream mixing. Empirical methods develop equations from hydraulic and geometric datasets of rivers and experimental studies to establish a relationship for the TMC, whereas theoretical methods use the concept of shear flow to derive the dispersion coefficient (Baek & Seo 2013). Moreover, many researchers have recently used powerful predictive tools to solve complex engineering problems; the significance of dispersion coefficients in water quality modeling and the complexity of the pollutant emission and mixing process have considerably increased the importance of these tools (Zahiri & Nezaratian 2020). Soft computing techniques such as the adaptive neuro-fuzzy inference system based on principal component analysis (ANFIS-based PCA), particle swarm optimization (PSO), artificial neural networks (ANN), gene expression programming (GEP), differential evolution (DE), the M5 decision tree, the support vector machine (SVM), and the adaptive neuro-fuzzy inference system (ANFIS) have been widely used to estimate the longitudinal dispersion coefficient in streams by Parsaei et al. (2018), Alizadeh et al. (2017), Antonopoulos et al. (2015), Sattar & Gharabaghi (2015), Li et al. (2013), Etemad-Shahidi & Taghipour (2012), Azamathulla & Wu (2011) and Riahi-Madvar et al. (2009).
Azamathulla & Ghani (2011), Azamathulla & Ahmad (2012), Aghababaei et al. (2017), and Zahiri & Nezaratian (2020) tried to predict the TMC accurately using the M5 decision tree, multivariate adaptive regression splines (MARS), particle swarm optimization (PSO), multiple linear regression (MLR), the genetic algorithm (GA), genetic programming for symbolic regression (GPSR), and GEP. The soft computing techniques used by these researchers show lower statistical errors and higher accuracy than empirical methods in TMC prediction (Zahiri & Nezaratian 2020). According to previous studies, there is a strong relationship between the TMC and channel parameters such as channel width, flow depth, shear velocity, friction factor, curvature and sinuosity (Fischer 1967; Beltaos 1979; Lau & Krishnappan 1981; Stefanovic & Stefan 2001; Boxall & Guymer 2003). Table 1 shows some of the most well-known equations proposed for calculating the TMC.
Reference | Formula |
---|---|
Fischer & Park (1967) | |
Yotsukura et al. (1970) | |
Chau (2000) | |
Ahmad (2007) | |
Jeon et al. (2007) | |
Azamathulla & Ahmad (2012) | and |
Aghababaei et al. (2017) (GPSR method) | |
Zahiri & Nezaratian (2020) (M5 method) | |
εz is the TMC (m²/s), H is the flow depth (m), U* is the bed shear velocity (m/s), W is the channel width (m), Sn is the sinuosity coefficient and Fr is the Froude number.
Each of the mentioned algorithms has strengths and weaknesses and may not be able to predict complex phenomena such as the TMC accurately on its own. Selecting several suitable meta-heuristic algorithms and using them simultaneously increases accuracy and decreases errors in estimating target values. Choosing a main algorithm together with an auxiliary algorithm that compensates for the main algorithm's weaknesses leads to a hybrid algorithm with higher performance. In previous investigations, several hybrid algorithms were used to estimate complex phenomena, and the ability of these algorithms was clearly demonstrated (Pourbasheer et al. 2009; Wang et al. 2013; Li & Kong 2014; Zhou et al. 2016). In this study, two common algorithms were combined into a hybrid algorithm: the support vector machine (SVM) as the main algorithm and the genetic algorithm (GA) as the auxiliary algorithm. Coupling GA with SVM allows the optimal values of SVM's adjustable parameters to be estimated in the shortest time and increases predictive accuracy. The purpose of this study is to develop a GA-SVM algorithm using 232 published data points and to compare its performance with previous models. In addition, a sensitivity analysis was performed on the developed model to determine the effect of the input parameters on TMC modeling.
MATERIALS AND METHODS
Data
In the present study, 232 data points (see Supplementary material) were collected from the technical literature (Yotsukura et al. 1970; Holley & Abraham 1973; Krishnappan & Lau 1977; Beltaos 1979; Rutherford 1994; Jeon et al. 2007; Baek & Seo 2008; Lee & Seo 2013). Of these, 183 and 49 data points were collected from straight and meandering streams, respectively. The dataset contains geometrical and hydraulic characteristics, including channel width, channel depth, average velocity, shear velocity, Froude number, sinuosity, and the TMC. Sinuosity was used to represent horizontal irregularities in meandering streams (Aghababaei et al. 2017). Table 2 presents a statistical summary of all variables.
Parameter | W | H | U | U* | W/H | U/U* | Fr | Sn | εz/HU* | εz |
---|---|---|---|---|---|---|---|---|---|---|
Min | 0.200 | 0.013 | 0.040 | 0.005 | 1.670 | 2.051 | 0.018 | 1.000 | 0.054 | 0.000034 |
Max | 320.000 | 5.250 | 1.750 | 0.163 | 287.500 | 28.571 | 0.971 | 3.330 | 2.400 | 0.215 |
Avg | 15.950 | 0.304 | 0.308 | 0.026 | 26.710 | 12.976 | 0.285 | 1.108 | 0.238 | 0.007 |
SD | 51.237 | 0.709 | 0.271 | 0.023 | 34.995 | 5.447 | 0.181 | 0.371 | 0.249 | 0.025 |
Skewness | 4.246 | 4.506 | 2.947 | 2.379 | 3.797 | 0.196 | 0.866 | 4.974 | 4.510 | 5.246 |
Based on Figure 2, there is no considerable correlation between the input variables; thus, the problems that could arise in the analysis from exaggerating the strength of the relations between variables are avoided (Sattar & Gharabaghi 2015). It should be noted that the averages of the parameters in the training and testing subsets are (13.36, 25.63, 0.29, 1.12, 0.25) and (11.91, 30.18, 0.27, 1.06, 0.20), respectively.
Support vector machine (SVM)
Vapnik (1995) proposed a nonlinear regression method called the support vector machine (SVM), which is applicable to pattern recognition, highly nonlinear classification, and regression problems. The SVM was developed to maximize prediction accuracy, that is, to minimize the difference between outputs and targets (Parsaie & Haghiabi 2017a, 2017b; Parsaie et al. 2019). For this purpose, the input parameters are mapped into a high-dimensional linear feature space by a nonlinear transformation to construct the optimal decision function. The dot product in the higher-dimensional feature space is replaced by a kernel function in the original space, and by finite-sample training the global optimal solution is obtained (Zhou et al. 2016). In the current study, SVM is used as the main algorithm for predicting the TMC, as briefly described below.
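As a concrete illustration, an ε-SVR regression with an RBF kernel can be sketched as follows. scikit-learn and the synthetic feature values are assumptions for illustration only, not the implementation or dataset used in this study:

```python
# Minimal sketch of epsilon-SVR with an RBF kernel, the SVM variant used for
# regression problems such as TMC estimation. scikit-learn is assumed purely
# for illustration; the feature values below are synthetic, not the paper's data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 4))   # e.g. four dimensionless inputs
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.01 * rng.normal(size=50)

# C, gamma and epsilon are the three adjustable SVM parameters discussed later.
model = SVR(kernel="rbf", C=3.0, gamma=0.15, epsilon=0.05)
model.fit(X, y)
pred = model.predict(X)
print(round(float(np.mean(np.abs(pred - y))), 3))
```

The three constructor arguments correspond to the three adjustable parameters that the GA tunes in the hybrid model.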
Genetic algorithm (GA)
Drawing on the mechanisms of genetics and Darwin's principles of natural selection, John Holland proposed a heuristic search method in 1975 and called it the genetic algorithm (GA). The method was named after the biological processes of inheritance, mutation, natural selection, and the genetic crossover that occurs when parents mate to produce offspring (Goldberg 1989). Technically, there are four differences between the structure of the GA and other traditional optimization algorithms (Goldberg 1989):
The GA typically uses a coding of the decision variable set instead of the decision variables themselves.
The GA searches from a population of decision variable sets instead of a single decision variable set.
The GA uses the objective function itself instead of the derivative information.
The GA uses probabilistic, rather than deterministic, search rules.
In the last decade, GA has successfully been used to solve some problems such as fitting nonlinear regression to data, optimizing simulation models, solving systems of nonlinear equations, and machine learning (Deb 1998). Generally, a GA has five major components to solve a particular problem that are briefly described below:
1. At first, n chromosomes are generated randomly to form a population; these are the candidate solutions to the problem.
2. A fitness function evaluates the fitness of each chromosome. In the present study, the efficiency coefficient (EC) was used as the fitness function, written as EC = 1 − [Σ (Oi − Pi)²] / [Σ (Oi − Ō)²], with the sums taken over i = 1, …, N, where N represents the total number of testing data, Pi is the predicted value, Oi is the observed value, and Ō is the mean of the observed values.
3. The following steps are repeated until n offspring have been created:
   (a) Selection: This operator selects the best chromosomes in pairs from the population to act as parents and reproduce two offspring. Fitter chromosomes have a higher chance of being selected.
   (b) Crossover: This operator randomly chooses a locus between a pair of chromosomes to form two offspring.
   (c) Mutation: This operator creates new chromosomes by randomly flipping some of the bits in a chromosome.
4. The current population is replaced with the new population.
5. If the stopping condition is satisfied, the best solution in the current population is returned; otherwise, the procedure returns to step 2.
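The five steps above can be sketched in a few lines of code. The fitness function, the real-valued genes, and all settings below are illustrative assumptions, not the configuration of Table 3:

```python
# A minimal sketch of the five GA steps above, maximizing a toy fitness
# function f(x) = -(x - 3)^2 over one-gene real-valued chromosomes.
import random

random.seed(42)

def fitness(x):
    return -(x - 3.0) ** 2

def run_ga(pop_size=30, generations=40, p_cross=0.8, p_mut=0.1):
    # Step 1: random initial population of candidate solutions
    pop = [random.uniform(-10.0, 10.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate the fitness of each chromosome
        scored = sorted(pop, key=fitness, reverse=True)
        new_pop = scored[:2]                            # elitism: keep the two best
        # Step 3: selection, crossover, mutation until pop_size offspring exist
        while len(new_pop) < pop_size:
            a, b = random.sample(scored[: pop_size // 2], 2)  # select fitter parents
            child = (a + b) / 2 if random.random() < p_cross else a  # crossover
            if random.random() < p_mut:
                child += random.gauss(0.0, 1.0)         # Gaussian mutation
            new_pop.append(child)
        pop = new_pop                                   # Step 4: replace population
    # Step 5: return the best solution in the final population
    return max(pop, key=fitness)

best = run_ga()
print(round(best, 2))
```

With elitism, the best fitness never decreases from one generation to the next, which is why the loop converges toward the optimum at x = 3.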
The applied GA method settings in the present study are shown in Table 3.
Parameter | Value |
---|---|
Population size | 250 |
Number of generations | 10 |
Elitism | 12 |
Crossover probability | 0.8 |
Mutation probability | 0.1 |
Crossover function | Scatter |
Mutation function | Gaussian |
Genetic algorithm-based support vector machine
In this study, the training data (input and target parameters) are first presented to the GA-SVM algorithm. GA then randomly generates an initial population of the unknown SVM parameters (C, γ, and ε) in order to determine the optimal values that give the best prediction with the lowest error and the highest accuracy. The fitness function examines the performance of each model. A secondary population of SVM parameters is created using the GA operators (selection, crossover, and mutation) and introduced to the SVM algorithm again. This cycle continues until the value of the fitness function reaches the stopping conditions of the algorithm, so the model outputs are expected to move closer to the target values in each cycle. In the GA-SVM algorithm, the two algorithms operate separately but assist each other: SVM starts modeling with the random parameters generated by GA, and GA continues the procedure until the optimal values of the SVM parameters are obtained. In this method, the GA tries to estimate the optimal combination of the three parameters (C, γ, and ε) in each cycle. C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error: low values of C place insufficient stress on fitting the training data, while high values of C make the algorithm over-fit the training data (Noori et al. 2011). Nevertheless, according to Wang et al. (2003), the prediction error is rarely influenced by C. γ denotes the width of the kernel function; an RBF kernel with a large γ allows the support vectors to have a strong impact over a larger area. The type of noise present in the data determines the optimal value of ε, which is usually unknown.
There is also a practical consideration of the number of resulting support vectors, even if enough knowledge of the noise is available to select an optimal value of ε (Liu et al. 2006). In the GA-SVM hybrid algorithm, GA automatically searches for the mentioned SVM parameters and provides their optimal values, whereas in the standalone SVM algorithm the optimal parameter values were determined by a trial-and-error process: grid search combined with cross-validation, as described by Hsu et al. (2010), was used to find these three parameters. In ν-fold cross-validation, the training set is divided into ν subsets of equal size, and each subset is tested in turn using the model trained on the remaining ν − 1 subsets. Each instance of the whole training set is thus estimated once, and the cross-validation accuracy is the percentage of correctly predicted data. The general flowchart of GA-SVM is illustrated in Figure 4.
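The ν-fold cross-validation mechanics can be sketched as follows, with a trivial mean predictor standing in for the trained SVM; the predictor and the data are illustrative assumptions:

```python
# Mechanics of nu-fold cross-validation as described above: each fold is held
# out once and scored against a model fitted on the remaining nu-1 folds.
# A mean predictor stands in for the SVM purely to keep the sketch short.
import numpy as np

def nu_fold_cv(y, nu=5):
    y = np.asarray(y, dtype=float)
    folds = np.array_split(np.arange(len(y)), nu)
    errors = []
    for k in range(nu):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(nu) if j != k])
        prediction = y[train_idx].mean()      # stand-in for SVM trained on nu-1 folds
        errors.append(np.mean(np.abs(y[test_idx] - prediction)))
    return float(np.mean(errors))             # average held-out error

rng = np.random.default_rng(1)
score = nu_fold_cv(rng.normal(loc=2.0, scale=0.1, size=100), nu=5)
print(round(score, 3))
```

In the actual grid search, this held-out score would be computed for every candidate (C, γ, ε) combination and the best-scoring combination kept.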
In the present study, SVM and GA-SVM were applied using the RBF kernel function and the input variables. Table 2 shows that all parameters used in this study have right-skewed distributions. Moreover, according to Figure 5, there is an abundance of outliers in the target and input parameters, except U/U* and Fr. Observations that are uncommon and do not conform to the pattern of the majority of the data are called outliers (Rousseeuw & Van Zomeren 1990). The existence of outliers can increase error rates and reduce prediction accuracy; it can also lead to considerable distortion of statistical estimates when using either parametric or nonparametric tests (Zimmerman 1994, 1995, 1998). One of the simplest ways to tackle this problem is a logarithmic transformation of the parameters, individually or collectively (Hubert & Van der Veeken 2008). Therefore, to reduce the negative effects of skewness and outliers on modeling, the whole dataset was transformed to a logarithmic scale, and the logarithmic parameters were used for modeling.
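The effect of the logarithmic transformation on a right-skewed variable can be illustrated as follows; the lognormal sample is synthetic, not the study's data:

```python
# Sketch of the log transformation applied here to reduce right-skewness and
# outlier influence before modeling.
import numpy as np

def skewness(x):
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # strongly right-skewed, like Table 2
logged = np.log10(raw)                               # transformed scale used for modeling

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The transformed sample is approximately symmetric, so the extreme values no longer dominate the fit.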
Model evaluation
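The text defining the evaluation indexes did not survive in this extraction; judging from Table 5, they appear to be the discrepancy ratio (DR), the accuracy (the percentage of estimates with −0.15 < DR < 0.15), MAE, and RMSE. A sketch under the common assumption that DR = log10(predicted/observed):

```python
# Hedged sketch of the evaluation indexes implied by Table 5. The DR
# definition below (log10 of predicted over observed) is an assumption based
# on common usage; accuracy counts estimates with -0.15 < DR < 0.15.
import numpy as np

def evaluate(observed, predicted):
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    dr = np.log10(predicted / observed)
    accuracy = 100.0 * np.mean((dr > -0.15) & (dr < 0.15))
    mae = float(np.mean(np.abs(predicted - observed)))
    rmse = float(np.sqrt(np.mean((predicted - observed) ** 2)))
    return accuracy, mae, rmse

# Tiny hypothetical example: three of four estimates fall inside the DR band.
obs = np.array([0.010, 0.020, 0.050, 0.100])
pred = np.array([0.011, 0.019, 0.080, 0.095])
acc, mae, rmse = evaluate(obs, pred)
print(acc, round(mae, 4), round(rmse, 4))
```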
RESULTS AND DISCUSSION
As mentioned before, estimating the TMC with SVM first requires finding the optimal values of its three adjustable parameters (C, γ, and ε). During the grid search, all combinations of (C, γ, ε) were tested in each cross-validation routine, with each parameter ranging from 0 to 120. Finally, the optimum values of the three parameters were determined using both the GA and grid search algorithms; these values are presented in Table 4. According to Table 4, although both algorithms estimate parameter C to be approximately the same, their estimates differ for the other two parameters. It should be noted that GA does not estimate the optimal value of each parameter separately; it estimates only the optimal combination of the three parameters.
Models | Method | C | γ | ε |
---|---|---|---|---|
GA-SVM | GA | 3.01 | 0.15 | 0.47 |
SVM | Grid Search | 3.00 | 0.01 | 1.00 |
The performances of SVM, GA-SVM, and the previous methods in TMC estimation by using the mentioned statistical indexes are presented in Table 5.
Models | (DR < −0.15) | (−0.15 < DR < 0) | (0 < DR < 0.15) | (0.15 < DR) | Accuracy% | MAE | RMSE |
---|---|---|---|---|---|---|---|
Fischer & Park (1967) | 15.086 | 9.052 | 19.397 | 56.466 | 28.448 | 0.228 | 0.270 |
Yotsukura et al. (1970) | 2.155 | 1.724 | 6.466 | 89.655 | 8.190 | 0.588 | 0.626 |
Chau (2000) | 19.397 | 12.931 | 52.586 | 15.086 | 65.517 | 0.180 | 0.255 |
Ahmad (2007) | 25.431 | 28.017 | 41.810 | 4.741 | 69.828 | 0.169 | 0.273 |
Jeon et al. (2007) | 12.931 | 13.362 | 31.034 | 42.672 | 44.397 | 0.188 | 0.233 |
Azamathulla & Ahmad (2012) | 31.034 | 31.466 | 35.345 | 2.155 | 66.810 | 0.180 | 0.287 |
Aghababaei et al. (2017) | 12.069 | 37.931 | 42.672 | 7.328 | 80.603 | 0.096 | 0.148 |
Zahiri & Nezaratian (2020) | 11.638 | 31.466 | 44.397 | 12.500 | 75.862 | 0.113 | 0.149 |
GA-SVM (Train) | 5.747 | 42.529 | 50.000 | 1.724 | 92.529 | 0.066 | 0.107 |
GA-SVM (Test) | 10.345 | 32.759 | 50.000 | 6.897 | 82.759 | 0.097 | 0.139 |
SVM (Train) | 5.747 | 42.529 | 48.851 | 2.874 | 91.379 | 0.044 | 0.096 |
SVM (Test) | 12.069 | 32.759 | 48.276 | 6.897 | 81.034 | 0.097 | 0.152 |
Along with the MAE, RMSE, and accuracy indexes, the balance between overestimated and underestimated values is another important point in analyzing model performance. According to Table 5, among the previous regression models, those of Yotsukura et al. (1970) and Fischer & Park (1967) had the lowest performance in estimating the TMC, with accuracies of 8% and 28.5%, respectively. The models of Aghababaei et al. (2017) and Zahiri & Nezaratian (2020) estimated the TMC accurately: the model of Aghababaei et al. (2017), based on the GPSR method, with an accuracy of 80% and RMSE and MAE values of 0.148 and 0.096, respectively, and the simple data-driven model proposed by Zahiri & Nezaratian (2020), with a relatively good accuracy (75.8%) and a balance between overestimated and underestimated values, were the most accurate regression-based models available for estimating this coefficient. Both the GA-SVM and SVM algorithms had genuinely accurate and relatively similar performances; in the testing stage, both had the lowest error rates and the highest accuracy compared with the previous regression-based models. It should also be noted that, although both models were based on the SVM algorithm, GA-SVM slightly improved the accuracy of TMC estimation over SVM, by 1.15% and 1.7% in the training and testing stages, respectively. On the other hand, the grid search method is more time-consuming than GA, which is why the GA-SVM model was chosen for estimating the TMC in this study. A comparison of the DR values of all expressions, along with the developed SVM and GA-SVM models, is shown in Figure 7. In addition, Figure 8 shows the performance of the developed SVM and GA-SVM in estimating the TMC in the training and testing stages.
Based on Figure 7, the superiority of the GA-SVM and SVM performance is obvious: both models show lower overestimation and underestimation than the models of Aghababaei et al. (2017) and Zahiri & Nezaratian (2020). In addition, Figure 8 shows the estimation accuracy of the SVM and GA-SVM models in the training and testing stages separately. The dataset used in this study included characteristics of straight and meandering streams. According to Table 6, the performance of both SVM and GA-SVM in both straight and meandering streams was more accurate than that of the regression-based models, and all models estimated the TMC better in straight streams than in meandering ones.
Models | Accuracy% (Straight) | MAE (Straight) | RMSE (Straight) | Accuracy% (Meandering) | MAE (Meandering) | RMSE (Meandering) |
---|---|---|---|---|---|---|
Aghababaei et al. (2017) | 85.246 | 0.082 | 0.124 | 63.265 | 0.150 | 0.216 |
Zahiri & Nezaratian (2020) | 86.339 | 0.089 | 0.115 | 36.735 | 0.200 | 0.235 |
GA-SVM | 93.443 | 0.063 | 0.099 | 77.551 | 0.113 | 0.164 |
SVM | 91.803 | 0.049 | 0.098 | 77.551 | 0.083 | 0.155 |
Sensitivity analysis
Sensitivity analysis helps researchers determine which parameters have the greatest effect on reducing output uncertainty, and which are negligible and can be eliminated from the final model (Nezaratian et al. 2018). In this study, a sensitivity analysis was applied to determine the effect of each parameter on the performance of GA-SVM, the most accurate model for TMC estimation. Five scenarios of input parameter combinations were introduced to the GA-SVM algorithm. Table 7 presents, for each scenario, the combination of inputs, the absent parameters, the SVM parameters, and the performance in the testing stage.
Scenario | Inputs | Absent | Parameters (C, γ, ε) | Accuracy% | MAE | RMSE | ΔAccuracy% |
---|---|---|---|---|---|---|---|
1 | U/U*, Fr, Sn | W/H | 7.75, 0.11, 0.30 | 84.483 | 0.064 | 0.110 | 1.725 |
2 | W/H, Fr, Sn | U/U* | 5.47, 0.27, 0.20 | 86.207 | 0.074 | 0.131 | 3.448 |
3 | W/H, U/U*, Sn | Fr | 4.38, 0.19, 0.25 | 89.655 | 0.062 | 0.117 | 6.896 |
4 | W/H, U/U*, Fr | Sn | 2.33, 0.33, 1.59 | 81.034 | 0.087 | 0.124 | −1.725 |
5 | W/H, Sn | U/U*, Fr | 3.50, 0.47, 0.67 | 91.379 | 0.071 | 0.137 | 8.620 |
As presented in Table 7, the effect of eliminating each input parameter on the accuracy of the final GA-SVM model was determined. In the table, ΔAccuracy% expresses the difference between the final accuracy of each scenario and the overall accuracy in the testing stage. It should be noted that this method depends significantly on the mathematical and theoretical structure of GA-SVM and may not identify the most effective parameter on the TMC itself; nevertheless, Table 7 indicates, to some extent, the effect of each input parameter on TMC estimation. The input combination in scenario 5 was based on Figure 2: W/H and Sn have the highest correlation with the dimensionless TMC, while U/U* and Fr have the lowest. Therefore, scenario 5 was used to measure the impact of removing the least correlated parameters on modeling the TMC with GA-SVM. According to Table 7, eliminating W/H from the input parameters in scenario 1 increases the accuracy by 1.725%; in scenario 2, where U/U* was removed instead, the accuracy improved by 3.448%. Using the same analysis, scenario 3 suggests that Fr is the least effective parameter in TMC estimation by GA-SVM, and scenario 4 indicates that Sn is the most effective parameter in the modeling process. In scenario 5, only the inputs with a correlation coefficient above 0 were used, so U/U* and Fr were eliminated; the result was a significant improvement in the final model, increasing the modeling accuracy by 8.62%. Table 7 thus demonstrates that reducing the number of input variables weakly correlated with the target improved the performance of the final GA-SVM model: eliminating them decreases the complexity of the modeling process and increases the accuracy.
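The leave-one-input-out procedure behind Table 7 can be sketched as follows. An ordinary least-squares fit stands in for GA-SVM, and the feature names, weights, and data are illustrative assumptions:

```python
# Sketch of the leave-one-input-out sensitivity analysis used in Table 7:
# refit the model with each input removed and compare held-out error against
# the full model. A linear least-squares fit stands in for GA-SVM, and the
# data are synthetic, constructed so that W/H and Sn dominate.
import numpy as np

rng = np.random.default_rng(3)
names = ["W/H", "U/U*", "Fr", "Sn"]
X = rng.normal(size=(120, 4))
y = 0.8 * X[:, 0] + 0.6 * X[:, 3] + 0.05 * rng.normal(size=120)

def cv_rmse(X, y):
    # simple holdout: fit on the first 80 rows, score on the rest
    coef, *_ = np.linalg.lstsq(X[:80], y[:80], rcond=None)
    resid = X[80:] @ coef - y[80:]
    return float(np.sqrt(np.mean(resid ** 2)))

base = cv_rmse(X, y)
for i, name in enumerate(names):
    reduced = np.delete(X, i, axis=1)
    print(name, round(cv_rmse(reduced, y) - base, 3))   # positive = removal hurts
```

Removing an influential input (here W/H or Sn) raises the held-out error sharply, while removing an irrelevant one (here Fr) leaves it essentially unchanged, mirroring the sign pattern of ΔAccuracy% in Table 7.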
This finding agrees with the results of Zahiri & Nezaratian (2020) and Jeon et al. (2007) concerning the most influential parameters in estimating the TMC.
CONCLUSION
In this study, SVM and GA-SVM algorithms were developed to estimate the transverse mixing coefficient, which plays an important role in modeling pollutant release into streams. Three statistical indexes (accuracy, RMSE, and MAE) were used to assess the performance of the different models. The results showed the superiority of the proposed model over the well-known regression-based models; among the latter, the two models proposed by Aghababaei et al. (2017) and Zahiri & Nezaratian (2020) had the highest accuracy in estimating the TMC, respectively. Dividing the dataset into two groups (straight and meandering streams) showed that SVM and GA-SVM remain more reliable than the previous models. The grid search method used to develop the standalone SVM algorithm was much more time-consuming than the GA; therefore, the GA-SVM model was chosen as the best model for estimating the TMC in streams. A sensitivity analysis was then performed to determine the most effective input parameters in estimating the TMC by GA-SVM. Based on the sensitivity analysis, U/U* and Fr had the least impact on GA-SVM performance in estimating the TMC, and eliminating these two parameters improved the accuracy of the estimation.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories (https://data.mendeley.com/datasets/2mm7jmp2g5/1).