
Characterization of unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity is a complex problem. In this study, an alternative to the previously proposed methodologies is developed to increase the efficiency and accuracy of source characterization. This methodology, Adaptive Surrogate Modeling Based Optimization (ASMBO), uses the capabilities of the Self Organizing Map (SOM) algorithm to design surrogate models and adaptive surrogate models for source characterization. Its most important advantage is that it can be applied directly to groundwater contaminant source characterization without the necessity of utilizing a linked simulation optimization model. Validation of the SOM based surrogate models and SOM based adaptive surrogate models demonstrates that the quantity and quality of the initial samples, as well as the designed monitoring locations, have a crucial role in the accuracy of the solutions. The performance of the proposed methodology is evaluated using error free and erroneous concentration measurement data. The results demonstrate that the developed methodology could approximate groundwater flow and transport simulation models, and substitute the optimization model for characterization of unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity.

Groundwater has a fundamental role in human life as one of the main renewable sources of fresh water. Unfortunately, in recent decades, because of increasing anthropogenic activities and improper management worldwide, groundwater has been exposed to several kinds of pollutants, such as seepage from chemical and petrochemical infrastructure, waste water collection systems, and industrial, mining and agricultural fields. Groundwater contamination usually remains undetected for a long time and is often detected accidentally, through changes in the quality of regional surface water or through chemical analysis of water collected from drinking water wells. Therefore, identifying the unknown characteristics of these contaminant sources and remediating the contaminated groundwater are necessary. However, identifying unknown groundwater contaminant source characteristics (contaminant magnitudes, locations and release times) is usually time consuming and inaccurate because of the uncertainties in the available hydrogeologic information and the sparsity of measurement data. Also, the solutions may be non-unique because of their high sensitivity to the monitoring data and model parameters. The methodologies proposed earlier to identify unknown groundwater contaminant source characteristics can be classified into two major groups: methods based on statistical estimation, and methods based on optimization approaches. An extensive literature review of these methodologies can be found in [

Therefore, Surrogate Modeling Based Optimization (SMBO) methodologies have been proposed to reduce these enormous computing costs and time associated with repeated runs of the numerical simulation models within the optimization algorithm. Surrogate models based on ANN, GA, Kriging, and regression techniques have been proposed as approximate simulators of the physical processes [

In this study, the numerical simulation model MODFLOW is utilized to simulate the groundwater flow process in a contaminated aquifer. The governing equation in this numerical simulation model can be represented by Equation (1). This equation describes three-dimensional movement of groundwater in non-equilibrium, anisotropic and heterogeneous conditions [
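For reference, the standard form of the groundwater flow equation given in the MODFLOW documentation is written out here. K_{xx}, K_{yy} and K_{zz} are taken to be the hydraulic conductivities along the x, y and z coordinate axes (L∙T^{−1}); these three symbols follow the MODFLOW documentation and are an assumption with respect to the definitions listed in this section.

$$
\frac{\partial}{\partial x}\left(K_{xx}\frac{\partial h}{\partial x}\right)
+\frac{\partial}{\partial y}\left(K_{yy}\frac{\partial h}{\partial y}\right)
+\frac{\partial}{\partial z}\left(K_{zz}\frac{\partial h}{\partial z}\right)
+W = S_{S}\frac{\partial h}{\partial t} \qquad (1)
$$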

where:

h is the potentiometric head (L);

W is a volumetric flux per unit volume representing sources and sinks of the aquifer; a negative value represents withdrawal from the groundwater system and a positive value represents inflow to it (T^{−1});

S_{S} is the specific storage of the porous media (L^{−1});

t is time (T).

Moreover, MT3DMS is utilized to simulate the three-dimensional transport of contaminants in groundwater. The governing equation of MT3DMS can be described by Equation (2), which is a partial differential equation that considers the fate and transport of contaminants of species k in a 3-D, transient groundwater flow system [
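For reference, the general transport equation in the form given in the MT3DMS documentation is written out here. θ is taken to be the porosity of the subsurface medium (dimensionless) and x_{i} the distance along the respective Cartesian coordinate axis (L); both follow the MT3DMS documentation and are assumptions with respect to the definitions listed in this section.

$$
\frac{\partial\left(\theta C^{k}\right)}{\partial t}
=\frac{\partial}{\partial x_{i}}\left(\theta D_{ij}\frac{\partial C^{k}}{\partial x_{j}}\right)
-\frac{\partial}{\partial x_{i}}\left(\theta v_{i} C^{k}\right)
+q_{s} C_{s}^{k}+\sum R_{n} \qquad (2)
$$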

where

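C^{k} is the dissolved concentration of species k, M∙L^{−3};

t is time;

D_{ij} is the hydrodynamic dispersion coefficient tensor, L^{2}T^{−1};

v_{i} is the seepage or linear pore water velocity, L∙T^{−1};

q_{s} is the volumetric flow rate per unit volume of aquifer representing fluid sources (positive) and sinks (negative), T^{−1};

C_{s}^{k} is the concentration of the source or sink flux for species k, M∙L^{−3}; and

ΣR_{n} is the chemical reaction term, M∙L^{−3}∙T^{−1}.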

In this equation, advection, dispersion and chemical reaction of contaminants in groundwater are considered. Solving this equation requires the seepage velocity v_{i}, which is related to the Darcy flux q_{i} through the relationship v_{i} = q_{i}/θ, where θ is the porosity.

The Self Organizing Map (SOM) is an algorithm introduced by Kohonen to visualize multidimensional data. This algorithm maps complex, non-linear, multidimensional statistical data onto a low-dimensional, usually two-dimensional, display [

The SOM algorithm consists of a set of processing units, “neurons”, which are commonly arranged in a 2-dimensional rectangular or hexagonal grid. Each neuron has a location on the grid and a weight vector that connects the input to the output; the weights start from random initial values and are updated over several iterations until a stable map is reached. In other words, this algorithm tries to cluster training data based on similarity and topology without any external supervision [

1) Initialization: in this step, it is assumed that the set of input data with N units is represented by X:

2) Competition: for each input pattern Xi, the output neurons compete to declare the winner neuron. The winner neuron or Best Matching Unit (BMU) is the closest neuron or most similar one to the input vector. The discriminant function used for this step can be defined by Equation (3) which is a squared Euclidean distance between the input vector X and weight vector

3) Cooperation: according to the results of neurobiological studies, there is a lateral interaction between the winning neuron and a set of excited neighbouring neurons. This interaction decays with distance. Therefore, the winning neuron and its topological neighbours update their weights according to Equation (4) and are moved to decrease their distance from the input units.

where

4) Adaptation: the excited neurons decrease their discriminant function values to reach an appropriate alignment to the input pattern. For this step, the process repeats steps 2 to 4 until the feature map stops changing.
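The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this study: the rectangular grid, the Gaussian neighbourhood, the exponential decay of the learning rate and neighbourhood width, and all function and parameter names (`train_som`, `quantization_error`, `lr0`, `sigma0`) are assumptions made for the example.

```python
import numpy as np

def train_som(data, rows=10, cols=10, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM with a Gaussian neighbourhood (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Step 1 (initialization): random weights within the range of the data.
    lo, hi = data.min(axis=0), data.max(axis=0)
    weights = rng.uniform(lo, hi, size=(rows * cols, n_features))
    # Grid coordinates of every neuron, used for the lateral interaction.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = sigma0 if sigma0 is not None else max(rows, cols) / 2.0
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        # Step 2 (competition): the BMU minimises the squared Euclidean distance.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Step 3 (cooperation): neighbourhood strength decays with grid distance.
        frac = t / n_iter
        lr = lr0 * np.exp(-3.0 * frac)        # assumed decay schedule
        sigma = sigma0 * np.exp(-3.0 * frac)  # assumed decay schedule
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Step 4 (adaptation): move the BMU and its neighbours towards the input.
        weights += lr * h[:, None] * (x - weights)
    return weights

def quantization_error(data, weights):
    """Mean Euclidean distance between each sample and its BMU."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return float(np.sqrt(d2.min(axis=1)).mean())

# Toy usage with random three-dimensional data.
toy = np.random.default_rng(1).random((200, 3))
som_weights = train_som(toy, rows=5, cols=5, n_iter=500)
```

A fixed number of iterations is used here in place of the "until the feature map stops changing" criterion; in practice the quantization error can be monitored to decide when to stop.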

The SOM algorithm maps nonlinear relationships in high dimensional data onto a low dimensional display while preserving the main characteristics of the input data. This algorithm is capable not only of clustering and visualizing high dimensional data, but also of generalization. In other words, SOM can interpolate between the initial data and predict missing values of the system’s vectors [

and visualization.

Surrogate models function essentially by developing a relationship between the inputs and outputs of the system based on training of the model. If this model is constructed accurately, the approximation can mimic the behavior of more sophisticated simulation models at substantially reduced computational time [

1) Initial sampling: first, the main variables of the defined system as per their degree of importance, according to the preliminary experiments are chosen [

2) Generating training data: the numerical simulation models are solved to generate solution results for the initial samples randomly generated in the previous stage. In this study, the groundwater flow and transport simulation models MODFLOW and MT3DMS (within GMS 7) are solved for randomly generated source fluxes.

3) Construction of surrogate model: in this stage, the Self-Organizing Map (SOM) is utilized as the surrogate model type to represent the response surface of the simulation model input-output values. The other main issue in this stage is how the selected variables are used to design the SOM based surrogate model.

4) Testing and validation: this stage evaluates the potential applicability of the surrogate models. The new randomly generated sample sets that were not used in the training process are utilized in this stage. The results are applicable for modification of the surrogate model type and its design. The performance of the SOM based surrogate model is evaluated for two conditions: first, it is assumed that the contaminant concentration values at specific time and locations are known and the corresponding contaminant source fluxes at specified potential locations at specific time are considered as unknown variables to be estimated. Second, the constructed SOM based surrogate model performance is also tested by estimating spatial and temporal concentration values at specified time and locations, assuming contaminant sources are known.

In this stage, the BMU which has similar definition (Equation (3)) as the implicit objective function of source identification problem is utilized to characterize unknown contaminant sources of testing sample sets as an inverse problem. The implicit objective function of source identification problem is defined to minimize the difference between estimated contaminant concentration values and observed contaminant concentration values at specific monitoring locations at specific time. The main constraints of optimization model are groundwater flow and transport simulation models. In this proposed methodology, the SOM based surrogate models represent approximate simulation of the physical processes. In other words, the obtained BMU of the SOM based surrogate model is utilized to find the unknown characteristics (magnitude, location and duration) of potential contaminant sources, hence eliminating the necessity of using any complex and explicit optimization model.

5) SOM based surrogate model/stage 3: if the solution results are acceptable, the SOM based surrogate model is selected and is ready for characterizing unknown contaminant sources as an inverse problem by utilizing the BMU; otherwise, return to stage 3 and change the design of the surrogate model.

6) Adaptive surrogate model: in this stage, to improve the SOM based surrogate model results, the adaptive sampling strategy is applied. There are several adaptive sampling methods such as: Maximizing Expected Improvement (MEI), Maximizing the Probability of Improvement (MPI) and Minimizing a Statistical Lower Bound (MSL).

All three of these strategies direct the algorithm back to the areas where the sample points are located. However, in this study, instead of the mentioned strategies, new samples based on the results obtained from the SOM based surrogate model are added to the initial sample sets. This essentially means that additional training patterns are generated utilizing the latest source characterization estimates. The model is then re-trained to increase the accuracy of the source identification results.
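The way the BMU takes the place of an explicit optimization model in stages 4 and 5 above can be illustrated with a small sketch: the squared Euclidean distance is evaluated only over the known components of the input vector (the measured concentrations), and the unknown components (the source fluxes) are read from the weight vector of the winning neuron. The function name `estimate_missing`, the use of NaN to mark unknown entries, and the three-neuron toy codebook are assumptions made for the example; they are not taken from the study.

```python
import numpy as np

def estimate_missing(weights, x_partial):
    """Fill the NaN entries of x_partial from the BMU found on the known entries."""
    known = ~np.isnan(x_partial)
    # Squared Euclidean distance restricted to the known components.
    d2 = ((weights[:, known] - x_partial[known]) ** 2).sum(axis=1)
    bmu = int(np.argmin(d2))
    x_full = x_partial.copy()
    x_full[~known] = weights[bmu, ~known]
    return x_full, bmu

# Hypothetical trained codebook: columns 0-1 play the role of source fluxes,
# columns 2-3 the role of concentrations at monitoring locations.
codebook = np.array([
    [10.0, 20.0, 0.01, 0.02],
    [60.0, 80.0, 0.06, 0.08],
    [90.0, 30.0, 0.09, 0.03],
])
observed = np.array([np.nan, np.nan, 0.055, 0.085])  # fluxes unknown
estimate, bmu_index = estimate_missing(codebook, observed)
```

The same function covers the forward use of the surrogate: marking the concentration entries as NaN and supplying the fluxes returns estimated concentrations. In practice the variables would normally be scaled to comparable ranges before distances are computed.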

In this study, performance of the developed methodology is evaluated utilizing synthetic hydrogeologic and geochemical data for an illustrative contaminated aquifer. The advantage in using synthetic data is that the unknown data errors in the measurement data can be quantified and need not be treated as unknown quantities for evaluation purpose. Normalized Absolute Error of Estimation (NAEE) is also utilized as a measure to calculate a normalized error of estimation. Equation (5) represents NAEE [

where:

S is the number of pollution sources;

N is the number of transport stress periods;
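As an illustration of how such a measure can be computed, the sketch below assumes the commonly used form of the NAEE: the sum of absolute differences between estimated and actual source fluxes over all S sources and N stress periods, divided by the sum of the actual fluxes and multiplied by 100. That form, the function name `naee`, and the estimated fluxes in the usage example are assumptions; only the actual fluxes are taken from the illustrative study area (SP1 to SP4).

```python
import numpy as np

def naee(estimated, actual):
    """Normalized absolute error of estimation in percent (assumed definition)."""
    est = np.asarray(estimated, dtype=float)
    act = np.asarray(actual, dtype=float)
    return 100.0 * np.abs(est - act).sum() / act.sum()

# Rows are sources S1-S3, columns are stress periods SP1-SP4 (kg/day).
actual_fluxes = [[0, 0, 0, 0], [60, 20, 45, 50], [80, 58, 22, 30]]
# Made-up estimates, for illustration only.
estimated_fluxes = [[0, 0, 0, 0], [55, 25, 40, 50], [70, 60, 30, 35]]
error_percent = naee(estimated_fluxes, actual_fluxes)  # 100 * 40 / 365
```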

The illustrative study area utilized for the performance evaluation of the proposed methodology is a homogeneous aquifer which consists of one confined layer (

Parameter | Unit | Value |
---|---|---|
Maximum length of study area | m | 1000 |
Maximum width of study area | m | 1500 |
Saturated thickness, b | m | 7.6 |
Grid spacing in x-direction | m | 50 |
Grid spacing in y-direction | m | 50 |
Horizontal hydraulic conductivity | m/d | 18 |
Porosity | Dimensionless | 0.25 |
Longitudinal dispersivity | m | 35 |
Ratio: H/L dispersivity | Dimensionless | 0.2 |
Specific yield | Dimensionless | 0.2 |
Confined storage coefficient | Dimensionless | 0.2 |
Initial contaminant flux | kg/day | 0 - 100 |

Potential contaminant source location (row, column) | SP1 flux (kg/day) | SP2 flux (kg/day) | SP3 flux (kg/day) | SP4 flux (kg/day) | SP5 flux (kg/day) |
---|---|---|---|---|---|
S1 (5, 10) | 0 | 0 | 0 | 0 | 0 |
S2 (6, 13) | 60 | 20 | 45 | 50 | 0 |
S3 (7, 6) | 80 | 58 | 22 | 30 | 0 |

ID | Row | Column | Abstraction rate, SP1 (m^{3}/day) | Abstraction rate, SP2 (m^{3}/day) | Abstraction rate, SP3 (m^{3}/day) | Abstraction rate, SP4 (m^{3}/day) | Abstraction rate, SP5 (m^{3}/day) |
---|---|---|---|---|---|---|---|
Abstraction well 1 | 10 | 4 | −100.25 | −100.25 | −68 | −16 | −49 |
Abstraction well 2 | 10 | 8 | −100.25 | −80.2 | −96 | −100.25 | −88 |

In this study, SOM based surrogate models and SOM based adaptive surrogate models are utilized to characterize unknown groundwater contaminant sources as an inverse problem. The following steps are followed to select the best SOM based surrogate model among constructed models for illustrative study area; then, the SOM based adaptive surrogate model is developed.

1) Scenarios for initial sampling: Latin Hypercube Sampling (LHS) is used to randomly generate two groups of 1000 initial sample sets. These sample sets are generated by assuming that all three potential sources are active through the first four stress periods, SP1 to SP4. Also, three groups of 100 sample sets are generated by assuming that in each group at least one of the sources is inactive. The contaminant source fluxes are assumed to be in the range of 0 - 100 kg/day for all potential sources. For all of the generated sample sets, the three potential contaminant source fluxes at five different stress periods and their corresponding contaminant concentration magnitudes at specified monitoring locations and specific stress periods are selected as the variables of the surrogate models for this study area.

2) Generating training data: the solution results of the numerical simulation models for generated initial sample sets are obtained in this step. The numerical flow and transport simulation models MODFLOW and MT3DMS (within GMS 7) are solved to obtain adequate sample data for training and testing of the surrogate models.

3) Construction of the SOM based surrogate model: in this step, SOM algorithm is utilized to create SOM based surrogate models. It is supposed that if SOM based surrogate models are constructed accurately, these models could properly approximate the groundwater flow and transport simulation models.

S1-SP1 flux (kg/day) | S1-SP2 flux (kg/day) | S1-SP3 flux (kg/day) | S1-SP4 flux (kg/day) | … | M1-SP1 concentration (g/l) | M1-SP2 concentration (g/l) | M1-SP3 concentration (g/l) | M1-SP4 concentration (g/l) | M1-SP5 concentration (g/l) | … |
---|---|---|---|---|---|---|---|---|---|---|
42 | 44 | 41 | 97 | … | 0.00 | 0.03 | 0.09 | 0.14 | 0.00 | … |
56 | 73 | 24 | 54 | … | 0.00 | 0.01 | 0.06 | 0.19 | 0.00 | … |
39 | 76 | 74 | 23 | … | 0.00 | 0.06 | 0.13 | 0.15 | 0.00 | … |
80 | 0 | 58 | 39 | … | 0.00 | 0.02 | 0.05 | 0.08 | 0.00 | … |
0 | 0 | 0 | 0 | … | 0.00 | 0.05 | 0.11 | 0.18 | 0.00 | … |

4) Testing and validation of the SOM based surrogate model: the constructed SOM based surrogate models are tested by 100 new random sample sets. The contaminant source fluxes of these sample sets are generated randomly by using LHS method in the range of 0 - 100 kg/day. Then, the corresponding contaminant concentration values at monitoring locations are obtained by utilizing the simulation models. In this stage, different surrogate models representing different numbers of initial sample sizes, and SOM map units are constructed and evaluated. The evaluation results lead to selection of the best candidate SOM based surrogate model among the constructed surrogate models for the illustrative study area.

As mentioned in the methodology section, the definition of the BMU of the SOM algorithm (Equation (3)) is similar to the definition of the implicit objective function of the source identification problem. Therefore, the BMU of the SOM algorithm is utilized for estimating unknown characteristics (magnitude, location and duration) of potential contaminant sources. By using the information in the known components of the input vector, the algorithm estimates the unknown components of the input vector. In this study, this capability of the SOM algorithm is utilized to characterize unknown groundwater contaminant sources as an inverse problem. It is also utilized to estimate contaminant concentration values at specified locations and times when the contaminant sources and their characteristics are known.

For performance evaluation of source characterization capabilities utilizing the trained SOM surrogate models, the contaminant concentration values at monitoring locations at specific times are considered as known variables of an input vector. This vector needs to have the same number of variables as the input vectors of training phase.

| S1-SP1 (kg/day) | S1-SP2 (kg/day) | S1-SP3 (kg/day) | S1-SP4 (kg/day) | S2-SP1 (kg/day) | S2-SP2 (kg/day) | S2-SP3 (kg/day) | S2-SP4 (kg/day) | S3-SP1 (kg/day) | S3-SP2 (kg/day) | S3-SP3 (kg/day) | S3-SP4 (kg/day) | M1-SP1 (g/l) | M1-SP2 (g/l) | M1-SP3 (g/l) | M1-SP4 (g/l) | M1-SP5 (g/l) | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  |  |  |  |  |  | 0.00 | 0.04 | 0.10 | 0.14 | 0.00 | … |
|  |  |  |  |  |  |  |  |  |  |  |  | 0.00 | 0.03 | 0.08 | 0.16 | 0.00 | … |
|  |  |  |  |  |  |  |  |  |  |  |  | 0.00 | 0.05 | 0.13 | 0.16 | 0.00 | … |
|  |  |  |  |  |  |  |  |  |  |  |  | 0.00 | 0.02 | 0.09 | 0.22 | 0.00 | … |
|  |  |  |  |  |  |  |  |  |  |  |  | 0.01 | 0.06 | 0.15 | 0.24 | 0.00 | … |

The contaminant source fluxes for the three potential contaminant sources at four stress periods (SP1 to SP4) are treated as unknown variables, and the BMU is utilized to estimate them. By searching for the BMU using only the known components of the input vector, the most similar vector is identified, and the missing values of the input vector are estimated from it.

5) The selected SOM based surrogate model: the selected SOM based surrogate model is used to characterize the unknown groundwater contaminant sources as an inverse problem and for further performance evaluation.

6) SOM based adaptive surrogate model: it is expected that SOM based adaptive surrogate models could improve the source characterization results. Therefore, the SOM based adaptive surrogate model is constructed for the contaminated aquifer by adding new sample sets that are randomly generated with emphasis on the preliminary (or latest) source estimation results of the selected SOM based surrogate model. 500 new sample sets are generated by utilizing LHS and considering the results obtained by utilizing the SOM based surrogate model for source identification.
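The sampling in steps 1 and 6 above can be sketched as follows. This is a simplified illustration under stated assumptions: a basic Latin hypercube is written directly in NumPy, only the four active stress periods (SP1 to SP4) are sampled, each of the three groups of 100 sets is assumed to switch off exactly one source, and the adaptive step is assumed to keep the source estimated as inactive at zero. The function names `lhs` and `flux_samples` are hypothetical, and the simulation runs that would produce the matching concentrations are not included.

```python
import numpy as np

def lhs(n, d, rng):
    """Latin hypercube on the unit cube: one point per stratum in every column."""
    # argsort of random numbers gives an independent permutation of 0..n-1 per column.
    perms = np.argsort(rng.random((n, d)), axis=0)
    return (perms + rng.random((n, d))) / n

def flux_samples(n, rng, inactive=(), n_sources=3, n_periods=4, flux_max=100.0):
    """n flux vectors in [0, flux_max]; sources listed in `inactive` are set to zero."""
    x = flux_max * lhs(n, n_sources * n_periods, rng)
    for s in inactive:
        x[:, s * n_periods:(s + 1) * n_periods] = 0.0
    return x

rng = np.random.default_rng(0)
# Step 1: two groups of 1000 sets with all sources active,
# three groups of 100 sets with one source inactive (assumed: one per group).
initial = np.vstack(
    [flux_samples(1000, rng) for _ in range(2)]
    + [flux_samples(100, rng, inactive=(s,)) for s in range(3)]
)
# Step 6: 500 additional sets reflecting an estimate that S1 (index 0) is inactive.
adaptive = flux_samples(500, rng, inactive=(0,))
training_inputs = np.vstack([initial, adaptive])
```

Each row of `training_inputs` would then be passed to the flow and transport simulation models to obtain the corresponding concentrations before the SOM is re-trained.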

To evaluate the effect of the initial sample sets on the results of the surrogate models, different surrogate models are constructed using numbers of initial sample sets ranging from 1000 to 2300. The concentration measurement data corresponding to 6 existing monitoring locations are used to construct these surrogate models. The number of SOM map units is kept constant (100 × 100 units). The best results are obtained by using 2300 initial sample sets; the average NAEE for 100 sample sets is equal to 30.4 percent. Therefore, 2300 sample sets are used as the selected initial sample sets for constructing SOM based surrogate models with different SOM map units. The 2300 sample sets consist of a subset of 2000 sample sets for which all of the potential contaminant sources are considered as active sources through SP1 to SP4, and another subset of 300 sample sets representing the scenario in which at least one of the sources is inactive in each set. The results of this constructed SOM based surrogate model for estimating contaminant concentrations at selected monitoring locations are shown in

Different SOM based surrogate models representing different numbers of SOM map units are also constructed. In these scenarios, the number of monitoring locations and the number of initial sample sets are maintained constant at 6 and 2300, respectively. The solution results for source identification and estimating contaminant concentration at monitoring locations are presented in

ID | Number of map units | Map shape | Neighborhood function | NAEE (%) of SOM based surrogate model (substituting groundwater flow and transport simulation models) |
---|---|---|---|---|
1 | 50 × 50 | Rectangular | Gaussian | 43.8 |
2 | 75 × 75 | Rectangular | Gaussian | 31.3 |
3 | 100 × 100 | Rectangular | Gaussian | 30.4 |
4 | 110 × 110 | Rectangular | Gaussian | 30.9 |
5 | 120 × 120 | Rectangular | Gaussian | 30.4 |
6 | 130 × 130 | Rectangular | Gaussian | 29.8 |

The model with 2300 initial sample sets and 100 × 100 map units is selected as the best SOM based surrogate model among the constructed SOM based surrogate models.

The developed SOM based surrogate models could approximate the groundwater flow and transport simulation models. This conclusion is based on the solution results obtained at the model evaluation and model testing stages. The solution results presented earlier lead to the selection of the most suitable surrogate model among the constructed surrogate models for the illustrative study area. This model consists of 100 × 100 SOM map units and is trained with the 2300 initial sample sets. These 2300 random initial sample sets contain the information from three potential contaminant sources and the corresponding contaminant concentrations at 6 existing monitoring locations. The solution results obtained for the contaminated study area by utilizing the measured contaminant concentration values at 6 existing monitoring locations are illustrated in

The obtained results are not entirely satisfactory; the NAEE is equal to 31 percent. However, the results obtained in this stage demonstrate that S1 is an inactive source. This result is also obtained by the other constructed SOM based surrogate models. Therefore, in order to improve the accuracy of the results, it may be necessary to incorporate new samples and possibly construct a SOM based adaptive surrogate model for unknown groundwater contaminant source identification. 500 new sample sets are generated by utilizing LHS and considering that S1 is an inactive source. The solution results for SOM based adaptive surrogate models are illustrated in

solution results of SOM based adaptive surrogate models increases by 11 percent when compared to the results obtained using the previously selected SOM based surrogate model.

Moreover, to continue the performance evaluation of the developed SOM based adaptive surrogate model and the previously selected SOM based surrogate model, synthetic erroneous concentration measurement data are utilized. For this purpose, the simulated contaminant concentrations at the monitoring locations are perturbed with varied amounts of random error, i.e., 5, 10, 15, 20, 25 and 30 percent of the simulated values. The following equation is utilized for synthetically generating the perturbed concentration measurement values with random errors [

where

a is maximum deviation expressed as a percentage; and

b is a random fraction between +1 and −1 obtained by utilizing the LHS.
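A minimal sketch of this perturbation is given below. It assumes the commonly used form in which the perturbed value equals the simulated value plus a × b × the simulated value, with a converted from percent to a fraction; that form, the function name `perturb`, and the use of a plain uniform draw for b in place of LHS are assumptions made for the example.

```python
import numpy as np

def perturb(c_sim, a_percent, rng):
    """Perturb simulated concentrations by at most a_percent of their value."""
    c_sim = np.asarray(c_sim, dtype=float)
    # b: random fraction between -1 and +1 (plain uniform draw, not LHS).
    b = rng.uniform(-1.0, 1.0, size=c_sim.shape)
    return c_sim + (a_percent / 100.0) * b * c_sim

rng = np.random.default_rng(0)
simulated = np.array([0.00, 0.03, 0.09, 0.14, 0.00])  # g/l, sample values
measured_10 = perturb(simulated, 10.0, rng)
```

Because the error is proportional to the simulated value, zero concentrations remain zero and every perturbed value stays within a percent of the simulated one.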

The source characterization results obtained with these erroneous concentration measurements are shown in

These solution results shown in

The performance evaluation results of the SOM based surrogate model are not entirely satisfactory. These very limited results show that it could approximate groundwater flow and transport simulation models properly. However, to increase the efficiency of the developed methodology, additional training incorporating different actual source location scenarios was developed. The evaluation results also indicated that the quantity and quality of initial sample sets and the number of SOM map units have a crucial role in the efficiency of the model (

1) Using the concentration data at designed monitoring locations, designed for improving source characterization [

2) Exploring other methods to generate initial random sample sets;

3) Utilizing optimal number of variables in the designing of surrogate models by selecting only those available monitoring locations which affect the accuracy of identifying pollution sources; and

4) Applying sequential sampling method as in SOM based adaptive surrogate models by considering the previous stage results.

It can be concluded that the SOM based surrogate model and the SOM based adaptive surrogate model could be utilized to identify unknown characteristics of potential contaminant sources in contaminated aquifers. They could also be applied to estimate the contaminant concentration values at specified monitoring locations if the contaminant sources are known. In particular, if additional information based on earlier estimates of the contaminant source characteristics is incorporated in the training stage as new samples, the estimation can become more accurate. This is essentially the adaptive surrogate model based optimization approach. One of the advantages of this methodology is the consistency of solution results for ideal (error free concentration measurements) and real (contaminant concentration measurements incorporating up to 20 percent error) scenarios. This observation may be relevant only when limited numbers of initial samples are utilized. Therefore, the method selected to generate relevant initial sample sets has an important role in the solution results. Also, utilizing sample sets of sufficient size is necessary.

Different scenarios corresponding to different surrogate models with various numbers of initial samples and Self-Organizing Map (SOM) map units are considered. Also, the performance of the developed methodology is evaluated by utilizing the SOM based surrogate model to identify potential contaminant sources for an ideal scenario of error free concentration data, as well as for scenarios with different degrees of erroneous concentration measurement data. In addition, an improved version of the SOM based surrogate model, i.e., the SOM based adaptive surrogate model (ASMBO), is constructed to characterize potential contaminant sources. The main conclusions that can be drawn from these limited performance evaluation results are:

1) SOM based surrogate models are potentially efficient methods to approximate groundwater flow and transport simulation models. The developed methodology can be used as an alternative methodology for unknown groundwater contaminant sources characterization, which can potentially eliminate the necessity of using other widely used methodologies, i.e., the linked simulation optimization methodology.

2) The quantity and quality of the initial samples are important. The sample size should be adequate, and the samples should cover the whole plausible range of contaminant source fluxes for all the potential contaminant sources.

3) The number of SOM map units is important. The best size should be selected according to the memory of the PC used, the number of variables, and the initial sample size.

4) The performance evaluation results do show comparatively large errors in terms of the specific error criteria utilized. However, a comparison of the source estimates and the actual source characteristics shows a good match.

5) The most important conclusion is that the SOM based surrogate models may provide a feasible methodology for characterization/identification of unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity, without the necessity of using a linked simulation optimization model, when the ASMBO methodology is adopted. However, it appears likely that the accuracy of characterization may not be adequate in real life scenarios with multiple sources, complex hydrogeology of the aquifer, and parameter estimation uncertainties.

6) The SOM based models seem to perform satisfactorily when concentration measurement data are erroneous.

7) The performance evaluation results presented in this study are based on very limited scenarios. More rigorous performance evaluations, incorporating random heterogeneity of hydrogeologic parameters and more complex geochemical processes, are necessary to establish the applicability of the proposed methodology for source identification without using any optimal decision model.

The second author thanks CRC-CARE, Australia for providing financial support for this research through Project No. 5.6.0.3.09/10(2.6.03), CRC-CARE-Bithin Datta which partially funded the Ph.D. scholarship of the first author.

Hazrati-Yadkoori, S. and Datta, B. (2017) Adaptive Surrogate Model Based Optimization (ASMBO) for Unknown Groundwater Contaminant Source Characterizations Using Self-Organizing Maps. Journal of Water Resource and Protection, 9, 193-214. https://doi.org/10.4236/jwarp.2017.92014