Document Type : Original Article
Authors
1
Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran
2
Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran
10.48308/envs.2024.1433
Abstract
Introduction: Digital soil mapping based on machine learning methods is increasingly used to predict the spatial distribution of soil classes and soil properties. However, in soil science studies, digital soil mapping is hampered by imbalance among soil classes, which degrades the performance of machine learning algorithms. This study therefore aims to improve the classification of imbalanced soil classes through two approaches, resampling and cost-sensitive learning, using the random forest prediction model in Zanjan Province.
Materials and Methods: A total of 148 soil samples were collected based on a random classification pattern with 500 m spacing and subjected to physical and chemical laboratory analyses following standard methods. Candidate environmental covariates included geomorphological and geological maps, a digital elevation model (DEM), and Landsat 8 satellite images; the inputs for soil class prediction were selected from these based on expert opinion and principal component analysis (PCA). Information from the geomorphological and geological maps and features extracted from the DEM were identified as the most effective predictors of soil classes and were chosen as model inputs. Among the DEM derivatives, analytical hill shading (AHS), sunrise, valley depth, LS_factor, channel network distance (CND), topographic wetness index (TWI), and multi-resolution ridge top flatness index (MRRTF) were selected as the most effective environmental variables and accounted for most of the spatial variability of the soils in the region. Soil-landscape relationships were modeled with the Random Forest algorithm. Imbalanced data were corrected in two ways: by resampling, using the ubOver and ubUnder functions, and by cost-sensitive learning, using the rf function in the Random Forest package in the RStudio software environment.
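The two balancing ideas named above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the study's R workflow; it does not reproduce ubOver, ubUnder, or the Random Forest package. The function names `random_oversample` and `inverse_frequency_weights`, the toy labels, and the inverse-frequency weighting rule are all assumptions introduced here. Random oversampling duplicates minority-class samples until all classes are equally frequent, while cost-sensitive learning leaves the data unchanged and instead gives rare classes a larger weight during model fitting.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as frequent as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency, scaled so that a
    perfectly balanced data set gives every class a weight of 1
    (an assumed weighting rule, not taken from the study)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy data: class "A" is the majority, class "B" the minority.
toy_labels = ["A"] * 6 + ["B"] * 2
toy_samples = list(range(len(toy_labels)))

balanced_samples, balanced_labels = random_oversample(toy_samples, toy_labels)
print(Counter(balanced_labels))
print(inverse_frequency_weights(toy_labels))
```

In practice such weights would be passed to a classifier that supports per-class weighting, so that misclassifying a rare soil class costs more than misclassifying a common one.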
Results and Discussion: Soil subgroups were classified into five imbalanced classes: Typic Calcixerepts, Typic Haploxerepts, Gypsic Haploxerepts, Typic Xerorthents, and Lithic Xerorthents. Validation showed that the soil map produced from the imbalanced data had an overall accuracy (OA) of 65% and a kappa coefficient of 0.32. After balancing the data through resampling, these values increased to 71% and 0.54, and with the cost-sensitive learning approach they reached 86% and 0.77. The Gypsic Haploxerepts and Lithic Xerorthents subgroups, which are minority classes, were not identified and were excluded when the imbalanced classes were used. After data improvement and augmentation with both the resampling and cost-sensitive learning approaches, however, the prediction accuracy of these two minority classes improved to an acceptable level.
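For readers unfamiliar with the two validation metrics, the following minimal sketch shows how overall accuracy and the kappa coefficient are computed from a confusion matrix. The function name `overall_accuracy_and_kappa` and the 2 x 2 toy matrix are illustrative assumptions and do not represent the study's validation data.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of validation sites whose observed class
    is i and whose predicted class is j. Kappa is undefined when the
    chance agreement equals 1; that case is not handled here.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: share of sites on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy matrix (not from the study): 35 of 50 sites classified correctly.
toy = [[20, 5], [10, 15]]
oa, kappa = overall_accuracy_and_kappa(toy)
print(f"OA = {oa:.2f}, kappa = {kappa:.2f}")  # OA = 0.70, kappa = 0.40
```

Because kappa discounts agreement expected by chance, it can stay low when a model scores a moderate OA mainly by predicting the majority classes, which is consistent with the gap between 65% and 0.32 reported for the imbalanced data.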
Conclusion: Model evaluation showed that when an imbalanced distribution of soil classes is used, classes with few observations are lost and the resulting maps are uncertain and relatively inaccurate. After data balancing, the accuracy of models based on soil-topography relationships improves in digital soil mapping studies. The cost-sensitive learning approach, which focuses on classes with few observations, performed best and can be applied as a superior approach in other areas. Because research on imbalanced soil data is limited, this study offers an effective way to deal with imbalanced soil classes and to produce digital soil maps with high accuracy.
Keywords