Local attribute-similarity weighting regression algorithm for interpolating soil property values

: Existing spatial interpolation methods estimate the property values of an unmeasured point with observations of its closest points based on spatial distance (SD). However, considering that properties of the neighbors spatially close to the unmeasured point may not be similar, the estimation of properties at the unmeasured one may not be accurate. The present study proposed a local attribute-similarity weighted regression (LASWR) algorithm, which characterized the similarity among spatial points based on non-spatial attributes (NSA) better than on SD. The real soil datasets were used in the validation. Mean absolute error (MAE) and root mean square error (RMSE) were used to compare the performance of LASWR with inverse distance weighting (IDW), ordinary kriging (OK) and geographically weighted regression (GWR). Cross-validation showed that LASWR generally resulted in more accurate predictions than IDW and OK and produced a finer-grained characterization of the spatial relationships between SOC and environmental variables relative to GWR. The present research results suggest that LASWR can play a vital role in improving prediction accuracy and characterizing the influence patterns of environmental variables on response variable. Local attribute-similarity weighting regression algorithm for interpolating soil property values. Int Agric Eng,


Introduction
Spatial interpolation is an important field in spatial analysis and is used in many cases where interested property values at some points are required but not measured [1][2][3] . Conceptually, spatial interpolation is a process that estimates the interested property values at The proximity among spatial points is generally assessed using spatial distance (SD), commonly defined as Euclidean distance based on their geographical coordinates [4,5] .
SD is widely used to search the k-nearest neighbors for unmeasured spatial point whose property values are to be estimated with spatial interpolation methods. Common used interpolation methods include OK [1] , GWR [6] , regression kriging (RK) [7] , local RK [8,9] and geographically weighted regression kriging (GWRK) [10] . These methods are also 96 September, 2017 Int J Agric & Biol Eng Open Access at https://www.ijabe.org Vol. 10 No.5 based on the aforementioned assumption, which is applicable to estimation of property values with good spatial dependency. For properties with less spatially dependent, other factors may play a more important role than SD in determining the interested properties value of spatial points. In other words, the property values among spatial points may not be similar even though they are spatially close to each other [11][12][13] . Thus, the estimation of soil property at A 0 with its three spatially closer neighbors of A 1 -A 3 is less accurate and reliable than estimation made with A 4 -A 6 . This illustrates that spatial distance is not good choice to characterizing the similarity between a pair of points under complex geographical environments.
Note: A 0 represents the soil location to be estimated, and A 1 -A 6 for six soil measured locations. The points of A 1 -A 3 are more spatially close, and points of  is expected to provide more accurate predictions than based on SD. The following sections will discuss the LASWR algorithm in detail.  (2) and (3) are the key steps in LASWR and are discussed below.

Search k nearest neighbors based on NSA
Given n number of measured spatial points and each of unmeasured points x 0 , LASWR calculated all distances of x 0 to the n measured spatial points based on NSA and selected the k number of measured points closest to x 0 as its k-nearest neighbors. The resulted k nearest neighbors was used to estimate the property values at x 0 .
LASWR used NSA to measure the similarity between any two spatial points in their non-spatial attribute space. Given an unmeasured spatial point x 0 and any measured spatial point x i , NSA was defined as the Euclidean distance of the point x 0 to x i in non-spatial attribute space rather than in geographical space as follows: where, P(x 0 ) is the row vector of the predictors at x 0 ; The unmeasured spatial points are defined as spatial objects whose spatial attributes and predictors are known, and whose response variables are yet to be predicted.
Considering that not all environmental variables are related well with the response variable, step-regression method with forward selection was used to reduce the variables not correlating well with the targeted response variable in the paper.

Estimate regression coefficients
In LASWR, the attribute value 0 ( ) z x at an unmeasured spatial point 0 x is estimated as follows: where, β 0 (x 0 ), β j (x 0 ), P j (x 0 ) and m are the regression intercept coefficient, the regression coefficient, the value of the j th predictor at the point x 0 , and the number of the predictors at x 0 , respectively. , then Equation (2) can be rewritten as follows: In fact, the regression coefficient β(x 0 ) is generally unknown, and the estimation 0 ( ) x β of β(x 0 ) depends on the k nearest neighbors of x 0 in non-spatial attribute space.
Thus, LASWR first performed a k-nearest neighbor query to estimate β(x 0 ). The regression coefficient β(x 0 )was estimated by using a weighted least squares technique to minimize the weighted sum of residual squares (WS) as follows: Differentiating with regarding to β in Equation (4), the unique solution can be obtained in the matrix notation from Equation (5).
where, P is a k×(m+1) predictor matrix of the k neighbors around the location x 0 ; W(x 0 ) is a k×k diagonal matrix whose off-diagonal elements are zero and diagonal elements denote the weighting of the neighbors for the location x 0 ; Z is the column vector of response variables of the k neighbors around location x 0 . As a result, the estimation 0 ( ) z x at location x 0 is as follows: where, P 0 is the m+1 vector of m predictors at x 0 .

Construct weighting function
In LASWR, the weights are in accordance with the closeness of the neighbors to the point x 0 in non-spatial attribute space. The closer the neighbors to x 0 are, the higher their weights are. Various weighting functions were discussed and used to calculate their weights [14] . A Gaussian kernel function was used in the present study to calculate the weights. The weight W(x i 0 ) of the i th neighbor for the point x 0 is calculated in Equation (7).
where, d i0 is the Euclidean distance between the point x 0 and its i th neighbor in non-spatial attribute space rather than in geographical space and calculated as Equation (1), and α is the kernel bandwidth for x 0 .
The bandwidth α can be simply considered as the radius of this influence region, and the parameter α controls the kernel bandwidth at location x 0 [14] .

Data acquisition and pre-processing
In this research, the real soil data sets were collected from two study areas for evaluating the performance of LASWR. These two study area were Pantang area (PTA) in Taoyuan  to the data of SOC and TSN before applying the prediction models. The pre-processing of spatial data was carried out with ArcGIS9.0 software.

Model calibration and validation
To assess the performances of NSA and SD, the observed datasets containing SOC or TSN in the two study areas were randomly split into two parts: 75% as training data and 25% as validation data, respectively.
The mean similarity index (MSI) was calculated to assess how close the SOC or TSN value at each location in validation dataset was to ones of their neighbors in training data in Equation (8): where, k is the given number of the nearest neighbors; p is the number of validation samples; Z j is the observed value of the j th validation sample, and Z ij is the value of its i th nearest neighbor of the j th validation sample.
Generally, the smaller the MSI value is, the more similar their soil properties (SOC or TSN) are.
The ten-fold cross-validation method was used to evaluate the prediction performance of the four models (IDW, OK, GWR and LASWR). The two real data sets were randomly split into ten equally sized sub-datasets, respectively. For each round, nine of the 10 sub-datasets were used as training data and the rest one sub-dataset was used as validation data to test the models. The process was repeated 10 times with each of the 10 subsamples used exactly once as the validation data.
In order to improve the reliability of the validation for a given method, the parameters that obtain the minimum root mean square error (RMSE) value are considered optimal when using leave-one-out cross-validation [15] .
Leave-one-out cross-validation was first performed to obtain the optimal parameters of the four methods for the observed data for both SOC and TSN. These optimal parameters were then applied in cross-validation.
The performance of IDW, OK, GWR and LASWR were evaluated by determining how close the predicted value 0 ( ) z x is to the measured value at each location x i . Two indices were used to evaluate the performance: mean absolute error (MAE) and RMSE, defined as follows: where, n, ˆ( )

Programming and implementation
The four methods of IDW, OK, GWR and LASWR were programmed and implemented using Matlab9.0 in Windows XP. For OK, the software package of mGstat was used to obtain the parameters for the semivariogram. mGstat provides an interface for Gstat, which is commonly used for geostatistical modeling.
The interface makes it straightforward to call Gstat using Matlab scripting language. The four datasets of the interpolated SOC across the JJW by IDW, OK, GWR and LASWR were firstly all saved in a excel format, and finally produced to raster maps with ArcGIS9.0 software.

Performance comparison of NSA with SD
To compare the performance of NSA and SD, The MSI values of sampled locations was computed in the PTA and JJW study areas. The smaller the MSI value was, the larger the similarity between the observations for a soil sample and its k nearest neighbors was.
Given a value of k, the MSI value was computed using Equation (8). The results are shown in Figure 3,     OK, GWR and LASWR were illustrated in Figure 4.
The maps of the regression coefficients between SOC and the co-variables by LASWR and GWR were presented in   The regression coefficients by LASWR and GWR spatially varied across the area of PTA ( Figure 6). The regression coefficients between SOC and elevation by LASWR (Figures 6d-6f) presented better spatial patterns of the influencing of elevation and slope on SOC than GWR (Figures 6a-6c) in the case of PTA. The above experimental results demonstrated that LASWR characterized spatial distributions of regression coefficients between SOC and co-variables better than GWR.

Conclusions
The proposed LASWR method based on NSA generally achieved better prediction performance, compared to the methods based on SD (IDW, GWR and OK). One major advantage of LASWR is that it can produce more fine-grained maps that depict the relationships between environmental co-variables and soil conditions. This can assist in better understanding how environmental factors influence soil conditions. Therefore, LASWR can play a vital role in improving the prediction accuracy and reflecting the influencing patterns of environmental variables on soil conditions.
Considering that the results were obtained on the datasets of SOC and TSN at the landscape and watershed scales, datasets of different soil conditions at larger spatial scales may be necessary to further validate NSA and LASWR.