Depth recovery for unstructured farmland road image using an improved SIFT algorithm

Road visual navigation relies on accurate road models. This study proposes an improved scale-invariant feature transform (SIFT) algorithm for recovering depth information from farmland road images, with the aim of providing a reliable path for visual navigation. The mean image of the pixel values in five channels (R, G, B, S and V) was treated as the inspected image, and its feature points were extracted by the Canny algorithm to locate the feature points precisely and to ensure their uniformity and density. The mean values of the pixels in the 5×5 neighborhood around each feature point, taken in eight directions at intervals of 45°, were then treated as the feature vector, and the differences between feature vectors were calculated for preliminary matching of the feature points in the left and right images. Mismatched points were then eliminated by the energy method of feature points so that the depth information of the farmland road images could be recovered. Experiments with a binocular stereo vision system showed that the matching accuracy and time consumption for depth recovery with the improved SIFT algorithm were 96.48% and 5.6 s, respectively, with a depth recovery error of -7.17% to 2.97% within a certain sight distance. The mean uniformity, time consumption and matching accuracy for all 60 image pairs under various weather and road conditions were 50%-70%, 5.0-6.5 s, and higher than 88%, respectively, indicating that the improved SIFT algorithm was superior to the conventional SIFT algorithm in feature point uniformity, matching accuracy and real-time performance. This study provides an important reference for machine-vision-based navigation technology for agricultural equipment.


Introduction
Intelligent agricultural machinery navigation technology based on machine vision has attracted increasing attention in recent decades because it captures rich details of farmland and adapts well to complicated environments. Road recognition coupled with three-dimensional (3D) reconstruction is one of the research hotspots. West [1] addressed the influence of shadows on image extraction by using the color information in road images. Feng et al. [2][3][4] proposed a method for determining appropriate driving paths for navigation vehicles by analyzing the color and gray features of different areas. To extract the crop center directly under illumination for driving purposes, Bengochea-Guevara et al. [5] used a threshold segmentation method that considered the field contour and crop type. Bao et al. [6] first used a trapezoid prediction model to detect the edge shape characteristics of the road, and then used an improved support vector machine (SVM) classifier to recognize the connected road area. Choi et al. [7] studied texture, morphology and other geometric features of the work area to overcome the impact of environmental changes on road images. In addition, neural networks have been introduced into the study of route identification and obstacle location [8,9].
A detailed 3D road model is important for accurate real-time control of auto-navigation vehicles, and accurate extraction of image features is critical for reconstructing the farmland road model. Kostavelis et al. [10] obtained the parallax map through a stereo camera and recovered the 3D information of the road surface by exploiting the repeatability and robustness of SURF. Vitor et al. [11] analyzed and compared two extraction methods for urban roads, and found that the standardized histogram feature descriptor was superior to the artificial neural network in detecting road areas. Li et al. [12] mapped the 3D point cloud onto the 2D image and described the important features of the 3D scene using local descriptors. The scale-invariant feature transform (SIFT) algorithm [13], which is highly robust, has long been a preferred algorithm for stereo vision and 3D scene reconstruction. Shao et al. [14] proposed a binocular stereo vision calibration method based on parallel optical axes, in which a feature matching algorithm extracted the pixel coordinates of matching points to measure obstacle distances. Guo et al. [15] located the picking points of litchi by SIFT binocular stereo matching based on cosine similarity, and achieved a matching success rate of 89.55%. Tian et al. [16,17] adopted a multi-threaded SIFT feature point detection method that reduced the time consumed by road detection. To integrate feature matching into underwater acoustic image navigation, Li et al. [18] used a SIFT feature matching algorithm to extract feature points from underwater acoustic images. To obtain the position error and yaw error simultaneously in a SAR scene-matching aided navigation system, Ren et al. [19] proposed an image matching algorithm based on SIFT.
Previous studies mainly focused on using image processing techniques to identify navigation routes, while less attention has been paid to the microscopic details of farmland roads. Although the speeded-up robust features (SURF) and SIFT algorithms have been used for feature extraction and matching on farmland roads, their real-time performance and matching accuracy still need improvement. The SIFT algorithm has the advantage of being invariant to scene rotation, scaling and brightness change. However, when the binocular cameras are placed in parallel at a small distance, the rotation and scaling between the left and right images are negligible, so the redundant rotation and scaling information can be discarded to improve the real-time performance of the SIFT algorithm. Therefore, the objective of this study was to improve and optimize the SIFT algorithm for accurate depth recovery of farmland roads under machine vision conditions. The Canny operator was first used to locate the image feature points, taking the density and uniformity of the feature point distribution into account. The feature vectors of the feature points were then extracted in different color spaces and used to match the features of the left and right images. Finally, the mismatched pairs were eliminated using the energy method, after which depth recovery was carried out. The specific objectives were to: (1) optimize the SIFT algorithm by combining it with the Canny operator; and (2) validate the performance of the improved SIFT algorithm in field experiments.

Materials
A parallel binocular stereo vision system was used in this study. It consisted mainly of a pair of Panasonic CCD cameras (WV-CP480/CH), Seiko lenses (SSV0358), an image acquisition card (DH-CG300) and a computer (Core i5, 3.2 GHz, 8G). The vision system was mounted on the front of a tractor (Huanghai Jinma 250B). The images, 640×480 pixels in size, were acquired on the farm track of the Guantang experimental farm at Zhejiang A&F University. The vision system was calibrated and error-corrected before image acquisition, and MATLAB 2012b (Mathworks, Inc., Natick, Massachusetts, USA) was used to analyze the images. The platform and the images acquired by the left and right cameras are shown in Figures 1a, 1b and 1c, respectively.

Image processing
The flowchart of image processing in this study is shown in Figure 2.
The acquired image was first subjected to homogenization processing and feature point location, and it was then judged whether the uniformity of the image met the requirement. If the requirement was met, image feature extraction, feature point matching, and feature point energy calculation were performed in sequence; otherwise the procedure returned to the homogenization step. The images that met the requirement of the energy method were finally used for depth recovery.
[20,21], but the real-time performance of the algorithm was poor. RGB and HSV are two common color spaces, and the positions where pixel values change sharply differ between them, which leads to inconsistent distribution uniformity of the extracted feature points. To improve the distribution uniformity of the feature points, the three single-channel data matrices R, G and B, together with the saturation S and gray-level V data matrices of the HSV space, were used in this study. The new inspected image matrix I_mean, the mean image of the pixel values in the five channels, could be calculated as follows:

$$I_{mean} = \frac{R + G + B + S + V}{5} \qquad (1)$$
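The five-channel mean described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' MATLAB code: the function name `five_channel_mean` is hypothetical, and rescaling S and V to the 0-255 range of the RGB planes is an assumption, since the text does not state how the channels were scaled.

```python
import numpy as np

def five_channel_mean(rgb):
    """Return the per-pixel mean of the R, G, B, S and V planes.

    rgb: uint8 array of shape (H, W, 3).
    Assumption: S and V are rescaled to 0..255 so that all five planes
    share one range; the source does not state the scaling it used.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)              # HSV value: max of R, G, B
    c_min = rgb.min(axis=-1)
    # HSV saturation: (max - min) / max, defined as 0 where max == 0
    s = np.zeros_like(v)
    nonzero = v > 0
    s[nonzero] = (v[nonzero] - c_min[nonzero]) / v[nonzero]
    s *= 255.0
    return (r + g + b + s + v) / 5.0

# Two-pixel demo image: pure red and mid gray
demo = np.array([[[255, 0, 0], [128, 128, 128]]], dtype=np.uint8)
i_mean = five_channel_mean(demo)
```

For the pure red pixel the five planes are 255, 0, 0, 255 and 255, so the mean is 153; for the gray pixel the saturation is zero and the mean is 102.4.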

Extracting feature points
The Canny operator is a multi-level edge detection algorithm [22] that suppresses noise effectively and determines edge positions accurately. In this study, the feature points of the new inspected image were first extracted and located by the Canny operator, and the feature values of the feature points were then determined using the SIFT algorithm. The discrete Gaussian filter in the Canny operator preserves both the main outlines and the fine details of the image, which further improves the uniformity of the feature point distribution. Figure 4a is a binary image of the feature point distribution extracted by this method. To show the locations of these feature points more clearly, they were enlarged and displayed on the true-color scene image, as shown in Figure 4b. The Canny operator used in this study had a 9×9 2D discrete Gaussian filter with a standard deviation of 1 and a ratio of low threshold to high threshold of 0.4, which yielded 3995 feature points. Let T denote the number of feature points extracted by the method in Section 2.2. To analyze the uniformity quantitatively, the image in Figure 4a was evenly divided into k block regions; here it was divided into 25 blocks (5×5), each 128×96 pixels in size. The uniformity of the feature point distribution was described by a variable U, which could be calculated using Equation (2).
where N_i is the number of feature points in block i, and T is the total number of feature points.
Since Equation (2) is normalized, the closer U is to 1, the better the uniformity of the feature points. The feature point uniformity of the conventional SIFT method and of the proposed method can therefore be compared through U. In images of the same size, a larger number of feature points means a higher density. To ensure that the uniformity of the two methods was compared at the same density, a specific threshold was set in the SIFT algorithm so that the numbers of feature points were approximately equal [8]. Figures 5a and 5b show the feature point distributions extracted by the conventional SIFT method and by the proposed method under a single channel (R channel), respectively, and Table 1 gives the corresponding quantitative analysis of feature point uniformity. The feature points extracted by the proposed method were more uniformly distributed than those extracted by the conventional SIFT algorithm. Moreover, the mean operation over multiple channels performed better than a single channel.
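The block-wise counting behind U can be illustrated as follows. Equation (2) itself is not reproduced in this text, so the normalization in `uniformity` is an assumed stand-in (one minus the normalized total absolute deviation of the block counts from T/k). It shares only the stated property that U approaches 1 for an even distribution, and should not be read as the authors' formula. Both function names are hypothetical.

```python
import numpy as np

def block_counts(mask, rows=5, cols=5):
    """Count feature points (nonzero pixels) in each of rows x cols blocks."""
    h, w = mask.shape
    bh, bw = h // rows, w // cols
    trimmed = mask[:bh * rows, :bw * cols] != 0
    # Axes after reshape: (block row, row in block, block col, col in block)
    return trimmed.reshape(rows, bh, cols, bw).sum(axis=(1, 3))

def uniformity(counts):
    """Assumed normalized measure: 1 = perfectly even, 0 = all in one block."""
    counts = np.asarray(counts, dtype=np.float64)
    k, total = counts.size, counts.sum()
    if total == 0 or k < 2:
        return 0.0
    deviation = np.abs(counts - total / k).sum()
    return 1.0 - deviation / (2.0 * total * (1.0 - 1.0 / k))

# 480 x 640 binary maps split into 5 x 5 blocks of 96 x 128 pixels
even = np.zeros((480, 640), dtype=np.uint8)
even[48::96, 64::128] = 1          # one point at the centre of every block
clustered = np.zeros((480, 640), dtype=np.uint8)
clustered[:5, :5] = 1              # 25 points packed into a single block

u_even = uniformity(block_counts(even))
u_clustered = uniformity(block_counts(clustered))
```

With these two synthetic maps the measure sits at its two extremes: about 1 for one point per block and about 0 when all 25 points share a block.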

Extracting feature value
Feature values serve as the matching basis for depth reconstruction of the farmland road images. The feature vectors extracted by the conventional SIFT algorithm are invariant to scene rotation and scaling. The rotation and scaling changes between the left and right images were negligible because the binocular cameras were placed parallel to each other at a small distance. However, the affine transformation between the left and right images could not be ignored because of the parallax between the two cameras. To improve the real-time performance of the algorithm, the redundant information, such as rotation and scaling, was discarded. Since the feature points of the image were obtained from the mean values of R, G, B, S and V, the feature values were also extracted in these five channels to describe the feature points. The specific steps were as follows:
Step 1: Determine the dimension of the feature value. To describe each feature point accurately and comprehensively, the information of the points adjacent to it must be examined. The neighborhood scope was enlarged in this study to reduce the effect of the affine transformation between images. Therefore, for a feature point f(x,y) in a channel, the pixel values in its 5×5 neighborhood were analyzed (Figure 6a). The values v(1)-v(8), taken in eight directions around the feature point f(x,y) at intervals of 45 degrees, were used to represent the feature vector of this point, as shown in Figure 6b. The values v(1)-v(8) can be calculated using Equation (3), where v(1) and v(2) are the mean values of the pixels in the corresponding red and blue dotted boxes in Figure 6a, respectively. In this way, a 5×8 feature vector V was generated for each feature point. Because its elements are mean values, the feature vector V restrains image noise and remains stable under the affine transformation.
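A rough sketch of how such a 5×8 descriptor might be assembled is given below. The exact pixel groups (the dotted boxes in Figure 6a) and Equation (3) are not available in this text, so the sketch assumes that each directional value is the mean of the two pixels lying one and two steps from the feature point along that direction. The names `DIRECTIONS` and `descriptor` are hypothetical.

```python
import numpy as np

# Unit steps (d_row, d_col) for 0, 45, ..., 315 degrees, counter-clockwise,
# with image rows growing downward.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]

def descriptor(channels, row, col):
    """Build a (channels x 8) descriptor for the point (row, col).

    channels: sequence of 2D arrays (here the R, G, B, S and V planes).
    Assumption: each entry is the mean of the two pixels at steps 1 and 2
    along one direction inside the 5x5 neighborhood.
    The point must lie at least 2 pixels away from every image border.
    """
    feat = np.empty((len(channels), len(DIRECTIONS)))
    for ci, plane in enumerate(channels):
        for di, (dr, dc) in enumerate(DIRECTIONS):
            samples = [plane[row + step * dr, col + step * dc]
                       for step in (1, 2)]
            feat[ci, di] = np.mean(samples)
    return feat

# Synthetic 5x5 planes: values 0..24, scaled differently per channel
plane = np.arange(25, dtype=np.float64).reshape(5, 5)
channels = [plane * (i + 1) for i in range(5)]
feat = descriptor(channels, 2, 2)
```

Averaging over small pixel groups rather than sampling single pixels is what gives the descriptor its tolerance to noise, as the text notes.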
Step 2: Determine the initial matching point pairs. The Marr polar constraint principle was used to narrow the range of potential matching point pairs in the left and right images, and Equation (4) was then used to calculate the difference between the feature values of each point pair. Assuming that the 5×8 feature vectors of a feature point in the left and right images were V_L and V_H, respectively, D_{L-H}, computed by Equation (4) over all 5×8 elements, was used to describe the difference between the two feature points. A smaller D_{L-H} indicates a smaller difference between the two feature points. Hence the point pair with the minimum D_{L-H} value was taken as the initial matching point pair.
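The minimum-difference search of Step 2 can be sketched as below, under two assumptions that are not confirmed by the text: the difference D is taken as the sum of absolute differences over the 5×8 entries, and the candidate range is narrowed by keeping only right-image points within `row_tol` rows of the left point, as would hold for a rectified parallel camera pair. All names are hypothetical.

```python
import numpy as np

def feature_distance(v_left, v_right):
    """Assumed form of D: sum of absolute differences over all entries."""
    return float(np.abs(np.asarray(v_left) - np.asarray(v_right)).sum())

def initial_matches(left_pts, left_desc, right_pts, right_desc, row_tol=1):
    """Pair each left point with the candidate right point of minimum D.

    *_pts: lists of (row, col); *_desc: lists of 5x8 arrays.
    Candidates are right points within row_tol rows of the left point,
    a stand-in for the constraint described in the text.
    Returns a list of (left_index, right_index, distance).
    """
    matches = []
    for i, (l_row, _) in enumerate(left_pts):
        best = None
        for j, (r_row, _) in enumerate(right_pts):
            if abs(r_row - l_row) > row_tol:
                continue  # outside the allowed search band
            d = feature_distance(left_desc[i], right_desc[j])
            if best is None or d < best[1]:
                best = (j, d)
        if best is not None:
            matches.append((i, best[0], best[1]))
    return matches

left_pts = [(10, 50), (40, 80)]
right_pts = [(10, 42), (11, 60), (40, 71)]
left_desc = [np.full((5, 8), 10.0), np.full((5, 8), 200.0)]
right_desc = [np.full((5, 8), 11.0), np.full((5, 8), 90.0),
              np.full((5, 8), 198.0)]
matches = initial_matches(left_pts, left_desc, right_pts, right_desc)
```

Restricting the candidates before computing D is what keeps the search cheap: each left point is compared only with the few right points in its band rather than with every feature point in the right image.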
Step 3: Eliminate the mismatched points. Although the neighborhood information of each point was fully used in searching for the initial matching point pairs, mismatched points still remained. For example, the point indicated by the arrow in Figure 7a was matched to the point indicated by the arrow in Figure 7b according to Equation (4), but the two obviously did not correspond. The correct matching point should be the point indicated by the arrow in Figure 7c. In this study, the mismatched points were eliminated by calculating the energy of the feature points [23]. This method makes full use of the relationship between the current initial matching point and the surrounding matching points. The main idea is as follows:
1) Determine the initial matching point pair f_l(x,y) and f_r(x,y);
2) Determine the initial matching point collections M_l and M_r, which are centered on the left and right points, respectively, with the parameter φ as the radius.
The more initial matching point pairs a collection contains, the greater the energy of the feature point. P_{L-H}, calculated using Equation (5), describes the probability that a point pair is correctly matched, where m_l and m_r are the numbers of elements in the two collections, also called the energy, and m_{l-r} is the number of successful initial matching point pairs in the two collections. The larger P_{L-H} is, the greater the probability that the point pair is matched correctly. An appropriate threshold P was then used to judge whether the initial matching point was a correct matching point. From the coordinates of each correct matching point pair, the depth Z of the matching point on the farmland road could be calculated by Equation (6):

$$Z = \frac{2a \cdot f}{|x_r - x_l|} \qquad (6)$$

where |x_r - x_l| is the horizontal parallax of the matching points between the left and right images, 2a is the center distance between the two cameras, and f is the focal length of the camera.
Both 2a and f could be obtained through the camera calibration. By interpolating and fitting the depth information of the matching point, a complete and smooth 3D depth reconstruction map of the farmland road could be obtained, which could provide reliable information for visual navigation.
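The neighborhood check of Step 3 and the depth relation of Equation (6) can be sketched together. Equation (5) is not reproduced in this text, so the score P = 2·m_lr / (m_l + m_r) used here is an assumed stand-in, chosen only because it lies between 0 and 1 and grows with the number of agreeing neighbors. The radius, threshold, baseline (120 mm) and focal length (800 pixels) are illustrative values, not the calibration results of this study, and all function names are hypothetical.

```python
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def match_probability(pairs, index, phi):
    """Score one initial match by the agreement of nearby matches.

    pairs: list of ((row_l, col_l), (row_r, col_r)) initial matches.
    Assumed score: P = 2 * m_lr / (m_l + m_r), not the source's Equation (5).
    """
    left, right = pairs[index]
    near_left = [_dist(left, pl) <= phi for pl, _ in pairs]
    near_right = [_dist(right, pr) <= phi for _, pr in pairs]
    m_l = sum(near_left) - 1      # neighbors on the left, excluding itself
    m_r = sum(near_right) - 1     # neighbors on the right, excluding itself
    m_lr = sum(a and b for a, b in zip(near_left, near_right)) - 1
    if m_l + m_r == 0:
        return 0.0
    return 2.0 * m_lr / (m_l + m_r)

def depth(x_left, x_right, baseline, focal_px):
    """Z = 2a * f / |x_r - x_l|, in the units of `baseline`."""
    disparity = abs(x_right - x_left)
    if disparity == 0:
        return math.inf           # zero parallax: point at infinity
    return baseline * focal_px / disparity

# Three consistent matches and one outlier whose right point is far away
pairs = [((10, 350), (10, 302)), ((12, 355), (12, 307)),
         ((15, 348), (15, 300)), ((11, 352), (11, 60))]
kept = [p for i, p in enumerate(pairs)
        if match_probability(pairs, i, phi=20.0) >= 0.5]
depths_mm = [depth(pl[1], pr[1], baseline=120.0, focal_px=800.0)
             for pl, pr in kept]
```

In this toy set the outlier has neighbors around its left point but none around its right point, so its score is 0 and it is dropped. The three remaining pairs share a parallax of 48 pixels and therefore the same depth. Interpolating and fitting a surface through such depths, as the text describes, is left out of the sketch.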

Results and discussion
To validate the effectiveness of the proposed algorithm, 60 pairs of farmland road images were acquired with the calibrated test platform described above. The images were acquired under different weather conditions (sunny, partly cloudy and cloudy) and road conditions (straight road, curved road, uphill and downhill ramps, and watery road). Figure 8 shows some of the farmland road images acquired under these conditions. Depth was also measured at the test sites and compared with the experimental results to validate the performance of the proposed method. Figure 9a is the sample image from the left CCD camera in Figure 1, on which ten manually measured depth points are labeled, and Figures 9b and 9c are the contour map of depth information and the 3D depth map after 3D depth reconstruction, respectively. The projection size is the same as that of the sample image. The value at a given point in Figures 9b and 9c is the distance between the corresponding point on the road surface and the center point of the binocular camera. Tables 2 and 3 summarize the matching quality of the left and right images in Figure 1 and the error statistics of depth recovery, respectively.
Table 2 Comparison of matching quality for left and right images
To test the versatility of the proposed algorithm, feature point matching and depth recovery experiments were conducted on the 60 pairs of farmland road images. Figure 10 compares the feature point distributions extracted by the conventional and improved SIFT algorithms from some of the images in Figure 8. Table 4 shows the statistics for the number and uniformity of the feature points, and Table 5 shows the comprehensive matching quality statistics for these 60 pairs of images using the improved SIFT algorithm.
The following conclusions can be drawn from the experimental analysis:
(1) Table 2 shows that the conventional and improved SIFT algorithms produced 755 and 825 successfully matched point pairs, respectively, with feature point utilization ratios of 20.51% and 25.03%, indicating that the proposed method used the extracted feature points more efficiently. Although both methods had relatively high matching accuracy, the proposed method took 5.60 s, approximately two fifths of the time taken by the conventional SIFT algorithm, which significantly improved the real-time performance of feature point matching.
(2) As shown in Figure 9 and Table 3, the depth reconstruction error of the calibrated binocular stereo vision system was within -7.17% to 2.97%. Camera distortion was smaller where the farmland road was closer to the camera, resulting in smaller error, and the error was within ±5% when the depth was within 2 m. These results show that the proposed method can accurately restore the 3D landform of the farmland road, which provides a basis for visual navigation and vehicle control.
(3) As shown in Figure 10 and Table 4, the uniformities of the feature point distributions obtained with the conventional and improved SIFT algorithms were 33.82%-56.33% and 65.67%-76.16%, respectively, when the density of feature points was basically the same (around 4000 feature points), indicating that the proposed method performed much better. Moreover, the uniformity for the conventional SIFT algorithm was only 33.82% when the road was relatively straight (Figure 10a4), much lower than that obtained by the improved SIFT algorithm (71.53%).
(4) Table 5 shows that the mean uniformity, time consumption and matching accuracy of the feature points obtained with the improved SIFT algorithm were 50%-70%, 6-6.5 s, and higher than 88%, respectively. Road and weather conditions had a great influence on the matching quality. Straight roads and sunny weather gave better matching quality. Bad weather, however, made it harder to acquire high-quality images, and curved or ramp roads caused large differences in road features between the left and right images, with some feature points falling in regions of large camera distortion, which reduced the overall matching accuracy.

Conclusions
In this study, the mean image of the pixel values in five channels (R, G, B, S and V) was treated as the inspected image, on the basis of image color, color saturation and brightness information. The Canny operator was used to obtain the edge points of the image, which were then treated as the feature points. Combined with the conventional SIFT algorithm, the method was finally used for depth recovery of farmland roads. The results showed that the uniformity and utilization of the feature points obtained by the improved SIFT algorithm were superior to those of the conventional SIFT algorithm. The average time consumption of the improved algorithm was 5.60 s, approximately two fifths of that of the conventional algorithm. The mean error of depth recovery for the farmland road image was within -7.17% to 2.97%, satisfying the requirements of visual navigation on farmland roads. The improved SIFT algorithm performed well in sunny weather and on straight roads. However, both the matching quality and the depth recovery accuracy need to be improved under adverse conditions, such as cloudy weather and ramp roads. Moreover, the accuracy of the road model should be verified by laser scanning technology to bring the farmland road model closer to the real road surface.