Tracking and recognition algorithm for a robot harvesting oscillating apples

Apples on trees tend to swing because of wind and other natural causes, which reduces the picking accuracy of harvesting robots. To increase the accuracy and speed of apple tracking and recognition, a tracking and recognition method combined with an affine transformation was proposed. The method can be divided into three steps. First, the initial image was segmented by Otsu's thresholding method based on the two times Red minus Green minus Blue (2R-G-B) color feature; after the binary image was refined, the apples were recognized with a local parameter adaptive Hough circle transformation method, which improved recognition accuracy and avoided the long running time and excessive fitted circles of the traditional Hough circle transformation. The process and results were verified experimentally. Second, the Shi-Tomasi corners detected and extracted from the first frame were tracked, and the corners with large forward-backward optical flow errors were removed. The affine transformation matrix between the two frames was calculated with the Random Sample Consensus (RANSAC) algorithm to correct the scale of the template image and predict the apple positions. Third, the best positions of the target apples were searched within 1.2 times the predicted area with a de-mean normalized cross-correlation template matching algorithm. The test results showed that, without template correction, the running time per frame was 25 ms and the tracking error was more than 8%; without apple position prediction, the running time was 130 ms and the tracking error was more than 20%. In comparison, the running time of our algorithm was 25 ms, and the tracking error was less than 4%. These results indicate that tracking accuracy and efficiency can be greatly improved by our method, and the strategy can also provide a reference for tracking and recognizing other oscillating fruits.


Introduction
Because manual picking for large-scale apple harvesting is time-consuming and labor-intensive, apple-picking robots are being developed, for which machine vision is one of the most important aspects. The accuracy and efficiency of apple recognition and tracking are very important for effective implementation. Consequently, fruit recognition has been studied widely by researchers all over the world. The commonly used segmentation and recognition algorithms mainly include the threshold segmentation method [1][2][3][4][5][6], K-means clustering algorithm [7], Fuzzy C-means algorithm (FCM) [8], region growing method [9], Least Squares SVM algorithm (LS-SVM) [10], Artificial Neural Network (ANN) [11], K-nearest Neighbor method (KNN) [12], and combinations of multiple algorithms [13][14][15][16]. Algorithms based on the convex hull [17][18][19], concave points [20,21], Hough transform [22], chain code, and other methods have also been used to identify overlapping apples.
However, the above studies all dealt with static objects, so they are difficult to apply effectively under practical conditions. In actual picking, wind and the picking action itself can cause apples to swing and change position. Picking must therefore wait until the apples are still, or the apples must be imaged repeatedly, in order to identify and locate them accurately, which significantly reduces picking efficiency. In view of these problems, Zhao et al. [23] and Lyu et al. [24] used a template matching method to track the apple position based on the association between the initial and later images, but it was reported that the extracted templates could not cope with the changes in image scale and angle as the camera approached the apple, resulting in large tracking errors. Feng et al. [25] used SURF features and a RANSAC model to solve the affine transformation parameters, obtained the target scale information, and used a compressive tracking algorithm to track the target fruit. Because the algorithm needed compressed features to describe the target fruit, it was difficult to ensure real-time performance.
In this study, a novel tracking and recognition method for an apple-picking robot was developed. A camera mounted on the execution end of the system was used to acquire images, and a local parameter adaptive Hough transform was used to identify the apple locations in the initial image. In the follow-up images, the correlation between images was used to track the apples. First, the pyramid Lucas-Kanade (LK) optical flow algorithm was used to track the Shi-Tomasi corners; then the affine transformation matrix between the earlier and later frames was calculated and used to correct the template and predict the approximate position of each apple. Finally, template matching was performed around the predicted area to track the position of the apple.

Apple recognition
Due to limited experimental conditions and orchard availability, a model apple tree in a dwarf planting form was used to simulate trees in a laboratory environment. For the apple-picking process, the robot must first accurately recognize the apples in the images.

Image preprocessing
There is noise in the images collected by the image acquisition equipment. In order to reduce the noise and preserve the edges of the apples as much as possible, median filtering was used on the image.
Because mature apples are red, they differ clearly in color from other objects in the image. To improve the distinction between apples and background, the three-channel RGB images were transformed into single-channel gray images, which are more convenient for the subsequent algorithms. The two times Red minus Green minus Blue (2R-G-B) color feature was used to process the original images, and a typical result is shown in Figure 1.
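The 2R-G-B feature can be illustrated with a minimal sketch. The function name `color_difference_2rgb`, the nested-list image representation, and the clamping of the result to the 8-bit range [0, 255] are assumptions made for illustration; the paper does not state how out-of-range values were handled.

```python
def color_difference_2rgb(image):
    """Map an RGB image (rows of (r, g, b) tuples, 0-255) to a 2R-G-B gray image.

    Values are clamped to [0, 255]; the clamping is an assumption of this sketch.
    """
    gray = []
    for row in image:
        gray.append([max(0, min(255, 2 * r - g - b)) for (r, g, b) in row])
    return gray


# Illustrative pixels: a red apple pixel, a green leaf pixel, a gray branch pixel.
sample = [[(200, 40, 30), (60, 150, 50), (120, 120, 120)]]
gray_sample = color_difference_2rgb(sample)
print(gray_sample)
```

Red pixels saturate toward white while green and gray pixels fall to zero, which is why the apple stands out against an almost black background in the transformed image.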

Image segmentation
Threshold-based segmentation is a common method for separating objects, here apples, from their background. After preprocessing, the gray levels of apple and background differed dramatically, as demonstrated by the image histogram, which shows obvious peaks and troughs (Figure 2). To improve robustness to illumination, Otsu's method [26] was used to select the segmentation threshold. Otsu's method obtains the best threshold for the image histogram by searching over the candidate thresholds for the one at which the intra-class variance is minimal and, equivalently, the inter-class variance is maximal. The result is shown in Figure 3.
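The threshold search described above can be sketched as follows. This is a plain exhaustive version of Otsu's method over a 256-bin histogram, written for illustration; the function name `otsu_threshold` and the convention that pixels strictly above the threshold are foreground are assumptions, and the paper's implementation presumably relied on OpenCV instead.

```python
def otsu_threshold(gray):
    """Return the threshold t in 0..255 that maximizes the inter-class variance.

    Pixels with value <= t form class 0 (background), the rest form class 1.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of gray values in class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Illustrative gray image with a dark background and three bright "apple" pixels.
gray_demo = [[10, 12, 11, 200], [9, 210, 205, 10]]
t_demo = otsu_threshold(gray_demo)
binary_demo = [[1 if v > t_demo else 0 for v in row] for row in gray_demo]
print(t_demo, binary_demo)
```

Maximizing the inter-class variance is equivalent to minimizing the intra-class variance, so only one of the two quantities needs to be evaluated for each candidate threshold.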

Image perfection
There were many defects in the binary images of the apples, which would hinder apple recognition. In this study, a hole filling algorithm was used to fill the holes formed by the calyx at the bottom of the apple. An opening operation with a circular structuring element of radius 3 pixels was applied to remove small noise. After these steps, the binary image might still retain regions that are not apples but other objects such as branches and leaves, which appear as non-circular irregular regions. For this reason, the length-to-width ratio L of the minimum bounding rectangle of each connected region was computed from W and H, where W is the width and H is the height of the minimum bounding rectangle, and the connected regions with L>2.5 were removed.
At the same time, apples far from the camera show up as small circular connected regions. Therefore, an area threshold S_T was derived from S_max, the area of the largest connected region, and the connected regions with area S<S_T were removed.
To prevent the apple tree or the machine body from being damaged when picking apples that are seriously blocked by branches and leaves, F was defined as the filling rate of the connected region, that is, the ratio of S_0 to S_R, where S_0 is the area of the connected region and S_R is the area of its minimum bounding rectangle; the connected regions with F<0.5 were removed. The improved binary image is shown in Figure 4.
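The three region filters can be summarized in a short sketch. Several points are assumptions rather than details from the paper: the ratio L is taken as the longer side divided by the shorter side, the filling rate F as the region area divided by the bounding rectangle area, and the area threshold is passed in as a parameter `area_threshold` because its relation to S_max is not recoverable from the text.

```python
def keep_region(width, height, area, area_threshold,
                max_aspect=2.5, min_fill=0.5):
    """Return True if a connected region passes the shape, size and fill filters.

    width, height : sides of the minimum bounding rectangle (pixels)
    area          : area of the connected region (pixels)
    area_threshold: S_T, supplied by the caller (hypothetical parameter)
    """
    aspect = max(width, height) / min(width, height)  # L, assumed long/short side
    fill = area / (width * height)                    # F, assumed area ratio
    return aspect <= max_aspect and area >= area_threshold and fill >= min_fill


# Illustrative regions: (width, height, area)
regions = {
    "apple": (40, 42, 1300),    # roughly round and well filled
    "branch": (100, 20, 1500),  # elongated, L = 5
    "far_apple": (8, 8, 50),    # too small
    "occluded": (40, 40, 600),  # F = 0.375
}
kept = [name for name, (w, h, s) in regions.items()
        if keep_region(w, h, s, area_threshold=200)]
print(kept)
```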

Apple recognition
According to reference [27], because of different camera angles, differences in the natural environment, varied apple shapes, and defects such as occlusion and overlap, the Hough circle transformation algorithm was proposed to fit and extract apple shapes.
However, depending on the distance and placement of the camera, the sizes of the apples in an image might be significantly different, making it challenging to provide accurate parameters for the Hough circle transformation algorithm. Also, this could easily result in too many output fitting circles, leading to algorithm failure, and it could take a long time to perform the global Hough transform, consequently having poor real-time performance.
In order to solve the above problems, a local parameter adaptive Hough transform method was proposed. The algorithm was as follows:
Step 1: The Hough gradient method was used to find circles within the bounding rectangle of each connected region of the image. With the radius R taken as 0.5 times the width of the smallest bounding rectangle of the connected region, the parameters of the Hough transform were: maximum radius R_max=R×1.2, minimum radius R_min=R/1.2, and minimum center distance D=R_min.
Step 2: Circles with a center pixel value of 0, area S<S_T, or filling ratio F<0.5 were removed. The remaining circles were regarded as the correct apple targets for extraction, and the algorithm ended.
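The parameter derivation in Step 1 can be sketched as below. The function name `local_hough_params` and the dictionary keys are hypothetical; in an OpenCV-based implementation the three values would correspond to the `maxRadius`, `minRadius` and `minDist` arguments of `cv2.HoughCircles`, which expects integer radii, so rounding would be needed there.

```python
def local_hough_params(rect_width):
    """Derive per-region Hough circle parameters from the bounding rectangle width."""
    r = 0.5 * rect_width          # R: half the bounding rectangle width
    r_min = r / 1.2               # R_min
    return {
        "r_max": r * 1.2,         # R_max
        "r_min": r_min,
        "min_dist": r_min,        # D = R_min
    }


params = local_hough_params(60)
print(params)
```

Because the radius range is tied to each connected region, the search is restricted to a narrow band around the expected apple size instead of a single global range for the whole image.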
The effect of apple recognition using this algorithm is shown in Figure 5.
In target tracking, corners are a common tracking feature with rotation invariance and robustness to illumination. In this paper, Shi-Tomasi corners [28] were extracted from the gray-scale image for subsequent tracking, and the overall extraction result is shown in Figure 7. Corner points are represented by black dots in the image.

Corner feature tracking
There is an association of information between the earlier and later images captured by the system. Using this association, the target apple can be tracked more quickly and accurately. In this paper, the pyramid LK optical flow method was used to calculate the positions of the Shi-Tomasi corners in the subsequent frame, which were then used to calculate the affine transformation matrix.
As the image content moves, the optical flow is the velocity of each pixel in the X and Y directions. The LK optical flow algorithm rests on three assumptions: 1. constant brightness; 2. the time interval is short or the movement is small; 3. the pixels in the same neighborhood have the same motion. Based on the first two assumptions, the constraint equation of the two-dimensional optical flow can be obtained:
$$I_x u + I_y v + I_t = 0$$
where I_x, I_y and I_t are the partial derivatives of the image intensity with respect to x, y and time, and (u, v) is the optical flow. However, this single equation has two unknowns and cannot be solved from one point alone. Based on the third assumption, the least squares method was used to solve for (u, v) over all n points in the neighborhood of a corner point:
$$\begin{bmatrix} u \\ v \end{bmatrix} = (A^{T}A)^{-1}A^{T}b, \quad A = \begin{bmatrix} I_{x1} & I_{y1} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{bmatrix}, \quad b = -\begin{bmatrix} I_{t1} \\ \vdots \\ I_{tn} \end{bmatrix}$$
Because the LK optical flow relies on these three assumptions, it is only suitable when the time interval is extremely short or the displacement is extremely small. To satisfy this strict displacement constraint, an image pyramid is constructed and the flow is solved successively from the top level of the pyramid downward, which improves the accuracy of the optical flow. The result of tracking the Shi-Tomasi corners with pyramid LK optical flow is shown in Figure 8. The black lines in the figure connect the positions of the same corner in the two frames, and the black dots indicate the positions of the corners in the current frame.
For real-world data, there might be errors in the tracking of pyramid LK optical flow. To improve the robustness of tracking, the forward-backward optical flow error method was used to eliminate the unstable corner points.
The optical flow method was first used to track the corner points forward, and then used again to track them backward to the original image. If the position of a corner point changed by more than 0.01 pixels between its original position and its back-tracked position, the corner point was considered unstable and rejected. The image after eliminating the unstable corner points is shown in Figure 9.
Figure 9 Unstable corners removed
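The forward-backward check can be sketched independently of the optical flow computation itself. The sketch assumes the forward and back-tracked positions are already available (in OpenCV they could come from two calls to `cv2.calcOpticalFlowPyrLK`); the function name and the use of the Euclidean distance as the error measure are assumptions made for illustration.

```python
import math


def filter_by_forward_backward_error(original, forward, back_tracked,
                                     max_error=0.01):
    """Keep corner pairs whose back-tracked position returns to the start.

    original     : corner positions in the first frame
    forward      : the same corners tracked into the next frame
    back_tracked : the forward positions tracked back into the first frame
    Returns a list of (original, forward) pairs for the stable corners.
    """
    stable = []
    for p0, p1, p_back in zip(original, forward, back_tracked):
        error = math.hypot(p0[0] - p_back[0], p0[1] - p_back[1])
        if error <= max_error:
            stable.append((p0, p1))
    return stable


corners = [(10.0, 10.0), (20.0, 20.0)]
tracked = [(12.0, 11.0), (25.0, 22.0)]
returned = [(10.005, 10.0), (20.5, 20.0)]   # second corner drifts by 0.5 px
stable_pairs = filter_by_forward_backward_error(corners, tracked, returned)
print(stable_pairs)
```

Only the surviving pairs would then be passed on to the estimation of the affine transformation matrix.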

Matching recognition
Template matching, the most primitive and basic pattern recognition method, can be used to find the position of the target apple template in the subsequent image. The de-mean normalized cross-correlation is computed as follows:
$$R(x, y) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left[T(i,j)-\bar{T}\right]\left[I(x+i,y+j)-\bar{I}\right]}{\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left[T(i,j)-\bar{T}\right]^{2}\,\sum_{i=1}^{M}\sum_{j=1}^{N}\left[I(x+i,y+j)-\bar{I}\right]^{2}}}$$
where M and N are the template sizes; T is the template pixel value; I is the image pixel value to be matched; $\bar{T}$ is the template average pixel value; and $\bar{I}$ is the average pixel value of the image region to be matched.
Template matching has limitations, predominantly because it can only handle translation. If the matching target in the image rotates or changes in scale, the algorithm fails. Moreover, because there are many similar apples in the image, if the matching area is not well predicted, one apple template could easily be matched with another similar apple, leading to tracking failure.
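A brute-force sketch of de-mean normalized cross-correlation matching is given below for illustration. The function names, the nested-list image representation, and the choice to score constant patches as 0 are assumptions; an OpenCV implementation would more likely use `cv2.matchTemplate` with the `TM_CCOEFF_NORMED` mode.

```python
import math


def demeaned_ncc(template, patch):
    """De-mean normalized cross-correlation between two equally sized patches."""
    t_vals = [v for row in template for v in row]
    p_vals = [v for row in patch for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    p_mean = sum(p_vals) / len(p_vals)
    num = sum((t - t_mean) * (p - p_mean) for t, p in zip(t_vals, p_vals))
    den = math.sqrt(sum((t - t_mean) ** 2 for t in t_vals)
                    * sum((p - p_mean) ** 2 for p in p_vals))
    return num / den if den > 0 else 0.0  # constant patch: no correlation defined


def best_match(image, template):
    """Slide the template over the image; return (best score, (x, y) of top-left)."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = demeaned_ncc(template, patch)
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_score, best_pos


# The template appears at (x=2, y=1) with a brightness offset of +10.
image_demo = [
    [10, 10, 10, 10, 10],
    [10, 10, 60, 110, 10],
    [10, 10, 110, 60, 10],
    [10, 10, 10, 10, 10],
]
template_demo = [[50, 100], [100, 50]]
score_demo, pos_demo = best_match(image_demo, template_demo)
print(score_demo, pos_demo)
```

Subtracting the means makes the score insensitive to a uniform brightness offset, which is why the shifted copy of the template is still matched with a score of approximately 1.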
In this paper, an affine transformation matrix was used to solve these problems in template matching. The matrix form of the affine transformation is:
$$M = \begin{bmatrix} \rho\cos\theta & -\rho\sin\theta & t_x \\ \rho\sin\theta & \rho\cos\theta & t_y \end{bmatrix}$$
where M is the affine transformation matrix; ρ is the scale factor; θ is the rotation angle; t_x is the x-direction translation; and t_y is the y-direction translation. In general, the affine transformation matrix can be solved from only three pairs of points, but in practice there are usually more than three pairs, some of which may be outliers. Therefore, this paper solved for the affine transformation matrix with the random sample consensus (RANSAC) algorithm.
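As an illustration of the RANSAC idea, the sketch below estimates the four parameters (ρ, θ, t_x, t_y) from matched corner pairs. It is not the paper's implementation: it assumes the four-parameter form, for which two point pairs form a minimal sample (rather than the three pairs needed for a general affine matrix), it represents points as complex numbers so that the transform is w = a·z + b with a = ρe^{iθ}, and the iteration count, inlier tolerance and seed are arbitrary illustrative values.

```python
import cmath
import random


def fit_similarity(src, dst):
    """Least-squares fit of w = a*z + b for complex points (needs >= 2 distinct points)."""
    n = len(src)
    z_mean = sum(src) / n
    w_mean = sum(dst) / n
    den = sum(abs(z - z_mean) ** 2 for z in src)
    a = sum((z - z_mean).conjugate() * (w - w_mean)
            for z, w in zip(src, dst)) / den
    return a, w_mean - a * z_mean


def ransac_similarity(src_pts, dst_pts, iterations=200, tol=1.0, seed=0):
    """Return (rho, theta, tx, ty, inlier_indices) estimated with RANSAC."""
    rng = random.Random(seed)
    src = [complex(x, y) for x, y in src_pts]
    dst = [complex(x, y) for x, y in dst_pts]
    best_inliers = []
    for _ in range(iterations):
        i, j = rng.sample(range(len(src)), 2)      # minimal sample: two pairs
        if src[i] == src[j]:
            continue
        a, b = fit_similarity([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in range(len(src))
                   if abs(a * src[k] + b - dst[k]) < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consistent model found")
    # Refit on all inliers of the best model.
    a, b = fit_similarity([src[k] for k in best_inliers],
                          [dst[k] for k in best_inliers])
    return abs(a), cmath.phase(a), b.real, b.imag, best_inliers


# Six corners moved by a known transform, plus two gross mismatches (outliers).
true_a, true_b = cmath.rect(1.1, 0.05), complex(3, -2)
src_demo = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 3), (7, 8)]
dst_demo = [((true_a * complex(x, y) + true_b).real,
             (true_a * complex(x, y) + true_b).imag) for x, y in src_demo]
src_demo += [(2, 2), (8, 1)]
dst_demo += [(50, 50), (-30, 40)]
rho_d, theta_d, tx_d, ty_d, inliers_d = ransac_similarity(src_demo, dst_demo)
print(rho_d, theta_d, tx_d, ty_d, inliers_d)
```

The two mismatched pairs never gather enough support to win, so the final least-squares refit uses only the six consistent corners.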
The scale factor ρ and rotation angle θ can be decomposed from the affine transformation matrix, and they are used to correct the template and solve the scale and rotation problems in the template matching process.
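A minimal sketch of this decomposition is shown below, assuming the 2×3 matrix has the form [[ρcosθ, −ρsinθ, t_x], [ρsinθ, ρcosθ, t_y]] (such a matrix could, for example, be produced by OpenCV's `cv2.estimateAffinePartial2D`). The function names and the rounding of the corrected template size to whole pixels are illustrative assumptions.

```python
import math


def decompose_affine(m):
    """Return (rho, theta) from a 2x3 matrix [[a, -b, tx], [b, a, ty]]."""
    rho = math.hypot(m[0][0], m[1][0])
    theta = math.atan2(m[1][0], m[0][0])
    return rho, theta


def corrected_template_size(size, rho):
    """Scale a (width, height) template size by rho, rounded to whole pixels."""
    return (round(size[0] * rho), round(size[1] * rho))


m_demo = [[1.25 * math.cos(0.1), -1.25 * math.sin(0.1), 4.0],
          [1.25 * math.sin(0.1), 1.25 * math.cos(0.1), -2.0]]
rho_demo, theta_demo = decompose_affine(m_demo)
size_demo = corrected_template_size((40, 40), rho_demo)
print(rho_demo, theta_demo, size_demo)
```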
At the same time, the approximate position of the template image in the current frame can be predicted by using the affine transformation matrix, as shown in Equation (9):
$$\begin{bmatrix} x_c \\ y_c \end{bmatrix} = \begin{bmatrix} \rho\cos\theta & -\rho\sin\theta & t_x \\ \rho\sin\theta & \rho\cos\theta & t_y \end{bmatrix} \cdot \begin{bmatrix} x_r \\ y_r \\ 1 \end{bmatrix} \tag{9}$$
which is the same as:
$$x_c = \rho\,(x_r\cos\theta - y_r\sin\theta) + t_x, \qquad y_c = \rho\,(x_r\sin\theta + y_r\cos\theta) + t_y \tag{10}$$
where (x_r, y_r) is the coordinate of the apple center in the previous frame and (x_c, y_c) is the coordinate of the apple center in the current frame.
In this study, based on the apple position predicted by Equation (10), a de-mean normalized cross-correlation algorithm was used to match the template within a range of 1.2 times the size of the template. This not only solved the problem of possible mismatches but also reduced the scope of template matching and improved the algorithm speed. A range of 1.2 times the template size was found to be optimal, but the range could be adjusted according to the intensity of apple oscillation.
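The prediction and the restricted search area can be sketched as follows. The sketch assumes the rotation sign convention x_c = ρ(x_r cosθ − y_r sinθ) + t_x, y_c = ρ(x_r sinθ + y_r cosθ) + t_y, and it assumes that the search window is centered on the predicted position; the function names and the `factor` parameter are hypothetical.

```python
import math


def predict_center(rho, theta, tx, ty, x_r, y_r):
    """Predict the apple center in the current frame from the previous center."""
    x_c = rho * (x_r * math.cos(theta) - y_r * math.sin(theta)) + tx
    y_c = rho * (x_r * math.sin(theta) + y_r * math.cos(theta)) + ty
    return x_c, y_c


def search_window(center, template_size, factor=1.2):
    """Return (x_min, y_min, x_max, y_max) of a window factor times the template size."""
    half_w = factor * template_size[0] / 2
    half_h = factor * template_size[1] / 2
    return (center[0] - half_w, center[1] - half_h,
            center[0] + half_w, center[1] + half_h)


center_demo = predict_center(1.0, 0.0, 5.0, -3.0, 100.0, 80.0)
window_demo = search_window(center_demo, (40, 40))
print(center_demo, window_demo)
```

A larger `factor` tolerates stronger oscillation at the cost of a larger matching area and a higher risk of locking onto a neighboring apple.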

Algorithm flow
A flow chart of the fast tracking algorithm applied to each apple target between two frames is shown in Figure 10.

Experimental condition
The algorithm was developed on Ubuntu 16.04, on a machine equipped with an Intel i5-4210M 2.6 GHz CPU and 4 GB of memory. The OpenCV 3.4.3 open source vision library was the main tool used to implement the algorithm. The camera was installed on the execution end. To improve processing speed, the collected 1920×1080-pixel images were downscaled to 640×360 pixels. In the laboratory environment, simulated dwarf apple trees were used.

Fast recognition of apples
In order to verify the accuracy of apple recognition, various growth postures of apples in the natural environment, such as occlusion and overlap, were simulated. Thirty images of apple trees with different angles and placement were taken. The average running time of the recognition algorithm was 72 ms. The recognition success rate was 91.3% in the laboratory setting, though challenges remain in extrapolating to larger numbers of pictures due to the limited experimental conditions. Representative samples of the recognition results are shown in Figures 11a-11d.
Figure 11 Representative results for fast recognition algorithm for apples
In most cases, the influence of occlusion and overlap in the process of apple recognition can be overcome by the algorithm. Sometimes, as shown in Figure 11c, the three apples close to the border on the right overlap and are not correctly identified by the algorithm.
This was caused by an error in the calculation of the radius parameter range in the local parameter adaptive Hough transform method. When three or more fruits overlap and their arrangement makes the length-width ratio L of the minimum bounding rectangle close to 1, the algorithm takes half the width of the minimum bounding rectangle as the approximate apple radius, which overestimates the radius of the individual apples and results in recognition errors.

Fast tracking of apples
In order to evaluate the performance of this tracking algorithm, a number of experiments were carried out. The apple position extracted by the recognition algorithm was taken as the true value of the apple position. The ratio of the root mean square error (RMSE) to the diameter of the apple extracted from the first frame was used as the criterion of tracking error (Equation (11)), where RMSE is the tracking root mean square error; N is the number of frames; (x_it, y_it) is the apple centroid coordinate obtained by the tracking algorithm; (x_ir, y_ir) is the apple centroid coordinate obtained by the recognition algorithm; s_i is the ratio of the scale of frame i to the scale of the first frame; error is the tracking error; and d is the apple diameter in the first frame.
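Since the exact form of Equation (11) is not recoverable here, the following is only a sketch of one plausible reading: the per-frame position error is divided by the scale ratio s_i to bring it back to first-frame pixels, the RMSE is taken over the N frames, and the result is divided by the first-frame diameter d. The way s_i enters the formula is an assumption, as is the function name.

```python
import math


def tracking_error(tracked, reference, scales, diameter):
    """Relative tracking error: scale-normalized RMSE divided by apple diameter.

    tracked   : (x_it, y_it) centroids from the tracking algorithm
    reference : (x_ir, y_ir) centroids from the recognition algorithm
    scales    : s_i, scale of frame i relative to the first frame
    diameter  : d, apple diameter in the first frame (pixels)
    """
    n = len(tracked)
    squared = 0.0
    for (xt, yt), (xr, yr), s in zip(tracked, reference, scales):
        # Assumed normalization: express the error in first-frame pixels.
        squared += ((xt - xr) ** 2 + (yt - yr) ** 2) / s ** 2
    rmse = math.sqrt(squared / n)
    return rmse / diameter


err_demo = tracking_error(
    tracked=[(103.0, 100.0), (200.0, 208.0)],
    reference=[(100.0, 100.0), (200.0, 200.0)],
    scales=[1.0, 2.0],
    diameter=50.0,
)
print(f"{100 * err_demo:.2f}%")
```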

Fast tracking of static apples
When there is no wind or other adverse weather, apples are almost motionless. Thus, it is important to examine whether the algorithm is compatible with static apple tracking and recognition.
In a laboratory environment, a set of apples was kept still, and the image sequences were acquired while the execution end of the camera system gradually approached the apples. The tracking of the static apples is shown in Figures 12a-12d, and the tracking errors are listed in Table 1. The tracking error of each apple was about 2%, which means the results are very close to the apple positions obtained by the recognition algorithm and meet the accuracy requirements of automated apple picking. For the static apples, apple positions could be predicted more accurately. After template matching correction, the error could be kept within a very small range.
Error source analysis: This paper used an affine transformation matrix method to correct the template and predict the approximate position of apples.
Although the accuracy of the affine transformation matrix has been improved by multiple means in this paper, some small errors, which accumulate with the running of the program, are unavoidable. However, since only a few frames needed to be tracked from the recognition to the final approach to the apple, cumulative errors in the affine transformation matrix had little effect on the template correction. At the same time, having a range of 1.2 times of the area to be searched also resulted in enough redundancy to mitigate the position prediction errors.

Fast tracking of oscillatory apples
Apples tend to swing under the influence of factors such as wind and the picking action. To simulate these effects, the apple tree was shaken horizontally by hand. The image sequences were captured as the execution end of the camera system approached the apples. The tracking results are shown in Figure 13 and Table 2. Apple numbers were assigned from left to right in the horizontal direction. It is apparent that, when tracking swinging apples, the system accumulates tracking errors to a certain extent. The main reason is that, on top of the static apple tracking error, an apple in motion can rotate by a small angle while swinging, which causes a certain degree of error in template matching. However, because the apple template used in this paper was extracted from a 2R-G-B color difference image, in which the background is almost black and the apple is round, the apple looks nearly the same before and after a rotation, so the template is only minimally impaired. Although the tracking error increases to a certain extent, it remains below 4%, which meets the accuracy requirements for automated apple picking.

Algorithm comparison experiment
In order to show the superiority of this algorithm relative to other tracking methods, it was compared with two other algorithms on the same video sequence:
(1) Algorithm 1: the algorithm in this paper.
(2) Algorithm 2 [24,25]: the template image was not corrected; the rest was consistent with the algorithm in this paper.
(3) Algorithm 3: the apple position was not predicted; the rest was consistent with the algorithm in this paper.
The operational effectiveness of algorithm 1 is shown in section 4.3.2.
The effectiveness of algorithm 2 for tracking oscillating apples is shown in Figures 14a-14d. The running time of the algorithm was about 25 ms. The tracking error was larger than 8%, as shown in Table 3. The reason for the increase in the tracking error is that, as the camera approached the apples, the image scale became larger and larger while the template image scale was not modified; the resulting large gap between the image scale and the template image scale led directly to a large error in template matching. The comparison of the tracking errors of the three algorithms is shown in Figure 16. It can be seen that the algorithm in this paper has the smallest error and the best stability of the three, indicating that it is the most suitable for a stable and robust apple-picking robot.

Conclusions
(1) In this paper, an algorithm for tracking and recognizing oscillating apples, which can be readily implemented in apple-picking robots, was proposed. In this method, the first image was median filtered, a 2R-G-B color difference image was calculated, and then Otsu's method was used for image segmentation. In order to overcome the problems of the global Hough transform, a local parameter adaptive Hough transform was used to recognize the apples. Experiments showed that the average running time of the apple recognition algorithm in the laboratory environment was 72 ms, and the recognition success rate was 91.3%.
(2) In order to track target apples, the algorithm extracted the apple template image and the Shi-Tomasi corners from the first frame. In the following image, the pyramid LK optical flow method was used to track the corner points, and the forward-backward optical flow error method was used to remove the corner points with large tracking errors.
Then, an affine transformation matrix was calculated with the random sample consensus (RANSAC) algorithm and used to correct the template scale and angle and to predict the approximate positions of the apples. Finally, the best positions of the apples were searched within 1.2 times the predicted apple regions by de-mean normalized cross-correlation template matching. The experimental results showed that the algorithm could track the apples well regardless of whether they were still or oscillating. The average running time of the algorithm was 25 ms, the tracking error for static apples was less than 2%, and the tracking error for oscillating apples was less than 4%.
Compared with other algorithms, the real-time performance and accuracy of the algorithm were greatly improved, thus meeting the requirements for fast and accurate apple picking robots.
(3) At present, there persist some defects in the algorithm that need further improvement. First, when the recognition algorithm is faced with multiple overlapping apples, if the apples are not arranged in a straight line, it is easy to obtain recognition errors. Second, there are accumulating errors in the affine transformation matrix, which is the main source of tracking errors. Third, this paper has not yet tested the robustness of the algorithm in different lighting environments. However, because the pyramid LK optical flow is robust in different lighting conditions, and a variety of methods have been applied to minimize the tracking error, it can be reasonably speculated that the algorithm in this paper will perform well in different lighting conditions.