Classification of rice seed variety using point cloud data combined with deep learning

Rice variety selection and quality inspection are key links in rice planting. Compared with two-dimensional images, three-dimensional information on rice seeds shows their appearance characteristics more comprehensively and accurately. This study proposed a rice variety classification method that combines three-dimensional point cloud data of the rice seed surface with a deep learning network to achieve rapid and accurate identification of rice varieties. First, a point cloud collection platform was set up with a Raytrix light field camera as the core to collect three-dimensional point cloud data on the surface of rice seeds. The collected point cloud was then filled, filtered and smoothed, segmented with the RANSAC algorithm, and downsampled with a combination of the random sampling algorithm and the voxel grid filtering algorithm. Finally, the processed point cloud was input to the improved PointNet network for feature extraction and variety classification. The improved PointNet network added a cross-level feature connection structure, which made fuller use of features at different levels and better extracted the surface structure features of rice seeds. In testing, the improved PointNet model had an average classification accuracy of 89.4% for eight varieties of rice, which was 1.2% higher than that of the PointNet model. The method proposed in this study combined deep learning and point cloud data to achieve the efficient classification of rice varieties.


Introduction 
Rice is a major staple food in China and many other countries and is widely grown. The inspection and classification of seed varieties is an important link in the planting process, which bears directly on the yield and quality of rice. However, the continuous development of breeding technology has brought an increasing number of rice varieties to market to suit the planting environments and tastes of different regions, which makes identifying rice varieties during planting more difficult [1]. Machine vision technology, applied to the detection and identification of agricultural products, not only overcomes the shortcomings of traditional manual inspection but is also noncontact, non-damaging, fast and accurate [2]. Different varieties of rice seeds often show different external morphological characteristics, such as shape, color, and size [3], so rice seed varieties can be classified by inspecting their appearance. Kuo et al. [4] used an optical microscope to obtain clear two-dimensional images of rice grains of thirty varieties, proposed a classification method based on image processing and sparse-representation-based classification (SRC), and achieved a recognition rate of 89.1%. Golpour et al. [5] extracted thirty-six color features from the RGB, HSI, and HSV color spaces of rice grain images and used a neural network with two hidden layers to classify rice varieties. Mittal et al. [6] extracted geometric feature parameters from two-dimensional images of rice seeds using image processing and used support vector machines (SVM) to classify and evaluate rice varieties; the system achieved a recognition accuracy of 93%. Fabiyi et al. [7] extracted spatial and spectral features of rice seeds from high-resolution RGB images and hyperspectral images and classified them with random forest classifiers, which can effectively improve the purity of seeds. However, a two-dimensional image loses part of the spatial information of the object during imaging, so the characteristics of rice seeds that can be extracted from it are limited, which in turn limits the final recognition rate. Compared with the two-dimensional image, three-dimensional information obtained from the surface of the rice seed can describe the appearance characteristics of the seed more completely and accurately and has more advantages in the task of classifying rice seed varieties.
Qian et al. [8] successfully constructed a three-dimensional model of rice seeds using the Depth from Focus (DFF) method. Based on a three-dimensional model of rice seeds, eight feature values were extracted and input to the BP neural network, which ultimately reached a recognition rate of 90%. Li et al. [9] proposed a calculation method for the three-dimensional features of the surface shape of rice seeds based on the point cloud obtained by the three-dimensional laser scanning system to further improve the accuracy of the classification of rice varieties. Based on the team's previous research, Feng et al. [10] built a 3D laser scanning system to collect rice seed point clouds, extracted nine three-dimensional morphological surface features and nine cross-sectional projection features to input into the BP neural network for variety recognition experiments, and the average recognition accuracy reached 97%.
As a branch of machine learning, deep learning has made groundbreaking progress in many types of applications in recent years. A deep learning network maps the input data into new feature spaces through layer-by-layer feature transformation and automatically learns a hierarchical feature representation that supports effective classification [11]. However, the convolutional structure that often appears in deep learning networks requires regular data structures as input. Because a point cloud is an irregular data structure, it must be converted to a regular form before being input to such a network, and this conversion often causes structural loss and resolution degradation [12]. The PointNet [13] network addresses this problem: for unordered point cloud data, it uses a max-pooling function in the network structure so that the point cloud can be used directly as the network input for classification and segmentation tasks. Ma et al. [14] modified the PointNet network structure for 3D hand posture estimation, introduced a skip-connection structure to recombine features at different levels, and achieved an operating speed of 54 frames/s on the NYU hand posture dataset. Zhao et al. [15] proposed a method that combines point-based local features with global features extracted from the original point cloud to classify LiDAR point cloud features, which addressed the weak local feature extraction capability of PointNet and achieved better classification results than PointNet.
The existing rice variety recognition methods based on three-dimensional data use traditional neural networks as the classification model, and their recognition performance is usually limited by the number of extracted rice seed features. This study proposes a rice variety recognition method based on an improved PointNet model, which eliminates the need for a manually designed feature extraction process in order to obtain faster recognition speed and higher recognition accuracy.

Sample preparation
Eight varieties of rice seeds were prepared, with 210 samples of each variety: Fengkang 30, Huajing 7, Tianyou 673, Zajiaodao, Liannuo 1, Nanjing 9108, Guiyu 11 and Liandao 6. To ensure the diversity of sample sources, these rice seed samples came from various regions in China. Liannuo 1, Huajing 7 and Nanjing 9108 were from Jiangsu Province; Tianyou 673 was from Fujian Province; Liandao 6 was from Heilongjiang Province; Guiyu 11 was from Guangxi Province. Some of the samples used in this work are shown in Figure 1.

Point cloud collection system
The point clouds in this study were obtained with a 3D point cloud collection system, which was primarily composed of a light field camera (R42) manufactured by Raytrix and a high-speed GPU (NVIDIA GTX 1080). The light field camera was a focused plenoptic camera with 41.5 Megarays and a resolution of 7708×5352 pixels. The imaging lens was a 3D light field lens with a focal length of 50 mm and an aperture of f/2.80. The high-speed GPU was used for light field processing. The details of the camera model and depth estimation theory can be found in an article by Johannsen et al. [16]. The 3D light field camera was mounted on the vision platform for support, and a ring light source with adjustable brightness was installed between the light field camera and the base of the vision platform. Point cloud collection was performed in a dark room. The structure of the experimental devices is shown in Figure 2. Owing to device accuracy, camera resolution, environmental factors, and operator experience, the collected point cloud may contain noise and voids. Preprocessing was therefore performed using RxLive4.0 software. For filtering, a bilateral filter was used, and the filter radius was set to 20 pixels. For filling, the 'standard' fill algorithm was used, and the number of iterations was set to 16 with a lookup distance of 10 pixels. For smoothing, the edge smoothing factor was set to 0.100.
The Raytrix R42 light field camera can record not only three-dimensional information but also color information, and the RxLive4.0 processing software provided with the camera can export files in multiple formats. In this experiment, the data in point cloud format were selected for processing. Opening a file in PLY format with MeshLab software displays the three-dimensional model of a rice seed after preprocessing. A file in this format records both three-dimensional information and color information, as shown in Figure 3.
In this work, the original point cloud data collected by the focused plenoptic camera included both the platform base and the target rice seed. To obtain the target rice seed point cloud, the data of the platform base must be removed. The random sample consensus (RANSAC) algorithm [17] was used to calculate the parameters of the plane of the platform base. The calculated parameters were then used to identify the points belonging to the plane and remove them, and the segmented target rice seed point cloud was stored for further processing. The point cloud segmentation results are shown in
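The plane removal step can be illustrated with a minimal NumPy sketch. It is not the implementation used in the study, which relied on Point Cloud Library tools; the function names, the distance threshold and the iteration count are assumptions chosen for illustration only.

```python
import numpy as np

def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Fit a plane n.p + d = 0 with RANSAC; return (normal, d, inlier_mask).

    threshold and iterations are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(iterations):
        # Three random points define a candidate plane.
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # (nearly) collinear sample, no plane defined
            continue
        normal = normal / norm
        d = -normal.dot(sample[0])
        # Inliers are points within `threshold` of the candidate plane.
        mask = np.abs(points @ normal + d) < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (normal, d)
    if best_model is None:
        raise ValueError("no valid plane found")
    return best_model[0], best_model[1], best_mask

def remove_plane(points, **kwargs):
    """Drop the dominant plane (assumed to be the platform base)."""
    _, _, inliers = ransac_plane(points, **kwargs)
    return points[~inliers]
```

The sketch assumes that the platform base is the largest planar structure in the scene, so that the plane with the most inliers is the base rather than a flat patch of the seed.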

Point cloud downsampling
The focused plenoptic camera has a high resolution and can capture very detailed features of objects. However, for the classification of rice seeds, it captures a large amount of redundant data. Storing, processing, and displaying these point cloud data increases the processing load of the computer, occupying more storage and computing resources and reducing operational efficiency. The voxel grid method [18] and the random sampling method were therefore used to downsample the point cloud. First, a minimal three-dimensional voxel grid was created based on the point cloud bounding box. Then, the voxel grid was divided into m×n×l small grids with side length L. In each small grid, all the points were replaced by their centroid. In this way, the number of points was reduced to slightly more than 2048. Finally, random sampling was used to reduce the number of points to exactly 2048. The downsampled point cloud preserves the structural information of rice seeds well and maintains a clear edge contour, as shown in Figure 5.
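A minimal NumPy sketch of this two-stage downsampling is given below. The text does not state how the side length L was chosen, so the search used here (growing the voxel size while at least 2048 centroids remain) is an assumption, as are the function and parameter names.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points in each occupied voxel by their centroid."""
    origin = points.min(axis=0)  # corner of the bounding box
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]

def downsample_to(points, target=2048, start_size=1e-3, growth=1.1, seed=0):
    """Voxel-filter to slightly more than `target` points, then sample exactly `target`.

    start_size and growth are hypothetical search parameters.
    """
    if len(points) < target:
        raise ValueError("point cloud has fewer points than the target")
    reduced, size = points, start_size
    while True:
        candidate = voxel_downsample(points, size)
        if len(candidate) < target:
            break  # the previous voxel size was the coarsest acceptable one
        reduced = candidate
        size *= growth
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(reduced), size=target, replace=False)
    return reduced[keep]
```

Voxel filtering first keeps the spatial distribution of the points roughly uniform, and the final random sampling step only has to discard a small surplus, which limits the loss of edge detail.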

Point cloud translation and normalization
To ensure that the similarity measure of the shape is not affected by the translation and scaling of the point cloud, point cloud translation and normalization are important preprocessing steps [19,20] .
To ensure translation invariance, the center of the mass of the rice seed point cloud needs to be translated to the position of the coordinate origin. To ensure the invariance of the scale size, the rice seed point cloud after the translation transformation also needs to be normalized to the standard cell size.
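1) Point cloud translation. First, calculate the center $p_{center}(x_{mid}, y_{mid}, z_{mid})$ of the point set $P$:
$$x_{mid} = \frac{x_{max} + x_{min}}{2}, \quad y_{mid} = \frac{y_{max} + y_{min}}{2}, \quad z_{mid} = \frac{z_{max} + z_{min}}{2} \tag{1}$$
where $x_{max}$, $y_{max}$ and $z_{max}$ are the maximum coordinate values among all points, and $x_{min}$, $y_{min}$ and $z_{min}$ are the minimum values. Then, move the origin of the coordinate system of the point cloud to $p_{center}(x_{mid}, y_{mid}, z_{mid})$ to form the new point set $P' = \{p_i' \mid p_i' = p_i - p_{center},\ p_i \in P\}$.
2) Point cloud normalization. First, calculate the scale of the transformation:
$$l_{scale} = x_{max} - x_{min} \tag{2}$$
Then, with the point set $P'$ obtained in the previous step, compress the point cloud coordinates to between −1 and 1, forming the final point set $P'' = \{p_i'' \mid p_i'' = p_i' / l_{scale},\ p_i' \in P'\}$.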
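The two steps can be sketched in a few lines of NumPy. The sketch assumes that the center is the midpoint of the bounding box (consistent with the use of the maximum and minimum coordinates) and that the scale is the extent along the x axis, which is how Equation (2) reads; both are interpretations, and the function name is hypothetical.

```python
import numpy as np

def translate_and_normalize(points):
    """Center a point cloud and scale it by its x extent.

    Assumptions: the center is the bounding-box midpoint and
    l_scale = x_max - x_min (requires a nonzero x extent).
    """
    p_max = points.max(axis=0)
    p_min = points.min(axis=0)
    center = (p_max + p_min) / 2.0   # (x_mid, y_mid, z_mid)
    centered = points - center       # point set P'
    l_scale = p_max[0] - p_min[0]    # scale taken from the x axis
    return centered / l_scale        # point set P''
```

With these choices the x coordinates fall in [−0.5, 0.5], and the result is unchanged when the input cloud is shifted or uniformly rescaled, which is the invariance the preprocessing is meant to provide.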

Classification models
Basic PointNet architecture
Because acquisition is affected by the scanning equipment and the spatial coordinate system, the order of the points in point cloud data can differ greatly when objects are scanned with different equipment or at different locations [21]. To make a model invariant to the order in which the input set is fed, PointNet applies a symmetric function to the transformed elements in the set as follows:
$$f(\{x_1, \ldots, x_n\}) \approx g(h(x_1), \ldots, h(x_n)) \tag{3}$$
where $f: 2^{\mathbb{R}^N} \rightarrow \mathbb{R}$, $h: \mathbb{R}^N \rightarrow \mathbb{R}^K$, and $g: \underbrace{\mathbb{R}^K \times \cdots \times \mathbb{R}^K}_{n} \rightarrow \mathbb{R}$ is a symmetric function.
The function h is approximated by a multilayer perceptron network, and the function g is approximated by the composition of a single-variable function and a max-pooling function. Through a collection of functions h, the network can learn a number of functions f that capture different properties of the set.
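The order invariance obtained from a symmetric function can be checked numerically. In the sketch below, h is a single shared linear layer with ReLU and g is a channel-wise maximum; the layer size and random weights are placeholders for illustration, not the trained network.

```python
import numpy as np

def h(points, weights, bias):
    """Shared per-point map h: R^3 -> R^K (one linear layer followed by ReLU)."""
    return np.maximum(points @ weights + bias, 0.0)

def f(points, weights, bias):
    """Set function f = g(h(x_1), ..., h(x_n)) with g = channel-wise max."""
    return h(points, weights, bias).max(axis=0)

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))      # a toy point set
weights = rng.normal(size=(3, 16))      # K = 16 output channels
bias = rng.normal(size=16)
shuffled = points[rng.permutation(len(points))]
# The same set in a different order yields the same global feature.
same = np.allclose(f(points, weights, bias), f(shuffled, weights, bias))
```

Because the maximum over a set does not depend on the order of its elements, `same` evaluates to `True` for any permutation of the input points.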
The classification network architecture of the basic PointNet is shown in Figure 6. The network input is the 3D coordinates (N×3) of a 3D point cloud containing N points. First, a mini-network (T-net) predicts a 3×3 affine transformation matrix, and this transformation is applied directly to the coordinates of the input set to obtain an aligned N×3 input set. This mini-network consists of basic modules for point-independent feature extraction, a max-pooling layer and fully connected layers, and its structure resembles that of the full network. A shared-parameter Multi-Layer Perceptron (MLP) (64, 64) then extracts N×64 features from the aligned N×3 point set. Next, the feature transformation network T-net (64) predicts a 64×64 transformation matrix, which is applied to the extracted N×64 features to achieve feature alignment. A three-layer MLP (64, 128, 1024) then further processes the aligned N×64 features to obtain N×1024 features. The max-pooling layer aggregates the feature vectors of the N points into a 1024-dimensional global feature vector that is invariant to the ordering of the input points. Finally, a three-layer MLP maps the 1024-dimensional global feature vector to a k-dimensional output vector.
The operation process of the improved PointNet network is as follows:
1) The input transformation mini-network aligns the original input point cloud (N×3) to obtain an aligned point cloud (N×3);
2) The first multilayer perceptron, MLP1, performs feature extraction on the point cloud. MLP1 is composed of two convolutional layers with 64 channels each and produces N×64-dimensional point features;
3) The feature alignment network aligns the N×64-dimensional features extracted in the second step to obtain aligned N×64-dimensional point features;
4) In the second multilayer perceptron, MLP2, the first two convolutional layers, with 64 and 256 channels, transform the N×64-dimensional features obtained in the previous step into N×256-dimensional features;
5) The N×256-dimensional features obtained in the fourth step are concatenated with the N×64-dimensional features obtained in the second step through a cross-layer connection structure to obtain N×320-dimensional features;
6) The third convolutional layer in MLP2, with 1024 channels, transforms the N×320-dimensional features obtained in the previous step into N×1024-dimensional features;
7) The max-pooling layer aggregates the N×1024-dimensional features obtained in the previous step into a 1024-dimensional global feature;
8) The 1024-dimensional global feature vector is reduced layer by layer through MLP3. MLP3 is composed of three fully connected layers with 512, 256 and k nodes, and a dropout layer is added between each pair of fully connected layers. The k×1 vector output by the last fully connected layer is the classification result.
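The feature dimensions in these steps can be traced with a small NumPy forward pass. This is a shape-level sketch with random weights, not the trained TensorFlow model: the two alignment networks are replaced by the identity, batch normalization and dropout are omitted, the first fully connected layer is assumed to have 512 nodes as in the original PointNet, and all function names are hypothetical.

```python
import numpy as np

def dense(x, w, b, relu=True):
    """Shared per-point layer (equivalent to a 1x1 convolution) or FC layer."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def init_layers(sizes, rng):
    """Random placeholder weights for a list of (n_in, n_out) layer sizes."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o)) for i, o in sizes]

def build_params(k, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "mlp1": init_layers([(3, 64), (64, 64)], rng),
        "mlp2_head": init_layers([(64, 64), (64, 256)], rng),
        "mlp2_tail": init_layers([(320, 1024)], rng)[0],
        "mlp3": init_layers([(1024, 512), (512, 256), (256, k)], rng),
    }

def improved_pointnet_forward(points, params):
    x = points                               # (N, 3); input alignment omitted
    for w, b in params["mlp1"]:
        x = dense(x, w, b)
    low = x                                  # (N, 64); feature alignment omitted
    for w, b in params["mlp2_head"]:
        x = dense(x, w, b)                   # (N, 256) after the second layer
    x = np.concatenate([x, low], axis=1)     # (N, 320) cross-level connection
    w, b = params["mlp2_tail"]
    x = dense(x, w, b)                       # (N, 1024)
    g = x.max(axis=0)                        # (1024,) global feature
    for w, b in params["mlp3"][:-1]:
        g = dense(g, w, b)                   # 1024 -> 512 -> 256
    w, b = params["mlp3"][-1]
    return dense(g, w, b, relu=False)        # (k,) class scores
```

Calling `improved_pointnet_forward(points, build_params(8))` on a 2048×3 array returns eight class scores, and the concatenation line is the only structural difference from the basic PointNet feature extractor.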
The rectified linear unit (ReLU) function was used as the activation function; it mitigates the vanishing-gradient problem during training, reduces the occurrence of overfitting and speeds up convergence, as shown in Equation (4):
$$f(x) = \max(0, x) \tag{4}$$
where x represents the feature values of the neurons. Two dropout layers were added after the first and second fully connected layers to prevent the network from overfitting [22]. The dropout rates were both set to 0.7, which represents the probability of dropping neurons. The size and parameter count of each layer of the seven-layer network are shown in Table 1; the output layer, for example, produces a k×1 vector and has 257k parameters.
Note: Conv represents the convolutional layer; Max-pooling is the max-pooling layer; FC is the fully connected layer; the fully connected layer has no kernel size or stride parameters. N corresponds to the number of input point cloud points; k is the number of classes, the same as below.
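The 257k parameter count of the output layer can be reproduced with a short calculation, assuming each fully connected layer has one weight per input-output pair plus one bias per output. The widths 1024, 512 and 256 for the classifier head are an assumption based on the original PointNet; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected (or 1x1 convolution) layer."""
    return n_in * n_out + n_out

def classifier_head_params(k, widths=(1024, 512, 256)):
    """Parameter counts of the FC layers mapping the global feature to k classes."""
    sizes = list(widths) + [k]
    return [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

# For the eight rice varieties (k = 8):
counts = classifier_head_params(8)
```

For k = 8 this gives 524800, 131328 and 2056 parameters; the last value equals 257 × 8, in agreement with the 257k entry for the output layer.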

Network optimization
The improved PointNet network was optimized in two respects: the number of convolutional layer channels and the network hyperparameters. A convolutional layer operates on its input data to learn features, and stacking multiple convolutional layers transforms the features layer by layer into high-dimensional features. A convolutional layer generally has three parameters: the size of the convolution kernel, the stride of the convolution and the number of channels. The number of channels greatly affects the feature extraction ability of the network, so adjusting it can improve that ability.
The optimized hyperparameters were batch size and learning rate: batch size is the number of point clouds in each training batch, and learning rate is the step size of each parameter update during training.

Software tools
During the dataset generation process, Point Cloud Library (PCL) tools [23] and the Visual Studio 2015 platform were used to implement point cloud segmentation, rotation normalization, and point cloud downsampling. Python 3.7 and the TensorFlow framework were used to build the network model, which ran on the central processing unit (CPU). All software ran on a Windows 10 64-bit operating system with an Intel(R) Core(TM) i5-7200U CPU and 8 GB RAM.

Dataset
In the experiment, the point cloud dataset of each rice variety was divided into a training set and a test set at a ratio of 5:2, as shown in Table 2. Point clouds of eight rice varieties were collected, with 210 samples of each variety and 1680 samples in total. From the 210 point clouds of each variety, 150 were randomly selected as training samples, giving 1200 training samples in total. The remaining 60 point clouds of each variety were used as test samples, giving 480 test samples in total.
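A per-variety random split of this kind can be sketched as follows. The function name, the integer label encoding and the random seed are assumptions for illustration; the study does not describe its splitting code.

```python
import numpy as np

def split_per_variety(labels, n_train=150, seed=0):
    """Randomly pick n_train samples of each variety for training; the rest are test."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for variety in np.unique(labels):
        # Shuffle the indices of this variety, then cut at n_train.
        idx = rng.permutation(np.flatnonzero(labels == variety))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Eight varieties with 210 samples each, encoded as labels 0..7.
labels = np.repeat(np.arange(8), 210)
train_idx, test_idx = split_per_variety(labels)
```

Splitting within each variety rather than over the pooled dataset keeps the 150:60 ratio exact for every class, so the training and test sets are both balanced.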

Training process
Effects of batch size on model performance
When the batch size is set to 8, 16, and 32, the training curves of the model are as shown in Figure 8. The abscissa is the number of iterations in the model training process, and the ordinate is the loss value. Figure 8 shows that the training curves of the three batch sizes differ markedly, indicating that the batch size has a considerable impact on network performance. As the batch size increases, the loss value converges faster. When the batch size is 32, the loss value converges fastest, and the training loss value is the smallest. Therefore, a batch size of 32 was selected for training the improved PointNet model.
The training curves for learning rates of 0.01, 0.001 and 0.0001 are shown in Figure 9. The abscissa is again the number of iterations in the model training process, and the ordinate is the loss value. The training loss converges faster when the learning rate is 0.001 or 0.0001 than when it is 0.01. As the learning rate decreases, the loss value converges faster; however, the latter halves of the curves for learning rates of 0.001 and 0.0001 are close, so lowering the learning rate further has little effect on the convergence of the loss value. To ensure the performance of the model while keeping training fast, the learning rate of the improved PointNet model is set to 0.001.
Four combinations of convolutional layer channel numbers are listed in Table 3. The number of channels of the convolutional layers of the improved PointNet model was set according to these four combinations, and the model was trained. Figure 10 shows that the training curves of the four network models are relatively steep in the first 200 iterations, indicating that the loss value converges quickly. After 200 iterations, the curves flatten, and the loss value converges slowly and gradually stabilizes. The loss curves of the four models with different numbers of convolutional layer channels are relatively close, and the network performance gap is small.
Figure 10 Influence of the number of convolutional layer channels on the training effect of the improved PointNet model
The above four models were tested on the test set, and the results are shown in Table 4. The results show that when the numbers of channels of the convolutional layers of MLP2 are set to 64, 256 and 1024, the improved PointNet model has a recognition rate of more than 90% for four rice varieties, between 80% and 90% for three varieties, and below 80% for only one variety. The average recognition rate is the highest among the four combinations, and the classification effect is the best. Therefore, the numbers of channels of the convolutional layers in MLP2 of the final improved PointNet model are set according to Combination 4.

Classification results
The improved PointNet model was tested on the classification of rice varieties. The test set contains eight varieties of rice, with 60 samples of each variety and a total of 480 rice seed point cloud samples. The classification results are compared with those of the PointNet, PointNet++ [24] and DGCNN [25] models, as shown in Table 5. The comparison shows the following:
1) The improved PointNet model was better than the PointNet model in the classification of rice varieties. Compared with the PointNet model, the improved PointNet model had higher recognition accuracy for seven rice varieties, the exception being Liandao 1. In particular, the recognition rate of the improved PointNet model for Tianyou 673 increased from 73.2% to 83.3%, and that for Liannuo 1 increased from 78.6% to 86.7%, a substantial improvement. The average accuracy was 1.2% higher than that of PointNet.
2) The performance of the improved PointNet model in the classification of rice varieties was similar to that of PointNet++. The improved PointNet model had higher recognition rates for Zajiaodao, Tianyou 673, Huajing 7 and Liannuo 1 than PointNet++, but PointNet++ had higher recognition rates for Fengkang 30, Guiyu 11 and Liandao 1. The average classification accuracies of the two models were close to each other.
3) The classification accuracy of the DGCNN model for Huajing 7, Liannuo 1 and Guiyu 11 was the highest among the four models, while its classification accuracy for Zajiaodao, Tianyou 673 and Liandao 1 was lower. Its average classification accuracy was the lowest among the four models.
The above results showed that the performance of the improved PointNet model proposed in this study was better than that of PointNet and DGCNN, comparable to PointNet++, and could be used in rice variety classification tasks with three-dimensional point cloud data as input.
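The per-variety recognition rates and the average accuracy used in this comparison can be computed from predicted and true labels as sketched below. The function names and the integer label encoding are assumptions; the toy labels in the usage line are not results from the study.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """Fraction of samples of each class that were predicted as that class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.empty(n_classes)
    for c in range(n_classes):
        mask = y_true == c           # every class is assumed to have samples
        acc[c] = np.mean(y_pred[mask] == c)
    return acc

def average_accuracy(y_true, y_pred, n_classes):
    """Mean of the per-class recognition rates."""
    return per_class_accuracy(y_true, y_pred, n_classes).mean()

# Toy example with two classes and four samples.
toy_rates = per_class_accuracy([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

Because every variety contributes the same number of test samples (60), the mean of the per-class rates coincides with the overall fraction of correctly classified test samples.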

Conclusions
This study classified rice varieties based on 3D point cloud data and deep learning algorithms. In the experiment, a Raytrix light field camera was used to collect the original 3D point cloud data of eight varieties of rice seeds. The 3D model of the rice seeds obtained after preprocessing with RxLive software completely and accurately expresses the shape characteristics of the rice seeds. The construction method of a three-dimensional point cloud dataset for rice variety identification was studied, including point cloud segmentation, downsampling, translation and normalization processing algorithms. An improved PointNet model was proposed: by adding a cross-level feature connection structure to PointNet, low-level features and high-level features were connected, feature fusion was realized, and the utilization of features by the network was improved. Finally, testing on the test set showed that the improved PointNet model achieved an average accuracy of 89.4%, which was 1.2% higher than that of the PointNet model, and that its accuracy was higher than that of PointNet for seven of the eight varieties. Compared with PointNet++ and DGCNN, the average classification accuracy of the improved PointNet model was also higher. In future work, it is planned to expand the dataset to cover more rice varieties while maintaining classification accuracy.