Development of a mobile application for identification of grapevine (Vitis vinifera L.) cultivars via deep learning

Traditional methods for identifying vine varieties usually rely on sampling vine leaves followed by physical, physiological, biochemical and molecular measurements, which are destructive, time-consuming and labor-intensive and require experienced grape phenotype analysts. To mitigate these problems, this study developed an application (App) running on an Android client that identifies wine grape cultivars automatically and in real time, helping growers to obtain variety information quickly. Experimental results showed that all Convolutional Neural Network (CNN) classification models achieved an accuracy of over 94% for twenty-one categories on validation data, which demonstrates the feasibility of using deep transfer learning to identify grape cultivars in field environments. In particular, the classification model with the highest average accuracy was GoogLeNet (99.91%), with a learning rate of 0.001, a mini-batch size of 32 and a maximum of 80 epochs. Tests of the App on Android devices confirmed these results.


Introduction 
To meet the needs of a growing and increasingly affluent population, plant productivity must be improved and resources must be managed more effectively. This problem has become more severe in recent years under climate change and shrinking natural resources. Precise breeding by selecting favorable genomic variants helps to improve plant productivity and efficiency.
However, this method depends on a detailed understanding of the relationship between genotype and phenotype [1]. A framework for automatic feature (phenotype) extraction and classification during the plant growth stage can significantly promote precise breeding.
As the pillar of the wine industry, the grape is receiving increasing interest, and genome sequences are now available for thousands of germplasms [2][3][4].
These features can be calculated manually or through a customized image processing algorithm. The main disadvantage of hand-crafted descriptors is that, although easy to interpret, they may fail to capture the actual variability between germplasms. In addition, a customized image processing method for extracting manual features may not work well in other experiments and may be difficult to generalize to more heterogeneous datasets in practical applications [9][10][11]. Therefore, more advanced approaches are desirable for automatic genotype classification.
In the past few years, deep neural networks, especially convolutional neural networks (CNNs) [12][13][14][15][16][17][18], have drawn increasing interest in both academia and industry because they outperform conventional machine learning algorithms. The improved performance is mainly due to the complex structure of CNNs, the substantially increased volume of data and the significantly improved computing power. Unlike conventional machine learning approaches, a CNN can automatically extract the most descriptive and discriminating features from the image dataset in the training stage. Feature extraction and training are performed concurrently, and the algorithm attempts to learn the features that minimize the loss criterion for the phenotyping problem. As a result, the features for the problem of interest are generated automatically. It should be noted, however, that training a CNN is usually time-consuming and requires a large labelled dataset.
Traditional identification of vine varieties from vine leaves usually requires destructive sampling followed by physical, physiological, biochemical and molecular measurements [19,20]. Although these methods have proved effective, they are time-consuming and labor-intensive and usually require experienced grape phenotype analysts. To mitigate these problems, this study aimed to develop an application (App) running on an Android client that identifies wine grape cultivars automatically and in real time, helping growers to obtain variety information quickly. This was achieved by integrating mobile devices for imaging, CNNs for algorithm design and cloud servers for computing. A wide range of Android mobile devices is now available, and even low-end, affordable devices provide excellent computing and photographic capabilities, which makes it possible to develop specialized applications in many fields. However, applications for viticulture are still scarce, with only a few examples available for managing vineyards [21,22]. In this study, mobile phones are first used to collect vine leaf images of different cultivars to build the training dataset. CNNs of various architectures are then trained on the labelled dataset to derive a suitable classification model. Finally, a mobile App is built that classifies new vine leaf images with the CNN model via a cloud server, so that different wine grape cultivars can be identified in complex field environments.
To be more exact, the main contributions of this study are summarized as below: 1) A vine dataset including 5091 leaf images of 21 vine types was manually collected in field conditions; 2) Various CNN based image classification models were trained on the dataset to identify the suitable ones in terms of accuracy on validation dataset; 3) An Android App was developed which can automatically identify vine varieties by vine leaves via the cloud server.

Vine dataset
In this study, the vine dataset contains twenty-one varieties covering common cultivars in vineyards. The images were taken with a Canon EOS 70D in different vineyards in Ningxia Hui Autonomous Region, China. The Canon EOS 70D is a consumer camera with an approximately 20.2-megapixel CMOS sensor that adapts well to different scenes, so in this study it is considered representative of most mobile phone cameras. The JPEG images stored on the camera's memory card were transferred to and saved on the cloud server. The images were taken in the field without controlled conditions. They mainly show leaves, but also vines, soil and people. Sample images for the twenty-one cultivars are displayed in Figure 1.

Data preprocessing and augmentation
The performance of the vine classification task was improved by applying image pre-processing techniques that strengthen localized image features in the original input image [23]. Firstly, raw images were converted to their complement images. Each color channel of the resulting image was the complement of the corresponding color channel in the original image [24]. In the original image, the leaves appear green because of a mixture of red and green signals. In the complement images, however, the leaves appear purple because the red and blue signals are higher than the green signal. Then, all the processed vine leaf images were resized to match the input size of each CNN architecture.
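The complement step described above can be illustrated with a minimal sketch. It assumes 8-bit RGB images held as NumPy arrays, so that each channel value v becomes 255 - v; the function name `complement` and the sample pixel values are hypothetical and chosen only for illustration.

```python
import numpy as np

def complement(image: np.ndarray) -> np.ndarray:
    """Return the complement of an 8-bit image: each channel value v becomes 255 - v."""
    if image.dtype != np.uint8:
        raise TypeError("expected an 8-bit (uint8) image")
    return 255 - image

# A hypothetical leaf-like pixel in which green is the strongest channel (R, G, B).
leaf_pixel = np.array([[[60, 140, 40]]], dtype=np.uint8)

# After complementing, red and blue are both higher than green,
# which is why the leaf looks purple in the complement image.
print(complement(leaf_pixel))
```

Applying the function twice returns the original image, so no information is lost by this preprocessing step.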
Deep learning algorithms usually require a large labelled dataset. Although a relatively large number of vine leaf images was collected in this study, data augmentation techniques were further deployed to enlarge the labelled dataset so that the learned model can generalize well to unseen samples. In particular, geometric transformation techniques such as scaling, transposing, rotation and flipping were applied to expand the image dataset. This type of image augmentation is generic and computationally efficient for training deep learning models effectively [25]. At the same time, Gaussian white noise with a mean and variance of 0.01 was added to the images to improve the robustness of the algorithm. For each variety, 200 raw images were randomly selected for data augmentation. The augmented dataset finally contained 33,600 samples, with 1600 images per variety. For each variety, 70% of the images (1120) were randomly selected as the training set to train the classifier, and the remaining 30% (480) were used as a validation set to evaluate its performance. In addition, 10 extra images per variety that were used for neither training nor validation formed a test set to evaluate the performance of the App.
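The augmentation and the 70/30 split can be sketched as follows. This is only an illustration under stated assumptions: images are NumPy arrays scaled to [0, 1], the noise is taken to be zero-mean with variance 0.01, and the particular set of eight variants per image is hypothetical (it is chosen so that 200 raw images yield 1600 samples; the exact combination of transforms used in the study is not specified, and scaling is omitted here). The function names are also hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(image, mean=0.0, variance=0.01):
    """Add Gaussian white noise to an image in [0, 1] and clip the result.
    Assumption: zero mean and variance 0.01."""
    noisy = image + rng.normal(mean, np.sqrt(variance), size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def augment(image):
    """Return eight variants of one H x W x 3 image (hypothetical selection)."""
    return [
        image,
        np.fliplr(image),                # horizontal flip
        np.flipud(image),                # vertical flip
        np.transpose(image, (1, 0, 2)),  # transpose rows and columns
        np.rot90(image, k=1),            # rotate by 90 degrees
        np.rot90(image, k=2),            # rotate by 180 degrees
        np.rot90(image, k=3),            # rotate by 270 degrees
        add_gaussian_noise(image),
    ]

def split_indices(n_samples, train_fraction=0.7):
    """Randomly split sample indices into training and validation parts."""
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * train_fraction))
    return order[:n_train], order[n_train:]

raw_image = rng.random((4, 4, 3))        # stand-in for one leaf image
variants = augment(raw_image)
samples_per_variety = 200 * len(variants)
train_idx, val_idx = split_indices(samples_per_variety)
print(samples_per_variety, len(train_idx), len(val_idx))
```

With 200 raw images and eight variants each, the split reproduces the 1120 training and 480 validation images per variety reported above.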

Transfer learning of deep networks
Deep learning is currently one of the most popular methods and has shown strong performance on many image classification problems in the field of plant phenotyping. Weight sharing in deep networks supports effective image classification by discovering robust features in the images and reducing the vanishing gradient problem. The structure of a CNN generally includes convolutional layers, pooling layers and fully connected layers. A convolutional layer acts as a filter that automatically extracts image features. It is usually followed by a pooling layer, which performs downsampling and retains the most important information in the images. Pooling reduces the spatial size of the representation as well as the number of parameters, and therefore helps to limit overfitting and makes the model more efficient. The last layer is a fully connected layer, which uses a softmax activation function and takes the high-level features of the images to classify them into categories [26].
Because the number of labelled images is limited, transfer learning was applied to retrain the deep learning classifier. This concept is not new and has been applied in a number of previous studies [27][28][29]. The vine classification tasks are evaluated in terms of accuracy and efficiency. In transfer learning, the layered architecture of a pre-trained model such as AlexNet, ResNet or VGG (without its final classification layer) can be used as a fixed feature extractor to achieve better vine classification performance with a shorter training time [30]. In this study, five deep learning networks were explored for vine leaf classification via transfer learning: AlexNet, ResNet, GoogLeNet, DenseNet and VGG, all pre-trained on images from the ImageNet database [31][32][33][34][35]. Fine-tuning was adopted, in which the last fully connected layer was replaced and initialized for the classes of the target task. The training parameters were selected based on empirical observation of training convergence and training effect.
In this study, the performance of the model was evaluated by comparing its classification results with the actual labels. Two commonly used performance indicators were adopted: accuracy (ACC; Equation (1)) and recall rate (RECALL; Equation (2)) [36].
$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \quad (1)$$
$$\mathrm{RECALL} = \frac{TP}{TP + FN} \quad (2)$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
The data processing involved in model construction was run in Matlab® 2019b (The MathWorks, Inc., Natick, Massachusetts, USA). All experiments were carried out on a Linux machine running Ubuntu 16.04, with a GTX 2080ti GPU, an Intel® Core i7-5930K processor and 16 GB of DDR4 RAM.
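As a worked illustration of the two indicators, the sketch below counts true and false positives and negatives for one cultivar in a one-vs-rest fashion and then evaluates accuracy and recall using their standard definitions. The labels "A", "B" and "C" and the helper names are hypothetical; they do not come from the study's data.

```python
def one_vs_rest_counts(y_true, y_pred, target):
    """Count TP, TN, FP and FN for one class against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    tn = sum(1 for t, p in pairs if t != target and p != target)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """RECALL = TP / (TP + FN); defined as 0.0 when the class never occurs."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical labels for six leaf images of three cultivars.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, "A")
print(tp, tn, fp, fn)            # 2 3 0 1
print(accuracy(tp, tn, fp, fn))  # 5/6
print(recall(tp, fn))            # 2/3
```

Recall is reported per cultivar, which is why a class with one misclassified image out of ten shows a recall of 90% even when the overall accuracy is high.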

APP development
An App named VitisView was developed as a tool to accelerate the precise management of vineyards. In particular, a CNN-based classification model was deployed for Android devices via a cloud server, so that the leaf phenotype of wine grapes can be identified in field environments and in real time. The app can be downloaded from http://175.27.144.115:8080/VitisView.apk. VitisView was developed in Android Studio with JDK version 11.0.7. It can call the phone's camera to take images, and it sends and receives data through a socket. It can also upload an image from the phone's image library to the server and then receive the classification result for that image from the server.
Two typical mobile phones were used to test the performance of the app: a Vivo X9 with Android 7, 4 GB RAM and a Qualcomm® Snapdragon 625 processor, and a Mi 9 with Android 10, 6 GB RAM and a Qualcomm® Snapdragon 855 processor. The test set contains 210 images that were not used for training or validation, with 10 images per category. Performance was evaluated in terms of speed (the time from uploading an image to receiving the classification result) and accuracy.

Cloud server
The trained CNN classification model was deployed on the cloud server. Communication between the Android client and the cloud server was realized through the Hypertext Transfer Protocol (HTTP). A Java servlet running on Apache Tomcat [37] was responsible for processing HTTP messages. The processing pipeline of the cloud server is shown in Figure 3. Programs written in Python handled downloading images and transferring files from Android devices to the cloud server. Matlab ran the classification model to process images and output recognition results. The cloud server ran Windows Server 2016 Datacenter on an Intel® Xeon® Platinum 8255C CPU @ 2.5 GHz with 8 GB of DDR4 RAM, hosted on Tencent Cloud (Tencent, Inc., Shenzhen, China).
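The request and response cycle between client and server can be sketched with Python's standard library alone. This is not the study's implementation (which uses a Java servlet on Apache Tomcat together with Python and Matlab); it is a stand-in under stated assumptions: the `/classify` path, the JSON response fields and the `classify_stub` function are all hypothetical, and the stub returns a fixed placeholder instead of running a trained model.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify_stub(image_bytes):
    """Hypothetical placeholder for the trained model on the server."""
    return {"cultivar": "unknown", "confidence": 0.0,
            "bytes_received": len(image_bytes)}

class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the uploaded image from the request body.
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        body = json.dumps(classify_stub(image_bytes)).encode("utf-8")
        # Send the classification result back as JSON.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), ClassifyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def classify_remote(url, image_bytes):
    """Upload image bytes with HTTP POST and return the decoded JSON result."""
    request = urllib.request.Request(
        url, data=image_bytes,
        headers={"Content-Type": "image/jpeg"}, method="POST")
    # Bypass any configured proxy, since the example talks to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    server = start_server()
    url = f"http://127.0.0.1:{server.server_address[1]}/classify"
    print(classify_remote(url, b"placeholder image bytes"))
    server.shutdown()
    server.server_close()
```

In a real deployment the stub would be replaced by a call to the trained classifier, and the client side would be the Android app rather than a Python function.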

Classification model selection
To enable a fair comparison across all experimental configurations, the training parameters were standardized. In particular, the networks were trained using stochastic gradient descent with momentum, and the learning rate was reduced by a factor of 0.1 every 60 epochs. The maximum number of epochs was set to 80, and a mini-batch of 32 observations was used at each iteration. To validate the network at regular intervals during training, the validation frequency was chosen according to the mini-batch size so that the network was validated about once per epoch. Training progress was monitored in Matlab by plotting various metrics during training.
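The learning-rate schedule and the validation frequency can be made concrete with a short sketch. It assumes an initial learning rate of 0.001 (the value reported as best elsewhere in the paper), 1-based epoch numbering, and a training set of 21 x 1120 images; the function names are hypothetical and do not correspond to any Matlab API.

```python
def learning_rate(epoch, initial_rate=0.001, drop_factor=0.1, drop_period=60):
    """Piecewise-constant schedule: the rate is multiplied by drop_factor
    after every drop_period completed epochs (epoch numbering starts at 1)."""
    return initial_rate * drop_factor ** ((epoch - 1) // drop_period)

def validation_frequency(num_training_images, mini_batch_size):
    """Number of iterations per epoch, so validation runs about once per epoch."""
    return max(1, num_training_images // mini_batch_size)

num_training_images = 21 * 1120  # 21 cultivars x 1120 training images each
print(validation_frequency(num_training_images, 32))  # 735 iterations per epoch

# Epochs 1 to 60 use the initial rate; epochs 61 to 80 use one tenth of it.
for epoch in (1, 60, 61, 80):
    print(epoch, learning_rate(epoch))
```

With a maximum of 80 epochs and a drop period of 60, the rate is reduced exactly once during training.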
Classification results were obtained for different CNN architectures, including AlexNet, VGG-16, ResNet101, ResNet18, DenseNet and GoogLeNet. The accuracy of these classification models is shown in Table 1. The accuracy of the CNN classification models on the validation dataset exceeded 94% for the 21 cultivars, which demonstrates the feasibility of using transfer learning to identify grape cultivars in the actual growth environment. The model with the highest average test accuracy was GoogLeNet (97.4%). In particular, as shown in Figure 4, the GoogLeNet model reached a prediction accuracy of 100% for 9 of the 21 categories. The experimental results therefore indicate that pre-trained models with transfer learning provide good results for the grape dataset with 21 labels.
Figure 4. Recall rates of the twenty-one cultivars for different networks

Model parameter selection
Upon selecting GoogLeNet as the classification model in Section 2.1, the effects of different training parameters on the classification performance were tested.
In particular, the influence of the learning rate and the mini-batch size on the training results was considered. Three experiments were set up with learning rates of 0.001, 0.005 and 0.0001 while keeping the other parameters unchanged. Table 2 shows the classification accuracy for the different learning rates. The best classification performance was obtained with a learning rate of 0.001. As far as memory allows, increasing the mini-batch size within a reasonable range can improve memory utilization and increase training speed. For large-scale training, it is necessary to check that training with a larger batch size reaches a test accuracy similar to that of a smaller batch size for the same number of epochs. The number of epochs was kept unchanged because, from a statistical point of view, one epoch means that the algorithm sees the whole dataset once, and from a computational point of view, a fixed number of epochs means that the number of floating-point operations remains unchanged. Three experiments were set up with mini-batch sizes of 16, 32 and 64 while keeping the learning rate at 0.001 and the other parameters unchanged. Table 3 shows the classification accuracy for the different mini-batch sizes for the GoogLeNet model.
The results showed that a mini-batch size of 32 was sufficient. During training, the maximum number of epochs was found to be an important factor affecting the learning progress of the model. Four experiments were set up with maximum epoch numbers of 30, 60, 80 and 120 while keeping the other parameters unchanged. Table 4 shows the accuracy for the different maximum epoch values. The results showed that the ideal maximum number of epochs was 80. Therefore, the optimized parameters for the GoogLeNet classification model are a learning rate of 0.001, a mini-batch size of 32 and a maximum of 80 epochs.
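The experimental design in this section varies one parameter at a time around a common baseline. The sketch below enumerates those configurations and shows how the number of training iterations changes with the mini-batch size when the number of epochs is fixed. It assumes the baseline of learning rate 0.001, mini-batch size 32 and 80 epochs, and a training set of 21 x 1120 images; the dictionary keys and function names are hypothetical.

```python
BASELINE = {"learning_rate": 0.001, "mini_batch_size": 32, "max_epochs": 80}

SWEEPS = {
    "learning_rate": [0.001, 0.005, 0.0001],
    "mini_batch_size": [16, 32, 64],
    "max_epochs": [30, 60, 80, 120],
}

def one_factor_at_a_time(baseline, sweeps):
    """Vary each parameter in turn while keeping the others at the baseline."""
    configs = []
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            if config not in configs:  # the baseline appears in every sweep
                configs.append(config)
    return configs

def total_iterations(config, num_training_images=21 * 1120):
    """Iterations needed to complete max_epochs passes over the training set."""
    per_epoch = num_training_images // config["mini_batch_size"]
    return per_epoch * config["max_epochs"]

configs = one_factor_at_a_time(BASELINE, SWEEPS)
for config in configs:
    print(config, total_iterations(config))
```

Doubling the mini-batch size roughly halves the number of iterations for the same number of epochs, while the number of images processed stays the same.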

App performance
The CNN classification models for grapevine identification were trained and tested in Matlab and then uploaded to the cloud server. The cloud server caches the images, classifies them and returns the classification results. The app was tested on the two Android devices mentioned in Section 2.4. On the Mi 9, the accuracy was 98%, the recall rate of 19 categories reached 100%, and the minimum recall was 70%. The average response time from uploading an image to receiving the result was 8.25 s, and the average speed of the test Internet connection was 1.19 Mb/s. The same experiment was repeated on the Android 7 phone (Vivo X9): the overall accuracy was 98%, 19 categories achieved 100%, and the minimum accuracy was 80%. The average response time was 8.28 s and the average connection speed was 1.33 Mb/s. These results show that VitisView is stable and robust and can help to obtain grapevine cultivar information in the field in real time.

Discussion
Plant leaves change little across growth stages, so they are often used as the feature for plant cultivar identification. This study has shown that, even in field environments, images collected by typical smartphones or digital cameras can be used to train deep CNNs by transfer learning for accurate wine grape cultivar identification.
Little has been published in the past few years on accurately identifying the phenotype of wine grapes by nondestructive methods such as imaging-based approaches. Carlos et al. [38] evaluated transfer learning based on AlexNet for grapevine cultivar identification and reached a test accuracy of 77.30% on a dataset of six varieties with an average of 14 images per variety captured in the field. In the present study, the accuracy of the trained model reached 99.91%. Besides the large number of images available for training, early dataset augmentation plays an important role in achieving this performance. When transfer learning is used for object recognition, images are often acquired in a controlled laboratory environment to improve the accuracy of the model [39]. Alternatively, image preprocessing is used to enhance the object or weaken the background. For example, Canny edge detection was used to detect edges in gray-scale insect images converted from RGB and to suppress noise, as a data augmentation method to facilitate crop pest recognition using transfer learning [29].
Threshold processing, contour detection and watershed algorithm were used to eliminate the influence of natural background on object detection [28] .
For comparison, the raw dataset was also tested without the image complement processing mentioned in Section 5.2. As before, geometric transformation techniques such as scaling, transposing, rotation and flipping were applied to expand the number of image samples. The datasets with and without image complement processing were both used to train the GoogLeNet model, with the same number of images and the same training parameter settings, and the resulting classification models were compared. The accuracy was 97.49% for the dataset without image complement processing and 99.66% for the dataset with it. In other words, when deep learning is used to identify grape cultivars in the field environment, complement processing of the raw image has a noticeable impact on performance. When the classification model on the cloud server identifies an image obtained from an Android device, the image is first complemented and then passed to the model, which adds little complexity.
In addition, the Grad-CAM algorithm was introduced to analyze the impact of image complement preprocessing on the classification results. Grad-CAM calculates the gradient of the final classification score with respect to the final convolutional layer in the network [40]. The locations in the convolutional feature map where the gradient of the softmax output is largest are those on which the final result depends most. To this end, an image of Weddell was used as an example to assess the effects of image complement preprocessing. In addition to the classification result, a gradient heat map for the final convolutional layer was obtained with the Grad-CAM algorithm. The results in Figure 7 showed that the veins and the main part of the leaf had the greatest influence on the classification. After image complement preprocessing, the leaf veins were more easily recognized, whereas the model without image complement preprocessing did not work well because the background of the leaves adversely affected the final classification results. Therefore, image preprocessing plays an important role in achieving a satisfactory classification result.
Figure 7. a. With image complement preprocessing. b. Without image complement preprocessing. Note: The first value above each image is the serial number of the grape cultivar (the correspondence is given in the appendix); the second value is the confidence level of the classification result.
In this study, VitisView was developed for Android devices to identify wine grapes automatically in real time and to help obtain grapevine cultivar information in the field. Moreover, this automatic real-time recognition algorithm for wine grapes can be applied not only to mobile devices but also to IoT devices, making it a part of the construction of unmanned farming.
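The Grad-CAM computation discussed in this section can be sketched in a few lines. The sketch assumes that the feature maps of the final convolutional layer and the gradients of the class score with respect to them have already been extracted from the network as arrays of shape (K, H, W); the function name and the toy values are hypothetical. Each channel weight is the spatial mean of its gradient, and the heat map is the rectified weighted sum of the feature maps.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM heat map.

    feature_maps: activations of the final convolutional layer, shape (K, H, W).
    gradients: gradient of the class score with respect to those activations,
               same shape.
    """
    weights = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum, shape (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example with two channels on a 2 x 2 grid.
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([np.ones((2, 2)), -np.ones((2, 2))])

# The first channel supports the class (positive gradient), the second opposes it,
# so only the top-left location is highlighted.
print(grad_cam(feature_maps, gradients))
```

In practice the heat map is resized to the input image size and overlaid on the leaf image, which is how regions such as veins can be seen to drive the classification.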
In addition, although the identification model constructed in this study covers 21 common vineyard cultivars, this is not enough for the wine industry in the long run. The dataset can be enlarged and the algorithm modified to identify more cultivars in the future.

Conclusions
In this study, an App named VitisView running on an Android client was developed to identify wine grape cultivars in real time under field conditions, which can help growers to quickly obtain variety information for wine grapes. This was achieved by integrating a number of techniques and algorithms, including phone imaging, convolutional neural networks and a cloud server. In particular, a total of 5091 leaf images covering 21 typical grape cultivars were first collected in the field. Image preprocessing and data augmentation were then adopted to enhance image features and enlarge the training dataset. On this basis, a number of typical CNN models (including VGG-16, DenseNet, ResNet101, ResNet18, and GoogLeNet) were compared in transfer learning training, together with model parameter tuning, to identify the most suitable one. The GoogLeNet model outperformed the other models in terms of accuracy, model complexity and robustness, with a fine-tuned accuracy of 99.91%. The effects of image complement preprocessing were also assessed using the Grad-CAM algorithm.