Monitoring model for predicting maize grain moisture at the filling stage using NIRS and a small sample size

Abstract: The change in maize moisture content across growth stages is an important indicator of the growth status of maize. In particular, the moisture content during the grain-filling stage reflects grain quality and maturity and can also serve as an indicator for breeding and seed selection. At present, the moisture content and the dehydration rate at the grain-filling stage are usually determined by the drying method, which requires a large sample size and a long test time. To monitor the change in moisture content at the maize grain-filling stage with a small sample set, near-infrared spectroscopy was combined with a moisture content monitoring model based on the Bootstrap re-sampling strategy, sample set partitioning based on joint x-y distances, and partial least squares (Bootstrap-SPXY-PLS) for small sample sizes of 10, 20, and 50. To improve the prediction accuracy of the model, the optimal number of factors was determined and a comprehensive evaluation threshold, RVP, which combines the coefficient of determination (R²), the root mean square error of cross-validation (RMSECV), and the root mean square error of prediction (RMSEP), was proposed for sub-model screening. The model performed well in predicting the moisture content of maize grain at the filling stage for small sample sets. For the sample sizes of 20 and 50, the R² values were greater than 0.99. The average deviations between the predicted and reference values were 0.1078%, 0.057%, and 0.0918% for the sample sizes of 10, 20, and 50, respectively. Therefore, the model was effective for monitoring the moisture content at the grain-filling stage with a small sample size. The method is also suitable for the quantitative analysis of different concentrations using near-infrared spectroscopy and small sample sizes.


Introduction 
The moisture content of maize grains is an important indicator for determining the timing of mechanical harvesting, predicting the yield, grading, and safe storage [1][2][3]. The development of new maize varieties suitable for mechanical harvesting is a major research focus in maize breeding. Therefore, monitoring and testing the moisture content at the grain-filling stage are important for crop growth and breeding. Accurate moisture content monitoring not only enables crop managers to deal with water stress caused by external factors [4] and to predict the maturity and quality of the maize ears but also serves as a tool for crop experts in breeding and seed selection.
In maize breeding, the seed and its parents are very valuable [5]. Because self-pollination of maize ears is rare, a mature ear may bear only a few dozen grains, or even about a dozen. The moisture content at the maize grain-filling stage is always higher than 30%; therefore, drying methods are required. During drying, only the 150-200 grains in the middle of the ear can be used for moisture content measurements [6]. Accordingly, a large sample size, equipment, and handling time are required for determining the moisture content. However, in maize breeding trials, the planting area is usually small and only a small number of maize ears are available. Therefore, the number of samples available for moisture content measurements at the maize grain-filling stage is commonly low.
Near-infrared spectroscopy (NIRS) combined with chemometrics has become a popular technique for quantitative analysis [7][8][9], quality detection [10][11][12], and identification of seed varieties [13]. As a result of changes in application requirements and the continuous advancement of technology, NIRS has gradually become a monitoring technique in areas such as food, medicine, environment [14,15], materials [16,17], and crop growth [18,19]. Near-infrared devices have become smaller, and many monitoring and analysis methods now require real-time operation and high model accuracy [20]. Most of the related research has occurred in the food and medicine fields. In food research, Lopes et al. [21] applied NIRS to monitor the peroxidase bio-catalytic reaction in horseradish. Ringsted et al. [22] used NIRS to monitor the aging process of wheat bread. Genisheva et al. [23] monitored volatile compounds in wine. Other researchers used NIRS to monitor the fermentation processes of solid ethanol [24], cider [25], and rice wine [26]. In medical research, NIRS was used to monitor drug production [27,28] and extraction processes [29]. Researchers have also used NIRS sensors for the continuous monitoring of blood glucose levels [30] and the real-time monitoring of cell biology [31].
However, few studies have been conducted on the use of NIRS for crop kernel growth monitoring.
The sample size required for collecting and analyzing near-infrared spectral data during crop growth is several times larger than that for conventional quantitative and qualitative analysis because it scales with the number of near-infrared monitoring times during crop growth. In general, the quantitative or qualitative analysis of NIRS data requires 100 to 200 samples [32], and one spectrum sample consists of 3-5 grains for destructive sampling and 50 grains for non-destructive sampling. If the moisture in the grain-filling stage is determined at seven sampling times, thousands of grains are required for destructive sampling and tens of thousands of grains for non-destructive sampling using NIRS. The number of grains would be even higher if a drying method were used. Therefore, a model for monitoring crop growth data using NIRS and small sample sizes is needed to meet real-world application requirements.
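The order of magnitude of these grain counts can be checked with a short calculation. The sketch below assumes that the full 100 to 200 spectra are collected at each of seven sampling times; that reading of the text, and all variable names, are assumptions made for illustration.

```python
# Rough grain-count estimate from the figures quoted in the text.
# Assumption: 100-200 spectra are needed at EACH of the 7 sampling times.
SAMPLING_TIMES = 7
SPECTRA_PER_TIME = (100, 200)     # typical NIRS calibration size [32]
GRAINS_DESTRUCTIVE = (3, 5)       # grains per spectrum, destructive sampling
GRAINS_NONDESTRUCTIVE = (50, 50)  # grains per spectrum, non-destructive


def grain_range(grains_per_spectrum):
    """Return the (minimum, maximum) number of grains over all samplings."""
    low = SAMPLING_TIMES * SPECTRA_PER_TIME[0] * grains_per_spectrum[0]
    high = SAMPLING_TIMES * SPECTRA_PER_TIME[1] * grains_per_spectrum[1]
    return low, high


destructive = grain_range(GRAINS_DESTRUCTIVE)
nondestructive = grain_range(GRAINS_NONDESTRUCTIVE)
print(destructive)     # (2100, 7000)
print(nondestructive)  # (35000, 70000)
```

Under this assumption, destructive sampling needs several thousand grains and non-destructive sampling several tens of thousands, which is far more than a small breeding plot can supply.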
In recent years, many studies have been conducted on the analysis of small-sample datasets [33,34]. Commonly, the sample size is increased using certain methods to improve the results. The bootstrap algorithm [35] is a common resampling method for small sample sets in chemometrics and its reliability has been demonstrated by many researchers [36][37][38]. We have previously investigated the feasibility of using the bootstrap method for the quantitative analysis of the maize moisture content at the grain-filling stage [39], and a similar study was conducted by other researchers [35]. When the bootstrap algorithm is applied to NIRS data from small sample sets, the distribution of the data does not have to be considered and pre-processing is not required; 10 is the minimum number of samples at which the results remain stable [39].
In this study, we developed a monitoring model for predicting the change in the maize moisture content at the grain-filling stage using full-spectrum NIRS and small sample sizes of 10, 20, and 50. The model combines the bootstrap, sample set partitioning based on joint x-y distances (SPXY), and partial least squares (PLS) regression. It has two key steps: the selection of the optimal number of factors and the sub-model screening. The proposed model improves the efficiency and accuracy of the moisture content determination during crop growth, reduces the cost of the measurements, and provides a new method for moisture content determination during crop growth. The results not only contribute to crop growth monitoring and seed breeding research but also provide a new method for the quantitative analysis of small-sample data with different concentrations using full-spectrum near-infrared spectroscopy.

Materials
The samples were collected from the maize test base at the Heilongjiang Bayi Agricultural University. The base is located in Daqing, China, in the continental monsoon climate zone of the north temperate zone, where summers are short.
The temperature was in the range of 24°C-32°C during testing. It began to rain at the end of September. The experimental planting area was about 800 m², the planting density was 12 plants/m², and the variety was "Xianyu 335". The maize ear samples were collected every 7 d during the grain-filling stage. Each time, 10 maize ears were sampled, quickly moved to the laboratory, and stored at a low temperature to minimize water loss after harvesting at high temperatures. Chemical testing and spectra acquisition were completed in the shortest time possible to minimize any external influences on the predictive model. The middle 200 grains of the maize ears were obtained; some grains were dried to measure the moisture content, some were used to collect the spectral data, and the rest were dried naturally.
The spectrometer was a WQF-600N Fourier transform near-infrared spectrometer (FTNIR) (Beijing Rayleigh) with a wavenumber range of 4000 cm⁻¹ to 10 000 cm⁻¹. Each sample was scanned 32 times and the scans were averaged. To use as few grains as possible, the grains were ground with a mill before the spectral data were acquired. The sample pool was filled with the ground grain, as shown in Figure 1. More than 100 spectral curves were obtained. After the abnormal samples were eliminated, 50 samples were used for the modeling set and the remaining 50 samples were used for the prediction set.
a. The sample pool before loading  b. The sample pool after loading
Figure 1 Sample pool for the collection of the spectral data
A secondary drying method was used to obtain the chemical reference values. The samples were first dried at 105°C for 2 h; then the temperature was kept constant at 85°C until the mass of the 100 grains no longer changed. The entire drying process typically required about 12 h. The moisture content was calculated using Equation (1):
$$\mathrm{WC} = \frac{\mathrm{FW} - \mathrm{DW}}{\mathrm{FW}} \times 100\% \tag{1}$$
where WC is the moisture content of the 100 grains, %; FW is the fresh weight of the 100 grains, g; and DW is the dry weight of the 100 grains, g.
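As a minimal sketch of the moisture calculation, assuming the usual wet-basis definition WC = (FW - DW) / FW × 100, the function below converts a fresh and a dry weight into a percentage. The function name and the example weights are hypothetical.

```python
def moisture_content(fresh_weight_g, dry_weight_g):
    """Wet-basis moisture content in percent (assumed form of Equation 1)."""
    if fresh_weight_g <= 0:
        raise ValueError("fresh weight must be positive")
    if not 0 <= dry_weight_g <= fresh_weight_g:
        raise ValueError("dry weight must lie between 0 and the fresh weight")
    return (fresh_weight_g - dry_weight_g) / fresh_weight_g * 100.0


# Hypothetical weights of a 100-grain sample before and after drying.
wc = moisture_content(40.0, 26.0)
print(round(wc, 2))  # 35.0
```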

Bootstrap
The bootstrap algorithm is based on resampling. The simulated NIRS dataset retains all the characteristics of the raw NIRS dataset and has a better sample distribution. In this algorithm, the iterated dataset is added to the original dataset and the iterated weights are recalculated. The flow of the algorithm is shown in Figure 2.
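The resampling idea can be illustrated with a plain non-parametric bootstrap that draws samples with replacement. This is only a sketch: it does not reproduce the merging and weight-recalculation steps of the flowchart, and the function name, the seed, and the toy data are assumptions.

```python
import random


def bootstrap_resample(samples, n_rounds, seed=0):
    """Draw n_rounds bootstrap replicates, each as large as the original set.

    Every replicate is built by sampling indices with replacement, so it
    contains only rows that exist in the original data.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    n = len(samples)
    replicates = []
    for _ in range(n_rounds):
        indices = [rng.randrange(n) for _ in range(n)]
        replicates.append([samples[i] for i in indices])
    return replicates


# Toy (spectrum, moisture %) pairs; the numbers are invented for illustration.
toy_samples = [
    ([0.41, 0.52, 0.47], 33.2),
    ([0.39, 0.50, 0.45], 31.8),
    ([0.36, 0.48, 0.43], 29.5),
    ([0.33, 0.45, 0.40], 27.1),
]
replicates = bootstrap_resample(toy_samples, n_rounds=500)
print(len(replicates), len(replicates[0]))  # 500 4
```

Fixing the seed makes the replicates reproducible, which is convenient when sub-models built on them are compared later.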

SPXY
In SPXY, the distance between each sample and all other samples is calculated in both the spectral (x) space and the concentration (y) space. The SPXY algorithm is a classic sample optimization algorithm that is suitable for dividing samples in NIRS quantitative analysis. In recent years, researchers have used it to optimize the selection of NIRS quantitative samples and have achieved good model results [40].
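A compact sketch of the selection step is given below, assuming the commonly described form of SPXY: the x and y distances are each divided by their maximum, summed, and then used in a Kennard-Stone style max-min selection. The function name and the toy data are hypothetical, and the paper's own implementation may differ in detail.

```python
import math


def spxy_select(X, y, n_select):
    """Return indices of n_select samples chosen by joint x-y distance."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise distances in spectral space (x) and concentration space (y).
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(max(row) for row in dx) or 1.0  # guard against all-equal data
    max_dy = max(max(row) for row in dy) or 1.0
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy one-variable "spectra" and moisture values, invented for illustration.
X_toy = [[0.0], [1.0], [2.0], [10.0]]
y_toy = [0.0, 1.0, 2.0, 10.0]
print(spxy_select(X_toy, y_toy, 3))  # [0, 3, 2]
```

Because both distance matrices are scaled to the same range, spectra and reference values carry equal weight in the selection.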

PLS and the optimal factor number
When establishing a quantitative regression model based on PLS, the number of factors directly affects the model outcome. If the number of factors is too small, some information in the spectrum may be lost and under-fitting may occur; if the number of factors is too large, noise is introduced into the model, resulting in over-fitting. Both under- and over-fitting cause large prediction errors [41]. The number of PLS model factors is one of the key elements in developing a robust model for different spectral samples, different crop growth stages, and different moisture contents.

Figure 2 Flowchart of the bootstrap algorithm
To ensure the accuracy and stability of the mathematical model and to prevent under-fitting or over-fitting, two cross-validation analyses are used to determine the optimal number of factors. The first examines how the root mean square error of cross-validation (RMSECV) varies with the number of factors for different moisture contents and different numbers of spectral samples. The second examines how the R² varies with the number of factors under the same conditions.
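One way to turn the RMSECV analysis into a rule is sketched below: locate the factor number with the largest RMSECV, then follow the curve downward until it starts to rise again. This is a simplified reading that assumes the largest RMSECV occurs before the minimum of interest; the function name and the example curve are invented for illustration, and the R² trend would still be checked alongside.

```python
def optimal_factor_number(rmsecv):
    """Pick a factor number from an RMSECV curve.

    rmsecv[k] is the RMSECV obtained with k + 1 factors. The function finds
    the peak of the curve and returns the 1-based factor number of the first
    local minimum that follows it.
    """
    if not rmsecv:
        raise ValueError("rmsecv must not be empty")
    peak = max(range(len(rmsecv)), key=lambda k: rmsecv[k])
    best = peak
    for k in range(peak + 1, len(rmsecv)):
        if rmsecv[k] < rmsecv[best]:
            best = k
        else:
            break  # the error has started to rise again
    return best + 1


# Invented RMSECV values for 1..10 factors, peaking at 5 factors.
example_curve = [0.8, 0.9, 1.1, 1.3, 1.5, 1.2, 0.9, 0.6, 0.7, 0.9]
print(optimal_factor_number(example_curve))  # 8
```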

Sub-model screening
The performance of a sub-model is usually evaluated by the RMSECV, the root mean square error of prediction (RMSEP), and the R² values. The larger the R² value, the stronger the prediction ability of the model; in a good model, the RMSECV and RMSEP values are also relatively small and close to each other. If the RMSECV is much larger than the RMSEP, the samples are poorly representative; if the RMSECV is far smaller than the RMSEP, the modeling samples are poorly representative and the information is either fitted inadequately or over-fitted.
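The evaluation metrics can be computed as in the sketch below. It assumes the standard definitions: the root mean square error of the residuals, and R² as one minus the ratio of the residual to the total sum of squares. The RMSECV and RMSEP differ only in which predictions are passed in (cross-validation predictions on the modeling set, or predictions on the prediction set). Function names and data are illustrative.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def r_squared(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0:
        raise ValueError("reference values must not all be equal")
    return 1.0 - ss_res / ss_tot


# Invented moisture values (%) for illustration.
reference_wc = [30.0, 32.0, 34.0, 36.0]
predicted_wc = [30.5, 31.5, 34.5, 35.5]
print(rmse(reference_wc, predicted_wc))                 # 0.5
print(round(r_squared(reference_wc, predicted_wc), 2))  # 0.95
```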
To enhance the prediction accuracy of the model, in addition to using R², RMSECV, and RMSEP to screen the sub-models, we integrated the three evaluation parameters into a screening threshold, RVP, where R represents the R², V represents the RMSECV, and P represents the RMSEP. The formula for calculating the threshold RVP is shown in Equation (2). RVP_i is the threshold of the i-th sub-model.
A threshold RVP of less than 1.99 indicates that the RMSECV deviates strongly from the RMSEP or that the R² value is less than the average value of the sub-models.
In the sub-model screening, we delete any sub-model sub-m_i that has the minimum R² value, the maximum RMSECV and RMSEP values, or an RVP_i < 1.99.
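The screening rule can be sketched as a filter over per-sub-model statistics. In the sketch below, the RVP values are taken as given inputs because Equation (2) is not reproduced here, and each criterion (minimum R², maximum RMSECV, maximum RMSEP, RVP below 1.99) is treated as sufficient for removal on its own, which is one reading of the rule. The dictionary keys and the numbers are hypothetical.

```python
RVP_THRESHOLD = 1.99  # threshold quoted in the text


def screen_submodels(submodels, rvp_threshold=RVP_THRESHOLD):
    """Return the sub-models that survive screening.

    Each sub-model is a dict with keys 'name', 'r2', 'rmsecv', 'rmsep', 'rvp'.
    A sub-model is removed if it has the minimum R2, the maximum RMSECV,
    the maximum RMSEP, or an RVP below the threshold.
    """
    if not submodels:
        return []
    min_r2 = min(m["r2"] for m in submodels)
    max_rmsecv = max(m["rmsecv"] for m in submodels)
    max_rmsep = max(m["rmsep"] for m in submodels)
    kept = []
    for m in submodels:
        rejected = (
            m["rvp"] < rvp_threshold
            or m["r2"] == min_r2
            or m["rmsecv"] == max_rmsecv
            or m["rmsep"] == max_rmsep
        )
        if not rejected:
            kept.append(m)
    return kept


# Invented statistics for four sub-models.
candidates = [
    {"name": "A", "r2": 0.991, "rmsecv": 0.20, "rmsep": 0.21, "rvp": 2.01},
    {"name": "B", "r2": 0.975, "rmsecv": 0.35, "rmsep": 0.30, "rvp": 1.95},
    {"name": "C", "r2": 0.989, "rmsecv": 0.22, "rmsep": 0.40, "rvp": 2.00},
    {"name": "D", "r2": 0.993, "rmsecv": 0.19, "rmsep": 0.20, "rvp": 2.02},
]
print([m["name"] for m in screen_submodels(candidates)])  # ['A', 'D']
```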

Bootstrap-SPXY-PLS moisture monitoring model based on a small sample set
The monitoring model for determining the change in the moisture content is based on the bootstrap method, SPXY, and PLS regression using small sample sets, as shown in Figure 3. In the model, resampling and sample division are combined to create a dataset that meets the needs of sample analysis and modeling, so that the prediction accuracy and robustness of the model are maintained when the number of samples is between 10 and 50. The bootstrap resampling algorithm repeatedly simulates a small sample dataset and uses sample merging to ensure the differences between the samples of the new dataset, thereby creating a resampled set. The SPXY algorithm then screens the resampled set using a re-extraction strategy, which yields a set of modeling samples composed of multiple subsets. Next, one subset is randomly selected for pre-modeling based on PLS, and the RMSECV and R² values for each number of factors are recorded. The optimal number of factors for establishing the model is obtained by comparing the RMSECV and the R² of the model for the different numbers of factors; at the optimum, the RMSECV is small, the R² value is large, and the two are balanced. This number of factors is also used for the regression prediction. Based on the number of samples and the optimal number of factors of the moisture content monitoring model at the different grain-filling stages, a PLS sub-model using the optimal number of factors is created for each modeling subset, giving the sub-models sub-m_i, where i = 1, 2, 3, ..., n. The means of the R², RMSECV, and RMSEP of the sub-m_i are calculated and the RVP threshold of each sub-m_i is calculated using Equation (2). The sub-models are then screened.
Any sub-model with the minimum R² value, the maximum RMSECV and RMSEP values, or a threshold RVP of less than 1.99 is deleted. Finally, the retained sub-m_i are used to perform the regression prediction with the optimal number of factors. Each prediction result (predicted_i) becomes one member of the set of prediction results, and the moisture content predictions for the different samples at the different grain-filling stages are obtained. The mean value of the set of predicted values is used as the predicted moisture content during the corresponding grain-filling stage.
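The final averaging step can be sketched as follows: the predictions of the retained sub-models are collected per sampling date and their mean is reported. The data structure, the function name, and the numbers are assumptions for illustration only.

```python
from statistics import fmean


def stage_predictions(submodel_outputs):
    """Average the retained sub-model predictions for each sampling date.

    submodel_outputs maps a sampling date to the list of moisture
    predictions (%) produced by the retained sub-models for that date.
    """
    return {stage: fmean(values) for stage, values in submodel_outputs.items()}


# Invented sub-model predictions (%) for two sampling dates.
outputs = {
    "September 11": [33.1, 33.3, 33.2],
    "September 25": [27.0, 27.2],
}
predicted = stage_predictions(outputs)
print({stage: round(value, 2) for stage, value in predicted.items()})
```

Averaging over several screened sub-models reduces the influence of any single bootstrap subset on the reported moisture content.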

Results and discussion
Near-infrared spectroscopy data and chemical reference value
Figure 4 shows the Fourier transform near-infrared spectra at the maize grain-filling stage after pre-processing with a Savitzky-Golay filter with a window size of 13. The seven spectral curves represent the average values of the different stages. The water absorption band ranges from 6900 cm⁻¹ to 7900 cm⁻¹. The blue curve at the bottom represents data sampled on August 21 and the pink curve at the top represents data sampled on October 2. The spectral data represent the change in moisture of the maize kernels in the grain-filling stage. The chemical reference values are shown in Table 1. It is observed that the moisture content of the maize kernels decreases rapidly from August 21, 2016 to September 4; after this date, the precipitation decreases. Because of the moderately rainy weather on the day before sampling on October 2, the moisture content of the maize kernels is higher than in the samples collected on September 25.
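The Savitzky-Golay pre-processing mentioned above can be illustrated with a small smoothing routine. The polynomial order is not stated in the text, so this sketch assumes quadratic (equivalently cubic) smoothing, for which the weights have a closed form; the edge points are simply left unsmoothed. Function names are hypothetical.

```python
def savgol_weights(window):
    """Smoothing weights of a quadratic/cubic Savitzky-Golay filter."""
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    m = window // 2
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom
            for i in range(-m, m + 1)]


def savgol_smooth(signal, window=13):
    """Smooth the interior points; the first and last window // 2 stay as is."""
    weights = savgol_weights(window)
    m = window // 2
    smoothed = list(signal)
    for centre in range(m, len(signal) - m):
        smoothed[centre] = sum(
            w * signal[centre + k] for w, k in zip(weights, range(-m, m + 1))
        )
    return smoothed


weights_13 = savgol_weights(13)
print(round(weights_13[6], 4))  # 0.1748, i.e. 25/143 at the window centre
```

A filter of this kind reproduces any quadratic trend exactly, so it suppresses high-frequency noise while leaving broad absorption bands largely unchanged.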

Analysis of the Bootstrap-SPXY-PLS moisture content monitoring model
The dataset resampled 500 times was considered the optimal dataset [39], i.e., count_max = 500. The sample sizes were 10, 20, and 50, referred to as X_ten, X_twenty, and X_fifty, respectively. After resampling, the datasets are referred to as X*_ten, X*_twenty, and X*_fifty, respectively. The data were processed using the SPXY method to create the modeling sets. Then, sub-models were created by cross-validation. Figure 5b shows the spectra of one sub-model of the data obtained on September 11. The characteristics of the two sets of spectra were identical, but the resampled spectra represent the optimal dataset with regard to the T values. However, the noise was retained.
a. Resampled spectra  b. Raw spectra
Figure 5 Comparison of raw spectra and resampled spectra obtained on September 11
The results of the screened and unscreened moisture content monitoring models for the different sample sizes at the grain-filling stage are shown in Table 2. The R², RMSECV, and RMSEP-mean values indicate better performance of the screened model, and the values of r_p are larger. However, when the number of samples is 10, the r_p value is lower. For the sample sizes of 20 and 50, the R² values of the models for all grain-filling stages are larger than 0.98 and the average relative increases are 0.27% and 0.24% (September 11) and 0.13% and 0.11% (September 25), respectively. The improvements are more apparent for a sample size of 10: the R² value of the optimized model of each grain-filling stage is 0.5% higher than that of the unscreened model. The R² values of the model are 0.9397 or higher, and the relative deviation of the mean values of RMSECV and RMSEP decreased by 0.04%.
These results demonstrate that the model has good predictive ability and that the predictive ability improves after screening. It can be seen from the r_p value that the number of samples does affect the stability of the model prediction.

Selection of the optimum number of factors for different sample sizes
The number of factors is one of the key parameters for ensuring the robustness of the Bootstrap-SPXY-PLS model because using the optimum number of factors improves the predictive ability of the model. After forming the optimized sample sets, we first select subsets for the different sample sizes and grain-filling stages and create the corresponding pre-models. By plotting the RMSECV and R² trends of the different pre-models, the optimal number of factors for each sample size and grain-filling stage is obtained. Figure 6 shows the RMSECV and R² values of the model for the different grain-filling stages, different sample sizes, and different factor numbers. Figure 6a shows the trend of the RMSECV and R² values based on the Bootstrap-SPXY-PLS pre-model sampled on September 11 for the sample sizes of 10, 20, and 50.
The overall trend of the RMSECV value can be divided into three parts. The first part ranges from factor = 0 to the factor of the maximum RMSECV value, which is 5 for the sample sizes of 10 and 20 and 7 for the sample size of 50. However, the R² value is low when the factor number is less than 5 or 7: it is only 0.2 when the sample size is 20, and the maximum is only about 0.8 when the sample size is 10 (line 1 in Figure 6a). In the second part, the RMSECV value gradually decreases and the R² value gradually increases. When the sample size is 10, the RMSECV falls to its minimum at a factor number of 8, where the R² value is 0.96 (line 2 in Figure 6a). When the sample sizes are 20 and 50, the RMSECV reaches its minimum in this part at a factor number of 10 and the corresponding R² values are 0.91 and 0.99, respectively (line 3 in Figure 6a). In the third part, the RMSECV value rises again, resulting in poor prediction performance of the model, although the corresponding R² value increases. Therefore, based on the RMSECV and R² results, the factor number at which the lowest RMSECV value is obtained in the second part is the optimal factor number for the model.
Figure 6b shows the RMSECV and R² values at a moisture content of 27% sampled on September 25. Because of the lower moisture content, the trend of the RMSECV value differs from that shown in Figure 6a: as the number of factors increases, the RMSECV curve increases stepwise. The curve can also be divided into three parts. The first part consists of the factor numbers less than or equal to 5 for the sample sizes of 10, 20, and 50. The second part consists of the factor numbers less than or equal to 10. The third part consists of the factor numbers larger than 11. The R² trend leads to a similar conclusion: the minimum value of the second part of the RMSECV curve corresponds to the optimum factor number. The optimum factor numbers for the sample sizes of 10, 20, and 50 are 10, 9, and 10, respectively, and the corresponding R² values are 0.963, 0.983, and 0.987, respectively.
It can be concluded from the data shown in Figure 6 that, for different moisture contents, the RMSECV value of the model stabilizes after an initial change for the small sample sizes of 10, 20, and 50.
If a larger factor number is not suitable for a model, the R² results of the pre-modeling and the range of the best factor number based on the RMSECV value can be used to obtain the optimum number of factors for the different grain-filling stages and different sample sizes.
The R² and RMSECV values shown in Figure 6 indicate that the moisture content and the sample size influence the optimum number of factors in the model. Therefore, the predictive ability of the model was improved by determining the optimal number of factors prior to creating the sub-models, as shown in Table 3.

Screening of the moisture content monitoring sub-models at the grain-filling stage for different sample sizes
Tables 4 and 5 show the screening results of the sub-models for the September 11 and September 25 sampling dates, respectively, for different sample sizes.
All sub-models were evaluated and screened using the R², RMSECV, RMSEP-mean, and the threshold RVP.
The sub-models with the minimum R², the maximum RMSECV and RMSEP-mean, or a threshold RVP of less than 1.99 were removed for the seven grain-filling stages. The sub-model screening results are shown in Table 6.

Comparison of the prediction values and reference values
The prediction results and the moisture content changes during the grain-filling stage are shown in Figure 7. At a sample size of 50, the average deviation of the 7 predicted moisture contents from the reference values at the grain-filling stage is 0.0918% and the maximum deviation is 0.3095% on September 18. At a sample size of 20, the average deviation is 0.057% and the maximum deviation of 0.1805% occurs on October 2. At a sample size of 10, the average deviation is 0.1078% and the maximum deviation of 0.3036% occurs on September 18. The Bootstrap-SPXY-PLS optimization model for analyzing the NIRS data thus yields a small deviation between the predicted and reference values of the moisture content during the grain-filling stage, and the reference and prediction curves nearly coincide. However, with the drying method, we only tested the samples in the laboratory and did not predict the parameters of other samples. The NIRS method provides rapid prediction and allows non-destructive testing, which is beneficial for maize breeding research.
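The deviation statistics reported above can be computed as in the sketch below. It assumes that "deviation" means the absolute difference between predicted and reference moisture content in percentage points; the function name and the values are invented and are not the data behind Figure 7.

```python
def deviation_summary(dates, reference, predicted):
    """Return (mean deviation, maximum deviation, date of the maximum)."""
    if not dates or not (len(dates) == len(reference) == len(predicted)):
        raise ValueError("inputs must be non-empty and of equal length")
    deviations = [abs(p - r) for r, p in zip(reference, predicted)]
    worst = max(range(len(deviations)), key=lambda i: deviations[i])
    mean_deviation = sum(deviations) / len(deviations)
    return mean_deviation, deviations[worst], dates[worst]


# Invented reference and predicted moisture contents (%).
sampling_dates = ["September 11", "September 18", "September 25"]
reference_values = [33.00, 30.00, 27.00]
predicted_values = [33.05, 30.30, 26.95]
mean_dev, max_dev, max_date = deviation_summary(
    sampling_dates, reference_values, predicted_values
)
print(round(mean_dev, 4), round(max_dev, 4), max_date)
```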

Conclusions
(1) We conducted a quantitative analysis of the moisture content of maize kernels at multiple time points during the grain-filling stage and created a PLS regression model based on the Bootstrap and SPXY optimization method for the sample sizes of 10, 20, and 50. The results indicated that the screened model performed better than the unscreened model. The screened model was developed for predicting the maize moisture content during the grain-filling stage based on the optimal number of factors. The predicted and reference values of the maize moisture content during the grain-filling stage were similar for the sample sizes of 10, 20, and 50; the average deviations between the predicted and reference values were 0.1078%, 0.057%, and 0.0918%, respectively.
(2) By determining the R² and RMSECV values of the PLS sub-models, the optimal number of factors of the model for the different grain-filling stages and different sample sizes was obtained. The results showed that these model evaluation indices identified the optimal number of factors and, therefore, gave better prediction results.
(3) The comprehensive evaluation parameter RVP was proposed for the sub-model screening. The RVP, R², RMSECV, and RMSEP-mean values were used as comprehensive screening parameters for the sub-models. The experimental results showed that the R² values after screening had an average relative increase of 0.5%, 0.27%, and 0.24% for the sample sizes of 10, 20, and 50, respectively.
The results of the regression prediction using the Bootstrap-SPXY-PLS optimization model on small sample sets indicated that the moisture content can be predicted and monitored at the maize grain-filling stage. The model is based on destructive sampling but uses only one-tenth of the number of maize grains used for a model based on non-destructive sampling. Using a low number of maize grains at the filling stage is very important for crop breeding because the maize grains have not reached maturity. This is an advantage of the destructive-sampling model used in this study. However, a certain amount of water is lost during grinding. Therefore, in the future, we will conduct in-depth research on data processing algorithms using a small sample size. We will also investigate the use of multi-spectral data and methods suitable for collecting spectral information on maize ears directly in the field. The results of this study provide a new method for moisture content monitoring during crop growth stages using NIRS.