Spectroscopic measurement approaches in evaluation of dry rubber content of cup lump rubber using machine learning techniques

Dry rubber content (DRC) is an important factor in evaluating the quality of cup lump rubber. The DRC analysis requires prolonged laboratory validation. To develop fast and effective DRC determination methods, this study proposed methods to evaluate the DRC of cup lump rubber using different spectroscopic measurement approaches. This involved a complete fundamental analysis leading to an efficient measurement method based on either point-based measurement using an NIR reflectance spectrometer or area-based measurement using hyperspectral imaging. A dataset of 120 samples was randomly divided into a calibration set of 90 samples and a validation set of 30 samples. To obtain an average spectrum representing each cup lump rubber sample, the spectral data were collected by locating the spectrometer at several points for the point-based measurement and by scanning the sample surface for the area-based measurement. The spectral data were calibrated against the reference values using partial least squares regression (PLSR) and the least-squares support vector machine (LS-SVM). The experiments showed that the area-based measurement approach performed outstandingly with both algorithms in predicting the DRC of cup lump rubber and was clearly better than the point-based measurement approach. The best PLSR predictions, expressed as the coefficient of determination (R²), the root mean square error of prediction (RMSEP) and the residual predictive deviation (RPD), were 0.99, 0.72% and 15.17, respectively, while the best LS-SVM predictions were 0.99, 0.64% and 16.83, respectively. In summary, the area-based measurement combined with the LS-SVM prediction model provided a highly accurate estimate of the DRC of cup lump rubber.


Introduction 
Natural rubber is an important economic crop in Southeast Asia. Thailand is one of the leading producers of natural rubber with about a 35% share of the total production. According to the Office of Agricultural Economics [1], Thailand exported 2.2 million t of natural rubber in 2019, which provided both export revenue and employment. Natural rubber is used extensively by many manufacturing companies in the rubber industry, either alone or in combination with other materials. Currently, farmers produce rubber not only in the form of latex but also in various types of rubber products, for example Standard Thai Rubber (STR), ribbed smoked sheet (RSS), air dried sheet (ADS), crepe, skim rubber and concentrated latex. STR 20 is a high grade of STR rubber and makes up approximately 60% of all STR rubber in the export market. Due to its physical properties, this high-grade rubber (STR 20) is suitable for manufacturing high quality rubber products for the automobile industry, for surgical and pharmaceutical use, and for molded and extruded rubber products in general.
STR 20 rubber originates from cup lump rubber under a quality control process. In the major rubber-growing areas of Thailand, cup lump rubber is formed by coagulating fresh latex with acid in a collection cup. The coagulated cup lumps are collected daily and processed into dry forms for trading. The process to produce cup lump rubber is simpler than that for other forms of natural rubber such as ribbed smoked sheet or air dried sheet rubber; consequently, most rubber farmers prefer to sell their natural rubber as cup lump rubber rather than in other forms. The price of cup lump rubber depends mainly on its dry rubber content (DRC). In the current market, traders roughly estimate the DRC of cup lump rubber by seeing and touching samples without reference to any standards. The resultant difficulty in applying a consistent approach motivated the current research to determine a model that could predict the DRC of cup lump rubber rapidly and precisely.
Near-infrared spectroscopy (NIRS) is a type of vibrational spectroscopy based on the wavelength range of 750 to 2500 nm (wavenumbers: 13300 to 4000 cm⁻¹).
As a fast and non-destructive method, NIRS has been widely used for rapid analysis of the moisture, protein and fat content of a wide variety of agricultural and food produce [2,3]. It has also proven able to quantify trace amounts of moisture in a raw rubber sheet. The results indicated that a strong combination absorption band for water was at approximately 1940 nm and that the first, second and third overtones were at 1450 nm, 970 nm and 760 nm, respectively [4]. NIRS has also been used successfully to evaluate the moisture content of natural rubber in various forms such as cup lump rubber [5] and concentrated latex [6].
Hyperspectral imaging, like other spectroscopic techniques, is usually carried out in the Vis-NIR (400-1000 nm) or NIR (1000-1700 nm) ranges for food and agricultural produce [7]. A hyperspectral image, known as a hypercube [8], is made up of hundreds of contiguous wavebands recorded at each spatial position of the target being studied. The hypercube allows for visualization of the biochemical constituents of a sample, separated into given areas of the image, depending on their spectral properties. In recent years, there has been a rapid expansion in the use of hyperspectral imaging in various analytical processes over a broad range of food and agricultural produce such as poultry [9], pork [10] and fruits [11]. The current study proposed a point-based approach using NIR reflectance spectrometry and an area-based approach using a hyperspectral imaging system to demonstrate the capabilities of hyperspectral imaging compared to the normal spectroscopic measurement method. In addition, two algorithms, partial least squares regression (PLSR) and least-squares support vector machine (LS-SVM), were compared to assess the performance of the proposed models for assessing the DRC of cup lump rubber.

Cup lump rubber preparation
The fresh latex was sampled from a rubber plantation in eastern Thailand in October 2017. In total, 120 cup lump rubber samples were prepared from fresh latex. Each sample was coagulated using 0.3% diluted formic acid which took approximately 30 min. At this stage, the liquid serum inside the cup lump rubber was released and separated off. After the cup lump rubber had completely coagulated, all cup lump rubber pieces were weighed to determine the initial mass. The dry matter of each cup lump rubber was calculated by multiplying its initial mass by the DRC of fresh latex measured according to the laboratory method defined by AOAC standards [12] . All samples were stored and left to dry at ambient temperature for 1 d, 3 d, 5 d, 7 d, 9 d and 12 d. In each period, 20 samples (15 samples for calibration and 5 samples for validation) were weighed again for their final mass. Then, the DRC (%) was calculated using Equation (1). Table 1 shows the variation in the DRC of cup lump rubber in each period, with the DRC increasing with increasing storage time.
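Equation (1) is not reproduced in this text. One reading consistent with the description above is that the dry matter of a sample stays constant during storage, so that the DRC at a given time is the dry matter divided by the mass at that time. The minimal sketch below works under that assumption; the function name and the example numbers are illustrative and are not taken from the study.

```python
def drc_percent(initial_mass_g, latex_drc_percent, final_mass_g):
    """Estimate the DRC (%) of a cup lump sample after storage.

    Assumption: dry matter = initial mass x DRC of the fresh latex,
    and the dry matter does not change while the sample loses water.
    """
    if initial_mass_g <= 0 or final_mass_g <= 0:
        raise ValueError("masses must be positive")
    dry_matter_g = initial_mass_g * latex_drc_percent / 100.0
    return 100.0 * dry_matter_g / final_mass_g


# Hypothetical sample: 200 g at coagulation, latex DRC of 35 %,
# 100 g after drying.
example_drc = drc_percent(200.0, 35.0, 100.0)
```

Under this reading, the DRC rises as the sample dries because the final mass falls while the dry matter is fixed, which agrees with the trend reported for Table 1.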

Spectral measurement methods
The cup lump spectrum was collected using two measurement approaches. The first approach, called point-based measurement, used an NIR reflectance spectrometer (Texas Instruments, Inc., USA) covering the range 901.35-1700.64 nm with 3.9 nm resolution.
The NIR reflectance spectrometer was placed on the surface of the cup lump rubber sample to acquire spectra at four different positions, as shown in Figure 1a. The spectrum at each measurement point was an average of 6 individual readings. The spectra from the four positions were then averaged and transferred to the Matlab software package for spectral preprocessing and multivariate analysis. The second approach, called area-based measurement, used a hyperspectral imaging system manufactured by SPECIM, Spectral Imaging Ltd., Finland.
The system was composed of a temperature-stabilized InGaAs camera equipped with an imaging spectrograph, a fore objective lens (OLES15), an illumination unit (tungsten halogen lamps, 20 W), a translation stage and a computer equipped with the data acquisition and control software. The hyperspectral imaging system was configured to scan at 10 mm/s. The illumination unit was inclined at 35 degrees to the vertical to direct the reflected light properly to the camera lens. The spectrum obtained by scanning the surface of the cup lump rubber could be optimized by adjusting the camera position and the focal length of the lens, as shown in Figure 1b. The system provided 256 bands over a wavelength range from 864.53 to 1695.08 nm with 3.3 nm spectral resolution and 320 pixels of spatial resolution. White and dark reference images were acquired in each image acquisition. The raw reflectance image was based on Equation (2).
$$R = \frac{I_0 - I_d}{I_w - I_d} \qquad (2)$$
where $R$ is the raw reflectance image; $I_0$ is the sample image; $I_d$ is the dark image acquired by completely closing the electro-mechanical shutter of the camera; and $I_w$ is the white reference image, acquired from a white standard with 99% diffuse reflectance. The image spectra were stored in the computer using the SPECIM software interface and analyzed using the Matlab software.
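The dark and white reference correction is commonly written as R = (I0 - Id) / (Iw - Id). Assuming that standard form, the following NumPy sketch applies it element-wise to a hypercube. The array shapes, the `eps` guard and the example values are illustrative assumptions; the study itself used the SPECIM interface and Matlab.

```python
import numpy as np


def calibrate_reflectance(sample, white, dark, eps=1e-12):
    """Apply R = (I0 - Id) / (Iw - Id) element-wise.

    All inputs share one shape (or broadcast to it), for example
    (lines, pixels, bands) for a hypercube.
    """
    sample = np.asarray(sample, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Mark pixels where white and dark readings coincide as NaN
    # instead of dividing by zero.
    denom = np.where(np.abs(denom) < eps, np.nan, denom)
    return (sample - dark) / denom


# Hypothetical 2 x 2 pixel image with 3 bands.
dark = np.full((2, 2, 3), 100.0)
white = np.full((2, 2, 3), 1100.0)
sample = np.full((2, 2, 3), 600.0)
reflectance = calibrate_reflectance(sample, white, dark)
```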

Spectral preprocessing
In reflectance mode, the reflected light travels different distances from the sample to the detector. A longer light path produces a lower relative reflectance value. This shifts the spectrum and affects the spectral model. Pretreatment of spectra is a required part of spectral analysis and can improve the accuracy of analysis [13]. For this reason, spectral pretreatment was applied to the cup lump rubber spectra to reduce problems associated with noise, light scattering and external effects prior to implementing the regression analysis. To reduce the light scattering, the standard normal variate (SNV) transformation [14] was applied first, before the other spectral transformations. The first and second derivative transformations were used to remove the additive and multiplicative effects in the spectra [15]. Thus, three combinations of spectral pretreatments were selected (SNV, SNV followed by the first derivative, and SNV followed by the second derivative) to compare their influences on model performance. Figure 2a shows the pre-processing procedures for the point-based measurement. The aim was to obtain the average spectrum of each sample from the 4-point spectral measurement using the different pretreatment methods.
The area-based measurement, as shown in Figure 2b, also considered the spatial dimension of the spectrum. All spectra of the images were first preprocessed, and an average spectrum of the cup lump image was obtained after background subtraction using a thresholding technique, chosen for its simplicity and low computational load. Inspection of the spectra indicated that at approximately 1085 nm, 1274 nm and 1450 nm, the reflectance values of the sample and the background were clearly different, as shown in Figure 3a. At these wavelengths, the threshold values were defined in terms of the upper and lower limits of the reflectance value that covered only the cup lump rubber spectra. Therefore, if a given pixel spectrum had a reflectance value within these limits, that pixel was assigned to the region of interest (ROI). The averaged ROI spectrum was used to represent the spectrum of a cup lump sample. Finally, all average spectra obtained from both point-based and area-based measurements were extracted in the range 942 to 1650 nm for further analysis, because strong noise was found in the spectrum outside this range. Figure 3b-d shows a group of the average spectra in the calibration set, preprocessed using the different spectral pretreatments and prepared for model development.
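The pretreatment and ROI steps described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' Matlab code: plain finite differences stand in for the derivative filter (the text does not say which filter was used), and the band indices and threshold limits in the example are hypothetical.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def derivative(spectra, order=1):
    """Finite-difference derivative along the wavelength axis.

    Assumption: simple differences replace the unspecified filter.
    """
    return np.diff(np.asarray(spectra, dtype=float), n=order, axis=1)


def roi_mean_spectrum(cube, band_limits):
    """Average the spectra of pixels whose reflectance lies inside
    [low, high] at every listed band index.

    cube: array (rows, cols, bands); band_limits: {band: (low, high)}.
    """
    cube = np.asarray(cube, dtype=float)
    mask = np.ones(cube.shape[:2], dtype=bool)
    for band, (low, high) in band_limits.items():
        mask &= (cube[:, :, band] >= low) & (cube[:, :, band] <= high)
    if not mask.any():
        raise ValueError("no pixel falls inside the thresholds")
    return cube[mask].mean(axis=0), mask


# Hypothetical 2 x 2 image with 4 bands: one bright background pixel
# (0.9) and three sample pixels (0.2, 0.3, 0.4).
cube = np.array([[[0.9] * 4, [0.2] * 4],
                 [[0.3] * 4, [0.4] * 4]])
mean_spectrum, mask = roi_mean_spectrum(cube, {1: (0.1, 0.5)})

# Two spectra that differ only by a multiplicative factor.
toy = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]])
toy_snv = snv(toy)
toy_d2 = derivative(toy_snv, order=2)
```

In the toy example, the two spectra differ only by a scale factor, and SNV maps them onto the same curve, which is the scatter correction the pretreatment aims for.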

Calibration model
Partial least squares regression
The PLSR model has become the most frequently applied model in NIR spectroscopy [16].
The PLSR algorithm can effectively remove the co-linearity problem of spectral data [17]. A PLSR model expresses the linear relationship between the independent variables or spectral variables (X) and the dependent variables or reference values (Y). The optimal number of latent variables (LVs) is a key parameter in constructing a calibration model and directly affects the predictive performance of the model. A maximum of approximately 10 LVs is generally accepted [18]. A low number of LVs is desirable to avoid including signal noise in the model [19], because using too many LVs causes over-fitting. The optimal number of LVs should therefore be selected to avoid over-fitting while maximizing the covariance between the X and Y spaces. In this study, the PLSR algorithm was implemented in Matlab R2017a (The MathWorks, Natick, MA, USA).
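To make the role of the latent variables concrete, the following is a minimal single-response PLS (PLS1) sketch based on the NIPALS algorithm, written in NumPy. It is an assumption that this variant resembles what the Matlab implementation does internally; the function names and the synthetic data are illustrative only.

```python
import numpy as np


def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean        # deflated (residual) blocks
    W, P, Q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr                      # direction of maximum covariance
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xr @ w                         # scores of this latent variable
        tt = t @ t
        p = Xr.T @ t / tt                  # X loadings
        q = yr @ t / tt                    # y loading
        Xr = Xr - np.outer(t, p)           # deflate X
        yr = yr - q * t                    # deflate y
        W.append(w)
        P.append(p)
        Q.append(q)
    if not W:
        raise ValueError("y has no covariance with X")
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    coef = W @ np.linalg.solve(P.T @ W, Q)  # regression vector in X space
    return {"coef": coef, "x_mean": x_mean, "y_mean": y_mean}


def pls1_predict(model, X):
    """Predict reference values for new spectra."""
    X = np.asarray(X, dtype=float)
    return (X - model["x_mean"]) @ model["coef"] + model["y_mean"]


# Synthetic, noise-free data (not from the study): 30 samples, 5 variables.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(30, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_cal = X_cal @ true_coef + 3.0
model = pls1_fit(X_cal, y_cal, n_lv=5)
X_val = rng.normal(size=(10, 5))
y_val_pred = pls1_predict(model, X_val)
```

With as many LVs as there are independent variables, PLS1 reduces to ordinary least squares, which is why the noise-free toy data are reproduced exactly; real spectra are collinear and noisy, so far fewer LVs than variables are used.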

Least-squares support vector machine
The support vector machine (SVM) was primarily designed for data classification tasks [20]. The modified version of SVM called LS-SVM [21] is a simpler, robust approach for the classification and regression analysis of linear and nonlinear multivariate problems. Recently, it has been widely used in the area of chemometrics [22-25] and in hyperspectral imaging applications [26-28]. In the current study, LS-SVM was used to create a calibration model for predicting the dry rubber content of cup lump rubber. The model parameters of LS-SVM regression were found by solving the optimization task shown in Equation (3).
$$\min_{w,b,e} J(w,e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2} \qquad (3)$$
subject to $y_i = w^{T} \varphi(x_i) + b + e_i$, $i = 1, \ldots, N$,
where $e_i$ is the error variable; $\gamma$ is the regularization parameter, which penalizes the error; $\varphi(\cdot)$ is the kernel space mapping function that maps a sample from the original space to a vector in the high-dimensional feature space; $w$ is the weight vector in the feature space; and $b$ is the bias. From this optimization problem, the Lagrange function is obtained as follows:
$$L(w,b,e;a) = J(w,e) - \sum_{i=1}^{N} a_i \left\{ w^{T} \varphi(x_i) + b + e_i - y_i \right\} \qquad (4)$$
where $a_i$ are the Lagrange multipliers. When the Lagrange equation and the Karush-Kuhn-Tucker conditions for optimality are combined, a set of linear equations is formulated and solved. Eventually, the LS-SVM regression model can be expressed as Equation (5):
$$y(x) = \sum_{i=1}^{N} a_i K(x, x_i) + b \qquad (5)$$
where $K(x, x_i)$ is the kernel function, which in the current study was a Gaussian radial basis function (RBF) as shown in Equation (6):
$$K(x, x_i) = \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{\sigma^{2}} \right) \qquad (6)$$
This algorithm was run in Matlab R2017a using the LS-SVMlab toolbox [29].
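For regression, combining the Lagrange function with the Karush-Kuhn-Tucker conditions leads to one linear system in the bias b and the multipliers a_i. The NumPy sketch below solves that system directly. It is a minimal illustration rather than the LS-SVMlab implementation: the kernel convention exp(-||x - x_i||^2 / sigma2) is assumed, and the parameter values and toy data are arbitrary.

```python
import numpy as np


def rbf_kernel(A, B, sigma2):
    """K[i, j] = exp(-||A_i - B_j||^2 / sigma2) (assumed convention)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :]
    sq = sq - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)


def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, a]^T = [0, y]^T."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0                          # constraint: sum(a_i) = 0
    M[1:, 0] = 1.0                          # bias column
    M[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(M, rhs)
    return {"b": sol[0], "a": sol[1:], "X": X, "sigma2": sigma2}


def lssvm_predict(model, X_new):
    """Evaluate y(x) = sum_i a_i K(x, x_i) + b."""
    K = rbf_kernel(X_new, model["X"], model["sigma2"])
    return K @ model["a"] + model["b"]


# Toy nonlinear problem (not from the study): learn y = sin(x).
x_train = np.linspace(0.0, 2.0 * np.pi, 40).reshape(-1, 1)
y_train = np.sin(x_train).ravel()
model = lssvm_fit(x_train, y_train, gamma=1000.0, sigma2=1.0)
x_test = np.linspace(0.5, 5.5, 11).reshape(-1, 1)
y_test_pred = lssvm_predict(model, x_test)
```

Because the inequality constraints of the standard SVM are replaced by equalities, training reduces to one call to a linear solver, at the cost of a dense solution in which every calibration sample carries a non-zero multiplier.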

Model evaluation
The reliability and predictive efficiency of the above-mentioned models were compared based on the following statistics. The coefficient of determination (R²) provides information on the goodness-of-fit of a model. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables. The root mean square error (RMSE) is the square root of the mean squared difference between the reference values $y_i$ and the predicted values $\hat{y}_i$. The residual predictive deviation (RPD) relates the spread of the reference values to the prediction error and is calculated by dividing the standard deviation (SD) of the reference values by the RMSE. Interpretation of the RPD depends on the material analyzed. According to certain literature, if the RPD is less than 2, the model performance is insufficient for prediction, but there is room for improvement [30]. An RPD between 2 and 5 provides adequate accuracy for good estimations, an RPD between 5 and 10 indicates high accuracy and is suitable for analysis analogous to that of the reference method, and an RPD higher than 10 indicates that the model shows excellent performance [31,32]. In general, a good model should yield high values of R² and RPD while producing low values of RMSE. The parameters used are presented in Equations (7)-(9).
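Equations (7)-(9) are not reproduced in this text. The sketch below computes the three statistics under common definitions, which are assumptions here: R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual, and RPD as the sample standard deviation (n - 1 denominator) of the reference values divided by the RMSE. The example values are invented.

```python
import numpy as np


def regression_metrics(y_ref, y_pred):
    """Return R2, RMSE and RPD for reference and predicted values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_ref - y_pred
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    rmse = np.sqrt(np.mean(residual ** 2))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "rpd": np.std(y_ref, ddof=1) / rmse,
    }


# Hypothetical DRC values (%) with errors of +/- 1 percentage point.
metrics = regression_metrics([50.0, 60.0, 70.0, 80.0],
                             [51.0, 59.0, 71.0, 79.0])
```

Since the RPD divides the spread of the reference values by the error, a dataset covering a wide DRC range can reach a high RPD even when the absolute error is unchanged, which is why the RMSE should always be reported alongside it.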

Results and discussion
The entire dataset of 120 samples used in this experiment was randomly divided into a calibration set of 90 samples for developing the models and a validation set of 30 samples for model testing. A suitable number of latent variables (LVs) must be identified in the development of the PLSR model. In general, the optimal number of LVs is found using a cross-validation technique [33]. In the current study, leave-one-out cross-validation (LOOCV) was implemented to locate the optimal number of LVs. This algorithm partitions the data into k subsets, where k is equal to the total number of observations in the calibration set, so that each observation is used once as the test set. In each round, models were built with the number of LVs increased one at a time. The number of LVs that resulted in the minimum mean square error (MSE) was chosen as the optimal number for the model. In this study, the PLSR models originating from the different kinds of spectral pretreatment (SNV, SNV + the first derivative, and SNV + the second derivative) had optimal LVs of 10, 10 and 8, respectively, for the point-based measurement and 12, 10 and 15, respectively, for the area-based measurement. Unlike the PLSR model, the LS-SVM model used a Gaussian RBF as the kernel function and needed two parameters, namely the squared bandwidth of the Gaussian curve (σ²) and the regularization parameter (γ) that determines the trade-off between training error minimization and smoothness [29]. These parameters were tuned using coupled simulated annealing for the initial estimate and the simplex method for fine tuning. Table 2 shows the tuned parameters for the point-based and area-based measurements based on the different kinds of spectral pretreatment.
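The LOOCV search for the number of LVs can be sketched as follows. Each observation is left out once, a PLS1 model is built on the rest, and the left-out sample is predicted with 1, 2, ..., max_lv latent variables; the LV count with the lowest mean squared error is kept. The helper names and the synthetic data are assumptions made for illustration, and the compact PLS1 routine is a generic NIPALS variant rather than the authors' code.

```python
import numpy as np


def pls1_predict_all(X_tr, y_tr, X_te, max_lv):
    """Predict X_te with 1..max_lv LVs; returns shape (max_lv, n_test)."""
    x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
    Xr, yr = X_tr - x_mean, y_tr - y_mean
    Te = X_te - x_mean
    pred = np.full(X_te.shape[0], y_mean)
    out = []
    for _ in range(max_lv):
        w = Xr.T @ yr
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # no covariance left to model
            out.append(pred.copy())
            continue
        w = w / norm
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        q = yr @ t / tt
        t_te = Te @ w                    # test scores on this LV
        pred = pred + q * t_te           # add this LV's contribution
        Te = Te - np.outer(t_te, p)      # deflate test block
        Xr = Xr - np.outer(t, p)         # deflate calibration block
        yr = yr - q * t
        out.append(pred.copy())
    return np.array(out)


def loocv_select_lv(X, y, max_lv):
    """Return (best number of LVs, MSE for each LV count)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = len(y)
    sq_err = np.zeros((max_lv, n))
    for i in range(n):
        keep = np.arange(n) != i         # leave observation i out
        pred = pls1_predict_all(X[keep], y[keep], X[i:i + 1], max_lv)
        sq_err[:, i] = (pred[:, 0] - y[i]) ** 2
    mse = sq_err.mean(axis=1)
    return int(np.argmin(mse)) + 1, mse


# Noise-free synthetic data (not from the study): 25 samples, 4 variables.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(25, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 3.0]) + 10.0
best_lv, mse_curve = loocv_select_lv(X_toy, y_toy, max_lv=4)
```

The toy data only check the mechanics: without noise, the error keeps falling until all variables are used. With real, noisy spectra the MSE curve typically flattens or rises again once extra LVs start to model noise, and the minimum marks the number to keep.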
The performance of all proposed models was compared based on validation set testing, and the results are listed in Table 3. Both the PLSR and LS-SVM models provided a good prediction of dry rubber content based on the respective values of RMSEP, R² and RPD for both measurement approaches, because the reflectance values varied strongly at 960-1000 nm, at approximately 1200 nm and at 1450-1600 nm when the rubber content differed. The variations at approximately 970 nm and 1450 nm could be associated with the second and first overtones of the O-H vibration of water, respectively [34,35]. Rubber, the main ingredient of cup lump rubber, is a polymer partially constructed with C-H bonds, whose second overtone vibration appears at approximately 1200 nm [6], as shown in Figure 4a. This allowed the calibration models to capture those differences and subsequently to estimate the DRC of the cup lump rubber accurately. In addition, the predictive ability of the area-based measurement method was clearly better than that of the point-based measurement, because the spectrum shape derived from area-based measurement was more complete than that from the point-based approach, which generated deviation in the spectrum. These differences are evident when the point-based spectra with various spectral pretreatments in Figures 4b, 4d and 4f are compared with the area-based spectra in Figures 4a, 4c and 4e. For example, the spectrum shapes in Figure 4e at approximately 1000-1100 nm, 1200-1400 nm and 1500-1600 nm were smoother and clearer than the spectrum shapes in Figure 4f.
For point-based measurement, the models with SNV pretreatment had the best accuracy with both the PLSR and LS-SVM algorithms, perhaps because there was less deviation in these spectra than in the others. In contrast, for the area-based measurement, the spectra pre-processed using SNV followed by the second derivative produced the best results, because these spectra were more clearly defined, so that the additional derivative technique could effectively reduce additive and multiplicative effects in the spectrum.
Therefore, for point-based measurement, the PLSR model with SNV provided the best prediction with values of R² = 0.96, RMSEP = 2.20% and RPD = 4.84, while the best prediction values using LS-SVM were 0.97, 1.84% and 5.61, respectively. For area-based measurement, the PLSR model with SNV + second derivative gave the best prediction with values for R², RMSEP and RPD of 0.99, 0.72% and 15.17, respectively, and the best prediction values using LS-SVM were 0.99, 0.64% and 16.83, respectively. Additionally, the results of area-based measurement had very high RPD values (RPD > 10), indicating that the models were acceptable for use in all analytical tasks.
Figure 4 Average spectra of cup lump rubber for different degrees of DRC
The results indicated that the LS-SVM model was slightly better than the PLSR model. The nonlinearity of the LS-SVM model meant that it had higher sensitivity than a PLSR model, so that the LS-SVM models could track the spectra more closely than the PLSR models and eventually predicted the DRC better than the PLSR models.
These findings confirmed that a nonlinear regression method such as LS-SVM was a good alternative algorithm when dealing with spectroscopic data. The scatter plots in Figure 5 depict the best outcomes of the PLSR and LS-SVM models for both point-based and area-based measurements. These plots show that the predictions of the models derived from area-based measurement closely matched the reference DRC values.

Conclusions
The experiments showed that the area-based measurement method performed outstandingly in predicting the DRC of cup lump rubber and was more accurate than the point-based measurement. The best prediction of PLSR had values of R² = 0.99, RMSEP = 0.72% and RPD = 15.17, while the best prediction of LS-SVM had values of 0.99, 0.64% and 16.83, respectively. Although the point-based measurement method could predict the DRC of cup lump rubber with good accuracy, it still had problems of reliability and repeatability due to the lack of a spatial dimension. On the other hand, the area-based measurement approach was able to accommodate the spatial dimension, which provided extensive spectral data from each sample, so that this approach could provide more accurate results. Nevertheless, data analysis using area-based measurement needs more computational resources and more processing time, because its spectral dataset is large and it involves many processing steps such as image processing, spectral processing and multivariate regression.
In addition, the hyperspectral imaging system was not portable and was quite costly. However, its implementation would be possible in a large-scale rubber industry with a short payback period. Finally, this study confirmed the potential of the area-based measurement method using hyperspectral imaging and machine learning techniques for predicting the DRC with very high accuracy.