Association rule mining algorithm based on Spark for pesticide transaction data analyses

Xiaoning Bai, Jingdun Jia, Qiwen Wei, Shuaiqi Huang, Weicheng Du, Wanlin Gao

Abstract


With the development of smart agriculture, data in the field of pesticide regulation have accumulated to a considerable scale; the Pesticide National Data Center alone collects more than 10 million pesticide transaction records daily. However, owing to outdated technical means, the existing pesticide supervision data remain largely unmined and underused. The Apriori algorithm is a classic association rule mining algorithm, but it must traverse the transaction database multiple times, which imposes a heavy IO burden. Spark is an emerging parallel computing framework for big data whose advantages include in-memory computing and resilient distributed datasets (RDDs); compared with the Hadoop MapReduce framework, its IO performance is greatly improved. Therefore, this paper proposed an improved Apriori algorithm based on the Spark framework, ICAMA, in which a MapReduce process counts the support of candidate itemsets and then generates the next round of candidates. Experimental comparison showed that when the data volume exceeds 250 MB, the Spark-based Apriori algorithm outperforms the traditional Hadoop-based Apriori algorithm by 20%, and the improvement becomes more pronounced as the data volume increases.
Keywords: Spark, association rule mining, ICAMA algorithm, big data, pesticide regulation, MapReduce
DOI: 10.25165/j.ijabe.20191205.4881

Citation: Bai X N, Jia J D, Wei Q W, Huang S Q, Du W C, Gao W L. Association rule mining algorithm based on Spark for pesticide transaction data analyses. Int J Agric & Biol Eng, 2019; 12(5): 162–166.
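The abstract does not give the authors' implementation details, but the candidate-support counting it describes follows the standard Spark pattern for parallel Apriori. The sketch below is a minimal, illustrative PySpark example (not the ICAMA code): candidate itemsets are broadcast to the workers, each transaction emits the candidates it contains, and reduceByKey aggregates the support counts in memory, which avoids rescanning the transaction database from disk on every pass. All names, data, and thresholds are hypothetical.

```python
# Minimal sketch of one Apriori support-counting pass on Spark (illustrative only,
# not the authors' ICAMA implementation).
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="apriori-support-count")  # assumed local or cluster setup

min_support = 2                      # absolute support threshold (illustrative)
transactions = sc.parallelize([      # toy stand-in for pesticide transaction records
    {"A", "B", "C"}, {"A", "C"}, {"A", "B", "C", "D"}, {"B", "D"},
])

# Candidate 2-itemsets; in a full Apriori loop these come from joining and pruning
# the frequent 1-itemsets of the previous pass (assumed given here).
candidates = [frozenset(c) for c in combinations(["A", "B", "C", "D"], 2)]
bc_candidates = sc.broadcast(candidates)

# Map: each transaction emits (candidate, 1) for every candidate it contains.
# Reduce: sum the counts, then keep only candidates meeting the support threshold.
frequent = (transactions
            .flatMap(lambda t: [(c, 1) for c in bc_candidates.value if c <= t])
            .reduceByKey(lambda a, b: a + b)
            .filter(lambda kv: kv[1] >= min_support)
            .collect())

print(frequent)   # frequent 2-itemsets with their support counts
sc.stop()
```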


