MapReduce Parallel Implementation of Improved K-means Clustering Algorithm on Spark Platform
DOI: 10.25236/csam.2019.069
Author(s)
Huang Suyu, Tan Lingli
Corresponding Author
Huang Suyu
Abstract
Cloud computing builds on distributed computing, parallel computing, and grid computing, and provides a new distributed parallel computing environment. Its emergence has made networked, service-oriented data mining a new trend. Clustering differs from classification. In classification, the training samples carry known class labels, and the goal is to extract classification rules from the training set so that objects with unknown labels can be assigned to a class. In clustering, all data objects must be divided into clusters according to some similarity measure, without any prior information about the classes of the target data. For this reason, cluster analysis is regarded as a form of unsupervised learning.
Keywords
Spark Platform, K-Means Clustering, MapReduce, Parallelization
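
The abstract describes clustering as grouping unlabeled objects by some measure. As a rough illustration only, the sketch below runs standard K-means on a single machine, with each iteration written as a map step (assign a point to its nearest center) and a reduce step (sum the points per cluster). It is not the authors' improved algorithm and does not use Spark; the function names `assign`, `combine`, and `kmeans`, the Euclidean distance measure, and the fixed starting centers are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(point, centers):
    """Map step: return (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def combine(a, b):
    """Reduce step: add two (coordinate-sum, count) pairs."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans(points, centers, max_iter=20):
    """Standard K-means from caller-supplied starting centers (an assumption)."""
    for _ in range(max_iter):
        # Aggregate a (coordinate-sum, count) pair per cluster index.
        partial = {}
        for p in points:
            key, value = assign(p, centers)
            partial[key] = combine(partial[key], value) if key in partial else value
        # New center = mean of assigned points; an empty cluster keeps its center.
        new_centers = [
            tuple(s / partial[i][1] for s in partial[i][0]) if i in partial else c
            for i, c in enumerate(centers)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = [assign(p, centers)[0] for p in points]
    return centers, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

No class labels appear anywhere in the input; the grouping comes entirely from the distance measure. On a platform such as Spark, the map and reduce steps would typically run per partition across a cluster, but that mapping is not shown here.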