MapReduce Parallel Implementation of Improved K-means Clustering Algorithm on Spark Platform

Huang Suyu, Tan Lingli

DOI: 10.25236/csam.2019.069

Corresponding Author

Huang Suyu

Abstract

Cloud computing grew out of distributed computing, parallel computing, and grid computing, and provides a new distributed parallel computing environment. Its emergence has made networked, service-oriented data mining a new trend. Clustering differs from classification. A classification model starts from sample data whose class labels are known, and its purpose is to extract classification rules from the training sample set so that objects with unknown class labels can be identified. Clustering, by contrast, must divide all data objects into clusters according to some measure without any advance knowledge of the classes in the target data. For this reason, cluster analysis is also called unsupervised learning.
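
To illustrate how unlabeled data can be divided into clusters in a MapReduce style, the following is a minimal sketch of standard K-means written as a map step and a reduce step. It is not the improved algorithm of this paper, whose details the abstract does not give. The function names (`assign_point`, `merge`, `kmeans_step`, `kmeans`), the Euclidean distance measure, and the stopping rule are all assumptions made for illustration, and the code runs on plain Python lists rather than on Spark.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_point(point, centroids):
    """Map step (hypothetical helper): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def merge(a, b):
    """Reduce step (hypothetical helper): add two (coordinate sum, count) partials."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_step(points, centroids):
    """One iteration: map every point, reduce by key, then recompute centroids."""
    partials = {}
    for point in points:
        key, value = assign_point(point, centroids)
        partials[key] = merge(partials[key], value) if key in partials else value
    # Assumption: a centroid that attracted no points keeps its previous position.
    return [
        tuple(s / partials[i][1] for s in partials[i][0]) if i in partials else c
        for i, c in enumerate(centroids)
    ]


def kmeans(points, centroids, max_iter=20):
    """Repeat the step until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_step(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

No class labels appear anywhere in the sketch: the clusters come only from the distance measure, which is what distinguishes clustering from classification. On Spark, the same shape would plausibly correspond to an RDD `map` followed by `reduceByKey` with `merge`, with the current centroids shared across workers, though that mapping is likewise an assumption here.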

Keywords

Spark Platform, K-Means Clustering, MapReduce, Parallelization