NoCS2: Topic-Based Clustering of Big Data Text Corpus in the Cloud
Cloud services are widely deployed to store and process big data. Organizations that deal with big data, especially large document sets, prefer cloud services for their storage and computational efficiency. However, inefficient processing of a large text corpus is computationally expensive for real-time systems. In addition, efficient memory utilization is important when clustering big data, including large text corpora. Clustering large text corpora is an important component of document retrieval systems such as PubMed. To address these challenges, we present NoCS2 (Number of Cluster and Seed Selection) for efficient topic-based clustering of unstructured big data in the cloud. NoCS2 relies on computing and storage services on a cloud server. Traditional clustering solutions for text datasets use a fixed number of clusters irrespective of the dataset's size and characteristics, such as its subject domain (e.g., science or technology). In contrast, our solution dynamically determines an appropriate number of clusters $k$ based on the characteristics of the dataset. Specifically, we use the trace of a precomputed matrix, which captures the dataset's keywords in a vector representation, as the number of clusters. Then, we build $k$ clusters using topic-based similarity among keywords. Finally, we compare the proposed method with two state-of-the-art clustering methods. Empirical results demonstrate that the average closeness score of NoCS2 is better than that of the other methods for large and sparse datasets.
S. M. Zobaed, Md. Enamul Haque, S. Kaiser, Razin Farhan Hussain