Scopus Indexed Publications

Paper Details


Title
Movie Subtitle Document Classification Using Unsupervised Machine Learning Approach
Author
Md. Mehedi Hasan, Imrus Salehin, Sadia Tamim Dip, Sonia Akter, T. M. Kamruzzaman
Abstract
Since the evolution of digital and online text content, automatic document classification has become a significant research issue. One commonly used machine learning approach to this task is unsupervised learning, in which no human interaction or document labelling is required at any point in the procedure. This study presents an approach to movie subtitle document classification using unsupervised machine learning techniques. The dataset was created by collecting almost 500 English movie subtitle files for popular movies on IMDB. Two feature extraction methods were combined with unsupervised machine learning algorithms, and a dimension reduction technique was applied to reduce the dimensionality of the features. The unsupervised techniques used were Bisecting K-Means, K-Means, and the Agglomerative Hierarchical Clustering Algorithm with Average Link, Single Link and Double Link. We found that K-Means and Bisecting K-Means are the best performers among the unsupervised techniques in terms of cluster quality. We addressed the reason for the outliers in the training set and recommend using unsupervised techniques to improve the predefinition of categories and the labelling of textual documents in the training set.
Keywords
subtitle document classification, clustering, unsupervised learning, TF-IDF, BOW
Journal or Conference Name
2021 IEEE 6th International Conference on Computing, Communication and Automation (ICCCA)
Publication Year
2021
Indexing
Scopus
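
Illustrative sketch
The abstract describes a pipeline of feature extraction (BOW, TF-IDF) followed by clustering. The Python sketch below shows one minimal way such a pipeline could look, using only the standard library. It is not the authors' implementation: the function names, the four toy subtitle lines, the smoothed IDF formula, and the deterministic farthest-first seeding are all assumptions made for illustration, and the dimension reduction step, Bisecting K-Means, and the hierarchical variants are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(texts):
    """Bag-of-words (BOW): one term-count mapping per document."""
    return [Counter(tokenize(t)) for t in texts]


def tfidf(bows):
    """Turn term counts into L2-normalised TF-IDF vectors (sparse dicts)."""
    n_docs = len(bows)
    doc_freq = Counter(term for bow in bows for term in bow)
    vectors = []
    for bow in bows:
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1, always positive.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in bow.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def mean_vector(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans(vectors, k, max_iter=50):
    """Plain K-Means (Lloyd's algorithm); returns one cluster label per vector."""
    # Assumption: deterministic farthest-first seeding, used for reproducibility.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical toy "subtitles", invented for this sketch (not from the dataset).
subtitles = [
    "The rocket leaves orbit and the crew repairs the rocket engine",
    "Captain, the engine failed, the crew must leave orbit now",
    "I love you, marry me, my heart is yours",
    "My heart breaks, I love her but she will marry him",
]
labels = kmeans(tfidf(bag_of_words(subtitles)), k=2)
print(labels)
```

On this toy input the two space-themed lines should share one label and the two romance-themed lines the other, because the two groups have no vocabulary in common. Real subtitle files would first need timestamps and markup stripped, and a corpus of about 500 files would more practically be handled with a library such as scikit-learn.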