Scopus Indexed Publications
Paper Details
- Title
-
Movie Subtitle Document Classification Using Unsupervised Machine Learning Approach
- Author
-
Md. Mehedi Hasan,
Imrus Salehin,
Sadia Tamim Dip,
Sonia Akter,
T. M. Kamruzzaman
- Email
-
- Abstract
-
Since the evolution of digital
and online text content, automatic document classification has become a
significant research issue. A commonly used machine learning approach to
this task is unsupervised learning, in which no human interaction or
document labelling is required at any point in the procedure. This study
presents an approach for movie subtitle document classification using an
unsupervised machine learning technique. The dataset was created by
collecting almost 500 English movie subtitle files for popular movies on
IMDB. Two feature extraction methods were combined with unsupervised
machine learning algorithms, and a dimension reduction technique was
applied to reduce the dimensionality of the feature space. As
unsupervised machine learning techniques, we used Bisecting K-Means,
K-Means and Agglomerative Hierarchical Clustering with average-link,
single-link and double-link criteria. We found that K-Means and
Bisecting K-Means performed best among the unsupervised techniques in
terms of cluster quality. We discuss the reasons for outliers in the
training set and recommend using unsupervised techniques to improve the
predefinition of categories and the labelling of textual documents in
the training set.
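The pipeline summarised in the abstract (text features followed by clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy "subtitle" strings are invented for illustration, the smoothed IDF formula is one common convention chosen here as an assumption, the initial centroids are fixed by hand (`init_indices` is a hypothetical parameter) to keep the run deterministic, and the dimension reduction step, Bisecting K-Means and hierarchical clustering are omitted. It uses only the Python standard library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for raw text docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF (an assumed convention, not taken from the paper).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, max_iter=50):
    """Plain K-Means; centroids start at the vectors named by init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy documents standing in for subtitle files.
docs = [
    "the spaceship left the planet and the alien crew followed",
    "the alien spaceship landed on the red planet",
    "the detective found the murder weapon near the victim",
    "the detective questioned the murder suspect about the victim",
]

vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

On this toy input the two science-fiction sentences end up in one cluster and the two crime sentences in the other. Replacing the TF-IDF weights with raw term counts would give a bag-of-words (BOW) variant of the same sketch.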
- Keywords
-
subtitle document classification, clustering, unsupervised learning, TF-IDF, BOW
- Journal or Conference Name
- 2021 IEEE 6th International Conference on Computing, Communication and Automation (ICCCA)
- Publication Year
-
2021
- Indexing
-
Scopus