DoR - Division of Research

Title: Advancing Bangla NLP: A Systematic Evaluation of Preprocessing Techniques for Improved Text Classification

Author: Md. Istiak Tanvir, Asma Akter, Md. Sajib Ahammad, Md. Tahmid, Shumona Akter Sraboni, Tanvirul Islam

Abstract: Bangla NLP faces several challenges arising from the language's rich morphology, diverse dialects, and metaphorical expressions. Traditional preprocessing methods offer simple text normalization but do not adequately address these language-specific problems. This work addresses that gap by proposing and comprehensively analyzing a new preprocessing pipeline comprising six techniques: word correction, word splitting, metaphor detection, dialect identification, synonym replacement, and Bangla number-to-text conversion. A comparative evaluation combined the techniques to test whether they improve text classification on a dataset of 67,564 data points containing 19,490,647 words. The findings show that this preprocessing pipeline yields a 14.3% accuracy gain for the Bangla BERT model relative to raw, untreated text. These results offer useful guidance for optimizing Bangla text processing and support further development of Bangla NLP applications.
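
The six-step pipeline and the combination-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The step names, the `STEPS` registry, `build_pipeline`, and `all_combinations` are hypothetical. Five of the six steps are identity placeholders, and `number_to_text` spells out digits one at a time, which is a deliberate simplification of real number-to-text conversion. Enumerating every non-empty subset of steps is only one plausible reading of how the techniques were combined.

```python
import re
from itertools import combinations
from typing import Callable, Dict, Iterator, Sequence, Tuple

Step = Callable[[str], str]

# Bangla digits mapped to their spoken forms.
BANGLA_DIGIT_WORDS: Dict[str, str] = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}


def number_to_text(text: str) -> str:
    """Spell out each Bangla digit individually (simplified: ignores place value)."""
    return re.sub(
        r"[০-৯]+",
        lambda m: " ".join(BANGLA_DIGIT_WORDS[d] for d in m.group()),
        text,
    )


def identity(text: str) -> str:
    """Placeholder for a step whose real logic is not described in the abstract."""
    return text


# Hypothetical registry; the ordering follows the abstract but is an assumption.
STEPS: Dict[str, Step] = {
    "word_correction": identity,
    "word_splitting": identity,
    "metaphor_detection": identity,
    "dialect_identification": identity,
    "synonym_replacement": identity,
    "number_to_text": number_to_text,
}


def build_pipeline(names: Sequence[str]) -> Step:
    """Compose the named steps, applied left to right, into one function."""
    def run(text: str) -> str:
        for name in names:
            text = STEPS[name](text)
        return text
    return run


def all_combinations() -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subset of steps, preserving registry order."""
    names = list(STEPS)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)
```

Each tuple yielded by `all_combinations()` can be passed to `build_pipeline` to preprocess the corpus before training and scoring a classifier, so that the accuracy of each combination can be compared against the raw-text baseline.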