Scopus Indexed Publications

Paper Details


Title
BanglaTense: A large-scale dataset of Bangla sentences categorized by tense: Past, present, and future
Author
Md. Hasan Imam Bijoy, Mr. Md. Monarul Islam, Umme Ayman,
Abstract

Bengali (Bangla), an Indo-Aryan language, has a complex grammatical structure in which tense plays a central role, making tense identification important for natural language processing (NLP) applications such as text classification, machine translation, and sentiment analysis. BanglaTense is a large-scale dataset of Bangla sentences categorized by tense: past, present, and future. Addressing the resource gap in Bangla NLP, it provides 17,819 annotated sentences: 5,629 in the past tense, 6,101 in the present tense, and 6,089 in the future tense. The dataset serves as a benchmark for evaluating NLP models on Bangla sentence classification, with near-balanced representation across the three categories, and promotes linguistic diversity and inclusive language models. Preprocessing steps, including anonymization and duplicate removal, were applied to enhance data quality. Three native Bangla speakers independently assessed the tense labels of the sentences to ensure the dataset's reliability. BanglaTense is designed to advance NLP research and development for Bangla, with applications in tense detection, text classification, language modeling, and educational tools. By providing a foundation for temporal analysis of Bangla sentences, it supports linguistic study and the development of precise, context-aware NLP models. The dataset is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community.
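
As a quick check of the class balance reported in the abstract, the per-tense counts can be summed and converted to shares. The following is a minimal Python sketch; the variable names are illustrative, and the counts are copied from the abstract rather than read from the dataset files.

```python
# Per-tense sentence counts as reported in the abstract.
counts = {"past": 5629, "present": 6101, "future": 6089}

total = sum(counts.values())

# Share of each tense class, as a percentage of all sentences.
shares = {tense: 100 * n / total for tense, n in counts.items()}

# Ratio of the largest to the smallest class; 1.0 would mean perfect balance.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)
for tense, share in shares.items():
    print(f"{tense}: {share:.1f}%")
print(f"imbalance ratio: {imbalance_ratio:.3f}")
```

The three counts sum to 17,819, matching the stated total, and the largest class exceeds the smallest by less than 10%.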

Journal or Conference Name
Data in Brief
Publication Year
2025
Indexing
Scopus