Scopus Indexed Publications

Paper Details


Title
FeniVerse: A parallel corpus of Feni dialect, standard Bengali, and English

Author
Mehraj Hossain Mahi, Anzir Rahman Khan, Mayen Uddin Mojumdar, Zesanul Hoque

Abstract

FeniVerse is a trilingual parallel corpus of 4,094 entries, each given in English, Standard Bangla, and the Feni dialect, for a total of 12,282 sentence-aligned sentences (4,094 per language). It addresses the scarcity of datasets and tools for Bangla dialects, particularly the underrepresented Feni dialect, which is spoken by approximately 1.6 million people in southeastern Bangladesh. As the first dataset for this dialect, FeniVerse is a valuable resource for computational linguistics. The corpus is manually curated: each entry is cross-checked by native Feni speakers and regional volunteers to ensure accuracy while preserving phonological, lexical, and syntactic variation. FeniVerse is openly accessible and designed for straightforward integration into natural language processing (NLP) pipelines. Existing AI and NLP models struggle to process the Feni dialect because training data are limited, which makes this dataset important for developing effective language technologies. The dataset supports diverse applications, including dialect identification, machine translation, and cross-linguistic analysis, and it promotes equitable AI development for low-resource languages. FeniVerse thereby contributes to computational linguistics and to the digital preservation of regional Bangla dialects, supporting both academic research and practical AI applications.


Journal or Conference Name
Data in Brief

Publication Year
2025

Indexing
Scopus