Paper Details

Title: BanglaRegionalTextCorpus: A curated dataset for four regional Bangla dialects with standard Bangla and English translation

Author: Taslima Akhter, Umme Ayman, Zannatul Mawa Koli

Abstract

The BanglaRegionalTextCorpus is a curated dataset documenting four regional Bangla dialects (Rangpur, Barisal, Narail, and Khulna) along with their corresponding Standard Bangla and English translations. The corpus contains 4,653 manually validated sentences collected from community interactions, field recordings, and publicly available digital sources. Rigorous pre-processing steps, including duplicate removal, normalization, and linguistic validation by native speakers, were employed to ensure data accuracy and consistency. The dataset serves as a resource for dialect identification, machine translation, and text classification, as well as for research in sociolinguistics and regional language variation. By capturing phonetic, lexical, and syntactic distinctions across the four dialects, it enables the development of inclusive and context-aware NLP models for low-resource languages. The dataset also supports comparative linguistic studies between regional and standardized Bangla, contributing to the preservation and computational representation of dialectal diversity. The BanglaRegionalTextCorpus provides a benchmark resource for future research in Bangla NLP, promoting collaboration, cultural preservation, and equitable language technology development across diverse linguistic communities.

Journal or Conference Name: Data in Brief

Publication Year: 2026

Indexing: Scopus