DoR - Division of Research

Scopus Indexed Publications

Paper Details

Title: MELD: A multilingual ethnic dataset of Chakma, Garo, and Marma in Bengali script with English and standard Bengali translation

Author: Mehraj Hossain Mahi, Anzir Rahman Khan, Arif Mahmud, Mayen Uddin Mojumdar, Mobashsher Hasan Anik, Sheak Rashed Haider Noori

Abstract

There are thousands of ethnic groups in the world, contributing to the rich linguistic and cultural diversity of its people. However, the majority of their languages, including more than 30 ethnic languages spoken in Bangladesh, remain severely underrepresented in digital resources and research. There is little to no work addressing the preservation, translation, or computational processing of these languages, despite their unique linguistic structures and speaker populations. To highlight the difficulties faced by low-resource and endangered languages worldwide, this dataset focuses on the three leading ethnic languages, Chakma, Garo, and Marma, along with their corresponding Bengali and English translations. Members of these ethnic groups use the Bengali script to write their own languages on social media platforms such as Facebook and Twitter, as well as in their daily lives. Because of significant linguistic differences, the resulting text is unintelligible to Standard Bengali speakers even though it is written in Bengali script. Moreover, the lack of translation systems and language identification tools indicates the digital exclusion of these communities. This dataset addresses these gaps by documenting sentence-level linguistic samples in Chakma, Garo, and Marma through transliteration, where the phonetics of each language are represented using Bengali script. It also provides meaning-based translations in both Standard Bengali and English, rather than literal word-for-word mappings, to preserve the intended meaning of the original sentences. The dataset is thus a critical resource for advancing Natural Language Processing (NLP) and cultural preservation for low-resource ethnic languages worldwide.

Journal or Conference Name: Data in Brief

Publication Year: 2025

Indexing: Scopus