DoR - Division of Research

Title: Analyzing and Mitigating Linguistic Bias in Bangla Natural Language Processing

Abstract: Bangla (Bengali) is among the world’s most widely spoken languages, yet no prior research has examined model bias in Bangla NLP. This paper empirically investigates model bias in Bangla text classification using three established fairness measures: Demographic Parity, Equalized Odds, and Accuracy Parity. We compare baseline machine learning models (Support Vector Machine, Random Forest) with deep models (LSTM and Bangla-BERT) on a large, publicly available sentiment analysis dataset. Every model exhibits quantifiable bias, with Bangla-BERT showing the highest baseline fairness. We show that targeted bias reduction methods substantially improve fairness for all models. A fine-grained error analysis indicates that misclassifications reflect skewed social and cultural assumptions embedded in the imbalanced class distribution of the data. This paper presents a comparative study of bias in Bangla NLP models and offers practical recommendations for building fairer and more ethical AI systems for low-resource languages.
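
The three fairness measures named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes binary sentiment labels (1 = positive, 0 = negative), a single hypothetical sensitive attribute with group labels "a" and "b", and reports each measure as the gap between the best and worst group (0 means parity). The function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def _rate(numerator, denominator):
    # Return 0.0 for empty denominators so small groups do not raise errors.
    return numerator / denominator if denominator else 0.0


def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, TPR, FPR, and accuracy for binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" marks a correct/incorrect prediction, "p"/"n" the predicted class.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        rates[g] = {
            "positive_rate": _rate(c["tp"] + c["fp"], n),
            "tpr": _rate(c["tp"], c["tp"] + c["fn"]),
            "fpr": _rate(c["fp"], c["fp"] + c["tn"]),
            "accuracy": _rate(c["tp"] + c["tn"], n),
        }
    return rates


def fairness_gaps(y_true, y_pred, groups):
    """Gap (max minus min across groups) for each fairness measure."""
    rates = group_rates(y_true, y_pred, groups)

    def gap(metric):
        values = [r[metric] for r in rates.values()]
        return max(values) - min(values)

    return {
        # Demographic Parity: groups receive positive predictions at equal rates.
        "demographic_parity_gap": gap("positive_rate"),
        # Equalized Odds: groups share both TPR and FPR; report the larger gap.
        "equalized_odds_gap": max(gap("tpr"), gap("fpr")),
        # Accuracy Parity: groups are classified with equal accuracy.
        "accuracy_parity_gap": gap("accuracy"),
    }


# Toy data (hypothetical): four examples per group.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gaps = fairness_gaps(y_true, y_pred, groups)
print(gaps)
```

On this toy data both groups are classified with the same accuracy, yet group "a" receives positive predictions more often and has different error rates, so Accuracy Parity holds while Demographic Parity and Equalized Odds do not. This is one reason to report the three measures together.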