DoR - Division of Research

Title: BAAD: A multipurpose dataset for automatic Bangla offensive speech recognition

Abstract: Despite being the fifth most spoken native language in the world, Bangla has received little attention in the domain of audio and speech recognition. This article presents a speech dataset of Bengali abusive words, together with some non-abusive words that are very close to the abusive ones. The dataset is multipurpose and is intended for automatic recognition of slang speech in Bangla; it was prepared through collection, annotation, and refinement of data. It consists of 6100 audio clips covering 114 slang words and 43 non-slang words. Sixty native speakers recorded the slang words and 23 native speakers recorded the non-abusive words, speaking in various dialects from over 20 districts of Bangladesh. Ten university students participated in evaluating the dataset, including annotation and refinement. Researchers can use this dataset to develop an automatic Bengali slang speech recognition system, and it can also serve as a new benchmark for speech recognition-based machine learning models. The dataset can be enriched further. Some clips contain background noise, which can be retained to simulate a more real-world scenario or removed if desired.

Keywords: Offensive speech, Bangla offensive speech, Speech recognition, Multipurpose dataset