Scopus Indexed Publications

Paper Details


Title
Offensive Language Identification Using Hindi-English Code-Mixed Tweets, and Code-Mixed Data Augmentation
Author
Mominul Islam
Abstract

Code-mixed text classification is challenging because labeled code-mixed datasets are scarce and
pre-trained models for code-mixed text do not exist. This paper presents our HASOC-2021 offensive language
identification results and main findings on the code-mixed (Hindi-English) Subtask 2. In this work, we
propose a new method of code-mixed data augmentation that combines WordNet-based synonym replacement of
Hindi and English words with phonetic conversion of Hinglish (Hindi-English) words. We used a
5.7k pre-annotated HASOC-2021 code-mixed dataset for training and data augmentation. The proposal’s
feasibility was tested with Logistic Regression (LR) as a baseline, a Convolutional Neural Network
(CNN), and BERT, with and without data augmentation. The results were promising, yielding an
increase of almost 3% in classifier accuracy and F1 score compared to the baseline. Our official
submission achieved a 66.56% F1 score and ranked 8th in the competition.

Keywords
Code-mixed Hindi-English, Offensive language identification, Code-mixed Data Augmentation
Journal or Conference Name
CEUR Workshop Proceedings
Publication Year
2021
Indexing
Scopus