The advent of pre-trained
language models has ushered in a new era of Natural Language Processing
(NLP). Among these models, Transformer-based models such as BERT have
grown in popularity due to their state-of-the-art effectiveness. However,
these models are concentrated on resource-rich languages, forcing other
languages to rely on multilingual models such as mBERT. Two fundamental
limitations of mBERT become significantly more severe for a low-resource
language like Bangla: it was trained on a limited, curated dataset, and
its weights are shared among all the languages it covers. Moreover,
research on other languages suggests that a language-specific BERT model
outperforms its multilingual counterpart. This paper introduces Bangla-BERT,
a monolingual BERT model for the Bangla language. Despite the limited
data available for Bangla NLP tasks, we pre-train the model on the
largest Bangla language-modeling dataset, BanglaLM, which we constructed
from 40 GB of text data. Bangla-BERT achieves the best results on
all evaluated datasets and substantially improves state-of-the-art
performance in binary linguistic classification, multilabel extraction,
and named entity recognition, outperforming multilingual BERT and
previous work. The pre-trained model is also assessed against
non-contextual models such as Bangla fastText and word2vec on the
downstream tasks. In addition, Bangla-BERT is evaluated through transfer
learning with hybrid deep learning models such as LSTM, CNN, and CRF for
NER, where it again outperforms state-of-the-art methods. The proposed
Bangla-BERT model is evaluated on benchmark
datasets, including BanFakeNews, Sentiment Analysis on Bengali News
Comments, and Cross-lingual Sentiment Analysis in Bengali, and it
surpasses the prior state-of-the-art results by 3.52%, 2.2%, and 5.3%,
respectively.