DoR - Division of Research

Scopus Indexed Publications

Paper Details

Title: BanglaVerb: A sentence-level dataset for transitivity classification in Bangla NLP

Authors: Zannatul Mawa Koli, Aliza Ahmed Khan, Md. Jahidul Alam, Zakia Sultana

Abstract: This article presents BanglaVerb, a systematically curated and linguistically validated sentence-level dataset designed to support transitivity classification in Bangla. The dataset contains 3001 Bangla sentences, each centered on a single verb instance annotated as either transitive (1634) or intransitive (1367). It was developed to address the lack of verb-focused linguistic resources for Bangla, a morphologically rich but under-resourced language in the NLP domain. Sentences were collected from diverse public sources, standardized, and carefully cleaned to ensure textual integrity. Annotation combined rule-based pre-labeling with expert linguistic verification, resulting in a 92% majority-voting agreement among annotators, which reflects high labeling consistency and reliability. Beyond its annotation framework, the dataset provides detailed lexical and structural statistics, including vocabulary size, character length distributions, n-gram patterns, and frequency distributions that follow Zipf’s law, confirming its linguistic representativeness. Baseline experiments using multiple machine learning models demonstrate strong classification performance, indicating the dataset’s clarity and robustness. By bridging sentence-level structure with verb semantics, BanglaVerb offers a high-quality, openly accessible resource that can support a wide range of downstream applications, including lemmatization, morphological analysis, syntactic parsing, semantic role labeling, and the development of verb-aware language models for Bangla.
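As a quick sanity check on the figures quoted in the abstract, the sketch below recomputes the class shares from the two reported counts and derives the accuracy a trivial "always predict the majority class" classifier would reach. This is only illustrative arithmetic on the numbers stated above; the variable names are assumptions, and nothing here reproduces the authors' baseline experiments.

```python
# Class counts as reported in the abstract (not read from the dataset itself).
counts = {"transitive": 1634, "intransitive": 1367}

# Total number of sentences implied by the two class counts.
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class would reach
# this accuracy; any useful model should clearly exceed it.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {counts[label]} sentences ({share:.1%})")
print(f"total: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The counts sum to the 3001 sentences stated in the abstract, and the split is roughly 54% transitive to 46% intransitive, so the reported baseline results are best read against a majority-class floor of about 54%.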

Journal or Conference Name: Data in Brief

Publication Year: 2026

Indexing: Scopus