Early and accurate segmentation of oral cancer is essential for timely diagnosis and treatment. Traditional methods such as visual inspection and biopsy are often subjective and costly, which can hinder early detection. To improve segmentation accuracy for both binary and multiclass tasks, we propose a transformer-based ensemble model that combines the Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), Swin Transformer, and BEiT. The ensemble leverages self-attention mechanisms for richer feature extraction and spatial representation. Our study employs two datasets: the MOD dataset (463 images of oral diseases) and a histopathological dataset (1,224 images of oral squamous cell carcinoma and normal epithelium). We applied extensive preprocessing and augmentation techniques, including grayscale conversion, binary thresholding, and Contrast Limited Adaptive Histogram Equalization (CLAHE), to enhance image quality and improve model generalization. In the performance evaluation, the ensemble outperformed the individual architectures, achieving an Intersection over Union (IoU) of 0.9601 and a Dice coefficient of 0.9598 for binary segmentation, and an IoU of 0.9587 and a Dice coefficient of 0.9575 for multiclass segmentation. A comparative analysis with state-of-the-art models confirmed the effectiveness of our approach. These results demonstrate the potential of transformer-based ensemble learning for oral cancer diagnosis and offer a scalable tool for clinical applications. Future work will focus on expanding dataset diversity, optimizing computational efficiency, and integrating real-time inference to improve usability in healthcare settings.
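To make the described pipeline concrete, the sketch below illustrates the preprocessing steps named in the abstract (grayscale conversion, CLAHE, binary thresholding) together with the two reported evaluation metrics, IoU and Dice. It is a minimal illustration using OpenCV and NumPy; the clip limit, tile grid, and threshold values are placeholder defaults and are not the exact settings used in the study.

```python
import cv2
import numpy as np


def preprocess(image_bgr, clip_limit=2.0, tile_grid=(8, 8), thresh=127):
    """Grayscale -> CLAHE contrast enhancement -> binary threshold.

    clip_limit, tile_grid, and thresh are illustrative defaults,
    not the settings reported in the paper.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    enhanced = clahe.apply(gray)
    _, binary = cv2.threshold(enhanced, thresh, 255, cv2.THRESH_BINARY)
    return enhanced, binary


def iou_score(pred, target, eps=1e-7):
    """Intersection over Union for binary masks (values in {0, 1})."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)


def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for binary masks (values in {0, 1})."""
    intersection = np.logical_and(pred, target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

For multiclass segmentation, the same two metrics are typically computed per class and averaged; the abstract reports single IoU and Dice values per task, so the exact averaging scheme is left unspecified here.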