Research in video compression has advanced significantly in recent years. However, existing deep learning-based algorithms still suffer from inaccurate motion compression and ineffective motion compensation architectures, yielding compression errors and an inferior rate–distortion trade-off. To overcome these challenges, we present an end-to-end, purely deep learning-based video compression method that redesigns each of the primary operations: motion estimation, motion compression, motion compensation, residual compression, and artifact contraction. A deep residual attention split (DRAS) block is introduced into the motion compression network to attend to salient image regions, producing more effective features for the decoder while improving rate–distortion optimization (RDO) efficiency. A channel residual block (CRB) is proposed for motion compensation to yield a more accurate predicted frame and, in turn, a better residual frame. To mitigate compression errors, an artifact contraction module (ACM) built on a residual Swin-convolution UNet block is included to improve reconstruction quality. To further improve the final frame, a buffer is added to fine-tune the previous reference frames. These modules are trained jointly with a loss function that balances the rate–distortion trade-off and enhances decoded video quality. A comprehensive ablation study demonstrates the effectiveness of the proposed blocks and modules, and experimental results show competitive performance on four benchmark datasets.
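The pipeline of primary operations listed above can be sketched schematically. The following is a minimal, hypothetical NumPy sketch of one inter-frame coding step under the standard motion-then-residual structure; every function name and body here is an illustrative placeholder (zero flow, uniform quantization, identity restoration), not the paper's actual DRAS, CRB, or ACM networks.

```python
import numpy as np

# Hypothetical sketch of the coding loop: motion estimation -> motion
# compression -> motion compensation -> residual compression -> artifact
# contraction. All bodies are crude stand-ins, not learned networks.

def estimate_motion(ref, cur):
    # Stand-in for a learned optical-flow network: a per-pixel 2-channel
    # flow field (zeros here, so compensation reduces to copying ref).
    return np.zeros((*cur.shape, 2), dtype=np.float32)

def compress(x, step=0.1):
    # Stand-in for a learned encoder/decoder pair: uniform quantization
    # as a crude proxy for lossy compression with bounded distortion.
    return np.round(x / step) * step

def compensate(ref, flow):
    # Stand-in for motion compensation: with zero flow this is just the
    # reference frame (real warping, e.g. bilinear, is omitted).
    return ref

def contract_artifacts(frame):
    # Stand-in for an artifact-reduction module: identity here.
    return frame

def codec_step(ref, cur):
    # One inter-frame step: compress the motion, predict the current
    # frame, compress the prediction residual, then restore.
    flow = compress(estimate_motion(ref, cur))
    pred = compensate(ref, flow)
    residual = compress(cur - pred)
    return contract_artifacts(pred + residual)

ref = np.random.rand(8, 8).astype(np.float32)
cur = np.random.rand(8, 8).astype(np.float32)
rec = codec_step(ref, cur)
# With quantization step 0.1, per-pixel error is bounded by step / 2.
print(float(np.abs(rec - cur).max()))
```

The sketch only illustrates the data flow: the learned components in the paper replace each placeholder, and the joint loss trades the bitrate of the two `compress` calls against the distortion of the reconstruction.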