Background: Breast cancer is the second most frequent malignancy among
women after skin cancer and is initiated by unregulated cell division
in breast tissue. Although early mammogram
screening and treatment result in decreased mortality, differentiating
cancer cells from surrounding tissues is often error-prone, resulting in
misdiagnosis. Method: A feature extraction-based approach with machine
learning (ML) algorithms is introduced to categorize breast cancer in a
mammography dataset into four classes with low computational
complexity. After artefact removal and preprocessing
of the mammograms, the dataset is augmented with seven augmentation
techniques. The region of interest (ROI) is extracted by employing
several algorithms, including a dynamic thresholding method. Sixteen
geometrical features are extracted from the ROI, and eleven ML
algorithms are evaluated on these features. Three ensemble models
are then built from these ML models by stacking: the first stacks the
models with an accuracy above 90%, while the second and third stack the
models with accuracies above 95% and 96%, respectively. Five feature
selection methods with fourteen configurations are applied to further
improve performance.
Results: The Random Forest Importance algorithm, with a threshold of
0.045, selects 10 features that achieve the highest performance, a test
accuracy of 98.05%, obtained by stacking the Random Forest and XGB
classifiers, each of which has an accuracy above 96%. Furthermore, with K-fold
cross-validation, consistent performance is observed across all K values
ranging from 3 to 30. Overall, the proposed strategy combining image
processing, feature extraction and ML achieves high accuracy in
classifying breast cancer.
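
The selection and stacking steps summarized above can be illustrated with a
minimal scikit-learn sketch. It is not the authors' implementation and rests
on several assumptions: the data are synthetic (16 features, four classes)
rather than mammogram features, scikit-learn's GradientBoostingClassifier is
swapped in for the XGB classifier, the meta-learner is a logistic regression,
and the split ratio, tree counts and K values are arbitrary. Only the 0.045
importance threshold is taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# ASSUMPTION: synthetic stand-in for 16 geometric features and four classes.
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=8, n_redundant=4,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Keep the features whose Random Forest importance is at least 0.045.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=0.045,
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
n_selected = X_train_sel.shape[1]

# Stack two tree-based models.
# ASSUMPTION: gradient boosting replaces XGB; logistic regression is the
# meta-learner; neither choice is confirmed by the text.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train_sel, y_train)
test_accuracy = stack.score(X_test_sel, y_test)

# K-fold cross-validation for two illustrative K values. The selector was
# fitted on the whole training split, so these scores are slightly optimistic.
cv_scores = {
    k: cross_val_score(stack, X_train_sel, y_train, cv=k).mean()
    for k in (3, 5)
}

print(f"selected features: {n_selected}")
print(f"test accuracy: {test_accuracy:.3f}")
print(f"mean CV accuracy by K: {cv_scores}")
```

The numbers printed by this sketch describe the synthetic data only and are
not comparable with the reported 98.05%. Wrapping the selector and the stacked
model in a single Pipeline would refit the selection inside each fold and give
a less biased cross-validation estimate.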