Synthetic speech, widely known as audio deepfakes (AD), can easily become a dangerous tool in the wrong hands. Social networks are highly vulnerable to deepfake attacks, which can potentially cause social chaos. To limit this harm, the misuse of deepfakes needs to be prevented, and for deepfake audio in particular, detection methods are crucial to curbing its spread. In this study, we proposed a deep learning (DL) framework based on a convolutional neural network (CNN) to detect deepfake Bengali speech. The study addresses two research gaps in this field: the limited number of dedicated studies and the scarcity of Bengali audio datasets. The proposed model was applied to a primary, self-created Bengali audio dataset. The CNN framework achieved the highest score, 98.24%, when compared with a foundational CNN model built on a one-dimensional representation.