Hybrid Model for Speech Emotion Recognition using Mel-Frequency Cepstral Coefficients and Machine Learning Algorithms

Authors

  • Odi Nurdiawan, Informatics Management, STMIK IKMI Cirebon, Indonesia
  • Dian Ade Kurnia, Informatics Management, STMIK IKMI Cirebon, Indonesia
  • Dadang Sudrajat, Informatics Engineering, STMIK IKMI Cirebon, Indonesia
  • Irfan Pratama, Information System, Mercu Buana University Yogyakarta, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.5.5143

Keywords:

Affective Computing, Audio Classification, Decision Tree, K-Nearest Neighbors, MFCC, Speech Emotion Recognition

Abstract

Speech Emotion Recognition (SER) is a subfield of affective computing that focuses on identifying human emotions from voice signals. Accurate emotion classification is essential for building intelligent systems that interact naturally with users, yet challenges such as background noise, overlapping emotional features, and speaker variability often degrade model performance. This study develops a lightweight hybrid SER model that combines Mel-Frequency Cepstral Coefficients (MFCC) as feature representations with three machine learning algorithms: Support Vector Machine (SVM), Decision Tree (DT), and K-Nearest Neighbors (KNN). The methodology involves audio data preprocessing, MFCC-based feature extraction, and classification with the selected algorithms. Four emotion classes (happy, angry, sad, neutral) drawn from the RAVDESS dataset of 1,440 English-language audio samples were used, with an 80/20 train-test split stratified to preserve class balance. Experimental results show that the KNN model achieved the highest performance, with an accuracy of 78.26%, precision of 85.09%, recall of 78.26%, and an F1-score of 77.06%. The Decision Tree model produced comparable results, while the SVM model performed poorly across all metrics. These findings demonstrate that the proposed hybrid approach is effective for recognizing emotions in speech and offers a computationally efficient alternative to deep learning models. Integrating MFCC features with multiple machine learning classifiers provides a robust framework for real-time emotion recognition, especially in environments with limited computing resources.
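As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch of MFCC extraction followed by KNN classification, assuming librosa and scikit-learn as the toolchain. The parameter choices (40 MFCCs averaged over time, k = 5, a stratified 80/20 split) and the file layout are illustrative assumptions, not the authors' exact configuration:

import glob
import os

import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# RAVDESS file names encode the emotion as the third hyphen-separated
# field (e.g. "03-01-05-01-02-01-12.wav" -> code "05" = angry).
EMOTIONS = {"01": "neutral", "03": "happy", "04": "sad", "05": "angry"}

def extract_mfcc(path, n_mfcc=40):
    """Load one clip and return its time-averaged MFCC vector."""
    signal, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one fixed-length feature vector per clip

X, y = [], []
for path in glob.glob("ravdess/**/*.wav", recursive=True):  # hypothetical layout
    code = os.path.basename(path).split("-")[2]
    if code in EMOTIONS:  # keep only the four target emotions
        X.append(extract_mfcc(path))
        y.append(EMOTIONS[code])

# Stratified 80/20 split, mirroring the class-balanced split in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), y, test_size=0.2, stratify=y, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)  # k=5 is an illustrative choice
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

Swapping KNeighborsClassifier for sklearn.svm.SVC or sklearn.tree.DecisionTreeClassifier reproduces the other two branches of the comparison on the same features.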

References

T. Swain, U. Anand, Y. Aryan, S. Khanra, A. Raj, and S. Patnaik, “Performance Comparison of LSTM Models for SER,” in Lecture Notes in Electrical Engineering, 2021, pp. 427–433. doi: 10.1007/978-981-33-4866-0_52.

W. Zeng, Y. Guo, G. He, and J. Zheng, “Research and implementation of an improved CGRU model for speech emotion recognition,” in ACM International Conference Proceeding Series, 2022, pp. 778–782. doi: 10.1145/3548608.3559306.

M. Hussain, S. Abishek, K. P. Ashwanth, C. Bharanidharan, and S. Girish, “Retraction: Feature Specific Hybrid Framework on composition of Deep learning architecture for speech emotion recognition,” J Phys Conf Ser, vol. 1916, no. 1, 2021, doi: 10.1088/1742-6596/1916/1/012094.

S. Lata, N. Kishore, and P. Sangwan, “Sentiment Analysis on Speech Signals: Leveraging MFCC-LSTM Technique for Enhanced Emotional Understanding,” Proceedings on Engineering Sciences, vol. 6, no. 3, pp. 1391–1402, 2024, doi: 10.24874/PES.SI.25.03A.015.

Y. Badr, P. Mukherjee, and S. M. Thumati, “Speech Emotion Recognition using MFCC and Hybrid Neural Networks,” in International Joint Conference on Computational Intelligence, 2021, pp. 366–373. doi: 10.5220/0010707400003063.

S. Padman and D. Magare, “Speech Emotion Recognition using Hybrid Textual Features, MFCC and Deep Learning Technique,” in 7th International Conference on Trends in Electronics and Informatics, ICOEI 2023 - Proceedings, 2023, pp. 1264–1271. doi: 10.1109/ICOEI56765.2023.10125805.

A. Shaik, G. Prabhakar Reddy, R. Vidya, J. Varsha, G. Jayasree, and L. Sriveni, “Hybrid CNN-LSTM Framework for Robust Speech Emotion Recognition,” in Proceedings - International Research Conference on Smart Computing and Systems Engineering, SCSE 2025, 2025. doi: 10.1109/SCSE65633.2025.11031070.

F. Andayani, L. B. Theng, M. T. Tsun, and C. Chua, “Recognition of Emotion in Speech-related Audio Files with LSTM-Transformer,” in 5th International Conference on Computing and Informatics, ICCI 2022, 2022, pp. 87–91. doi: 10.1109/ICCI54321.2022.9756100.

J. Ning and W. Zhang, “Speech-based emotion recognition using a hybrid RNN-CNN network,” Signal Image Video Process, vol. 19, no. 1, 2025, doi: 10.1007/s11760-024-03574-7.

F. Makhmudov, A. Kutlimuratov, and Y.-I. Cho, “Hybrid LSTM–Attention and CNN Model for Enhanced Speech Emotion Recognition,” Applied Sciences (Switzerland), vol. 14, no. 23, 2024, doi: 10.3390/app142311342.

Y. Zhou and X. Xie, “Speech Emotion Recognition Based on 1D-CNNs-LSTM Hybrid Model,” in 2023 3rd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology, CEI 2023, 2023, pp. 220–224. doi: 10.1109/CEI60616.2023.10527889.

F. Andayani, L. B. Theng, M. T. Tsun, and C. Chua, “Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files,” IEEE Access, vol. 10, pp. 36018–36027, 2022, doi: 10.1109/ACCESS.2022.3163856.

A. Namey and K. Akter, “CochleaTion: Speech Emotion Recognition Through Cochleagram with CNN-GRU and Attention Mechanism,” in Proceedings - 6th International Conference on Electrical Engineering and Information and Communication Technology, ICEEICT 2024, 2024, pp. 1118–1123. doi: 10.1109/ICEEICT62016.2024.10534550.

A. Anika Namey, K. Akter, M. A. Hossain, and M. Ali Akber Dewan, “CochleaSpecNet: An Attention-Based Dual Branch Hybrid CNN-GRU Network for Speech Emotion Recognition Using Cochleagram and Spectrogram,” IEEE Access, vol. 12, pp. 190760–190774, 2024, doi: 10.1109/ACCESS.2024.3517733.

C. Suneetha and R. Anitha, “Enhanced Speech Emotion Recognition Using the Cognitive Emotion Fusion Network for PTSD Detection with a Novel Hybrid Approach,” Journal of Electrical Systems, vol. 19, no. 4, pp. 376–398, 2023, doi: 10.52783/jes.644.

C. Sun, L. Ji, and H. Zhong, “Speech Emotion Recognition on Small Sample Learning by Hybrid WGAN-LSTM Networks,” Journal of Circuits, Systems and Computers, vol. 31, no. 4, 2022, doi: 10.1142/S0218126622500736.

A. Islam, M. Foysal, and M. I. Ahmed, “Emotion Recognition from Speech Audio Signals using CNN-BiLSTM Hybrid Model,” in 2024 3rd International Conference on Advancement in Electrical and Electronic Engineering, ICAEEE 2024, 2024. doi: 10.1109/ICAEEE62219.2024.10561755.

I. Baklouti, O. B. Ahmed, R. Baklouti, and C. Fernandez, “Cross-Lingual Transfert Learning for Speech Emotion Recognition,” in 7th IEEE International Conference on Advanced Technologies, Signal and Image Processing, ATSIP 2024, 2024, pp. 559–563. doi: 10.1109/ATSIP62566.2024.10638918.

L. Yue, P. Hu, S.-C. Chu, and J.-S. Pan, “Genetic Algorithm for High-Dimensional Emotion Recognition from Speech Signals,” Electronics (Switzerland), vol. 12, no. 23, 2023, doi: 10.3390/electronics12234779.

L. Yue, P. Hu, S.-C. Chu, and J.-S. Pan, “Multi-Objective Equilibrium Optimizer for Feature Selection in High-Dimensional English Speech Emotion Recognition,” Computers, Materials and Continua, vol. 78, no. 2, pp. 1957–1975, 2024, doi: 10.32604/cmc.2024.046962.

C. A. Kumar, K. A. Sheela, and N. K. Vodnala, “Analysis of Emotions from Speech using Hybrid Deep Learning Network Models,” in 2022 International Conference on Futuristic Technologies, INCOFT 2022, 2022. doi: 10.1109/INCOFT55651.2022.10094442.

J. Bhanbhro, S. Talpur, and A. A. Memon, “Speech Emotion Recognition Using Deep Learning Hybrid Models,” in ICETECC 2022 - International Conference on Emerging Technologies in Electronics, Computing and Communication, 2022. doi: 10.1109/ICETECC56662.2022.10069212.

K. Kaur and P. Singh, “Extraction and Analysis of Speech Emotion Features Using Hybrid Punjabi Audio Dataset,” in Communications in Computer and Information Science, 2023, pp. 275–287. doi: 10.1007/978-3-031-27609-5_22.

S. P. Singh, S. Kumar, S. Verma, and I. Kaur, “Hybrid Approach for Human Emotion Recognition from Speech,” in Proceedings - 2022 4th International Conference on Advances in Computing, Communication Control and Networking, ICAC3N 2022, 2022, pp. 1282–1285. doi: 10.1109/ICAC3N56670.2022.10074492.

H. Li, Y. Zhang, and S. Liu, “AMH-Net: Adaptive Multi-Band Hybrid-Aware Network for Emotion Recognition in Speech,” IEEE Signal Process Lett, vol. 32, pp. 2344–2348, 2025, doi: 10.1109/LSP.2025.3568357.

C. Li, Y. Gu, H. Zhang, L. Liu, H. Lin, and S. Wang, “Hybrid Contrastive Learning Decoupling Speech Emotion Recognition,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2025. doi: 10.1109/ICASSP49660.2025.10889881.

C. Barhoumi and Y. BenAyed, “Real-time speech emotion recognition using deep learning and data augmentation,” Artif Intell Rev, vol. 58, no. 2, 2025, doi: 10.1007/s10462-024-11065-x.

S. M. H. Ali Shuvo and R. Khan, “Bangla Speech-based Emotion Detection using a Hybrid CNN-Transformer Approach,” in Proceedings - 2023 8th International Conference on Communication, Image and Signal Processing, CCISP 2023, 2023, pp. 163–167. doi: 10.1109/CCISP59915.2023.10355685.

A. Marik, S. Chattopadhyay, and P. K. Singh, “A hybrid deep feature selection framework for emotion recognition from human speeches,” Multimed Tools Appl, vol. 82, no. 8, pp. 11461–11487, 2023, doi: 10.1007/s11042-022-14052-y.

S. I. Ahmed, S. M. Sarkar, S. A. Fattah, and M. Saquib, “Classical to Quantum Neural Network Transfer Learning Approach for Speech Emotion Recognition,” in IEEE Region 10 Annual International Conference, Proceedings/TENCON, 2024, pp. 1478–1482. doi: 10.1109/TENCON61640.2024.10902713.

R. Sharma and A. Pradhan, “Implementation of Machine Learning based Optimized Speech Emotion Recognition,” in 2nd International Conference on Automation, Computing and Renewable Systems, ICACRS 2023 - Proceedings, 2023, pp. 1090–1095. doi: 10.1109/ICACRS58579.2023.10405195.

H. Tao, L. Geng, S. Shan, J. Mai, and H. Fu, “Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition,” Entropy, vol. 24, no. 8, 2022, doi: 10.3390/e24081025.

T. Das, M. F. Islam, and N. Mamun, “Attention-based Multi-level Feature Fusion for Multilingual Speech Emotion Recognition,” in 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025, 2025. doi: 10.1109/ECCE64574.2025.11013794.

N. Mobassara, N. Alam, and N. Mamun, “A Comprehensive Review of Speech Emotions Recognition using Machine Learning,” in 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025, 2025. doi: 10.1109/ECCE64574.2025.11013787.

V. S. S. L. D. Janapa, S. K. M. Machiraju, B. A. K. Yekula, L. R. Karri, V. Thanneru, and M. Srinivas, “Bridging the Emotional Gap in AI: A Study on Speech Emotion Recognition for Adaptive Human Computer Interaction,” in 2025 International Conference on Artificial Intelligence and Data Engineering, AIDE 2025 - Proceedings, 2025, pp. 105–111. doi: 10.1109/AIDE64228.2025.10987381.

S. Kour, P. Sharma, A. M. Zargar, A. Sonania, and T. Hassan, “Emotion Recognition from Speech Signals Using Hybrid CNN Model,” in Proceedings - 3rd International Conference on Advancement in Computation and Computer Technologies, InCACCT 2025, 2025, pp. 666–670. doi: 10.1109/InCACCT65424.2025.11011474.

S. Huang, H. Dang, R. Jiang, Y. Hao, C. Xue, and W. Gu, “Multi-layer hybrid fuzzy classification based on svm and improved pso for speech emotion recognition,” Electronics (Switzerland), vol. 10, no. 23, 2021, doi: 10.3390/electronics10232891.

S. Kakuba and D. S. Han, “Speech Emotion Recognition using Context-Aware Dilated Convolution Network,” in APCC 2022 - 27th Asia-Pacific Conference on Communications: Creating Innovative Communication Technologies for Post-Pandemic Era, 2022, pp. 601–604. doi: 10.1109/APCC55198.2022.9943771.

Y. Wang et al., “Multimodal transformer augmented fusion for speech emotion recognition,” Front Neurorobot, vol. 17, 2023, doi: 10.3389/fnbot.2023.1181598.

Published

2025-10-16

How to Cite

[1]
O. Nurdiawan, D. Ade Kurnia, D. Sudrajat, and I. Pratama, “Hybrid Model for Speech Emotion Recognition using Mel-Frequency Cepstral Coefficients and Machine Learning Algorithms”, J. Tek. Inform. (JUTIF), vol. 6, no. 5, pp. 3352–3367, Oct. 2025.