Integration of BERT-VAD, MFCC-Delta, and VGG16 in Transformer-Based Fusion Architecture for Multimodal Emotion Classification
DOI:
https://doi.org/10.52436/1.jutif.2025.6.4.4915

Keywords:
BERT, MFCC, Multimodal Emotion, NRC-VAD, Transformer-Based Fusion, VGG16

Abstract
Emotion plays an important role in human interaction and is a central focus of research on intelligent systems that exploit multimodal data. Previous studies have classified emotions from multimodal inputs, but the results remain suboptimal because they do not capture the complexity of human emotion as a whole: even when several modalities are used, the chosen feature extraction methods and the fusion process are often not well matched to the goal of improving accuracy. This study classifies emotions and improves accuracy through a multimodal approach built on Transformer-based fusion. The data combine three modalities: text (encoded with BERT and enriched with the NRC-VAD affective dimensions of Valence, Arousal, and Dominance), audio (MFCC with delta and delta-delta features extracted from the RAVDESS and TESS datasets), and images (VGG16 features extracted from the FER-2013 dataset). The model maps each modality's features into a representation of identical dimensionality and processes them through a Transformer block to model feature-level interactions between modalities. Classification is performed by a dense layer with softmax activation. The model was evaluated with Stratified K-Fold Cross-Validation (k = 10) and achieved 95% accuracy in the ninth fold, a significant improvement over previous feature-level fusion work (73.55%), underlining the effectiveness of this combination of feature extraction and Transformer-based fusion. The study contributes to emotion-aware systems in informatics, enabling more adaptive, empathetic, and intelligent human-computer interaction in practical applications.
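As a concrete illustration of the text branch described above, the sketch below extracts a BERT sentence representation and concatenates it with NRC-VAD valence, arousal, and dominance scores averaged over the words of an utterance. The use of the [CLS] embedding, the lexicon lookup, and the neutral default of 0.5 are assumptions for illustration, not the authors' exact procedure.

```python
# Hypothetical sketch of the text branch: BERT [CLS] embedding + mean NRC-VAD scores.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_features(sentence: str, vad_lexicon: dict) -> np.ndarray:
    """Return a 771-d vector: 768-d BERT [CLS] embedding + 3 VAD dimensions."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls_vec = bert(**inputs).last_hidden_state[:, 0, :].squeeze(0).numpy()
    # Average valence/arousal/dominance over words found in the NRC-VAD lexicon;
    # 0.5 is used as a neutral default for out-of-lexicon words (assumption).
    words = sentence.lower().split()
    vad = np.mean([vad_lexicon.get(w, (0.5, 0.5, 0.5)) for w in words], axis=0)
    return np.concatenate([cls_vec, vad])
```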
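The audio branch can be approximated with librosa: MFCCs plus their first and second derivatives (delta and delta-delta), pooled over time into a fixed-length vector. The number of coefficients and the mean pooling are illustrative choices, not settings confirmed by the paper.

```python
# Hypothetical sketch of the audio branch: MFCC + delta + delta-delta, time-pooled.
import numpy as np
import librosa

def extract_audio_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc)                      # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)            # second derivative
    stacked = np.concatenate([mfcc, delta, delta2], axis=0)  # (3 * n_mfcc, frames)
    return stacked.mean(axis=1)                              # fixed-length vector

# Example with a hypothetical RAVDESS/TESS file name:
# audio_vec = extract_audio_features("03-01-05-01-01-01-01.wav")
```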
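For the image branch, a pre-trained VGG16 can serve as a feature extractor; FER-2013 faces (48x48 grayscale) are resized and replicated to three channels before the convolutional features are pooled into a vector. Whether the authors fine-tuned VGG16 or used a different layer is not stated in the abstract, so this is an assumption.

```python
# Hypothetical sketch of the image branch: pooled VGG16 convolutional features.
import torch
import torchvision.transforms as T
from torchvision.models import vgg16, VGG16_Weights
from PIL import Image

backbone = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()  # conv layers only
preprocess = T.Compose([
    T.Grayscale(num_output_channels=3),   # FER-2013 images are grayscale
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_features(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path)).unsqueeze(0)   # (1, 3, 224, 224)
    with torch.no_grad():
        fmap = backbone(img)                          # (1, 512, 7, 7)
    return fmap.mean(dim=[2, 3]).squeeze(0)           # 512-d pooled vector
```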
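The fusion stage maps each modality's feature vector into a common dimensionality, treats the three projections as a short token sequence, and lets a Transformer encoder model the feature-level interactions before a dense softmax classifier. The sketch below is a minimal PyTorch interpretation; the layer sizes, number of heads and layers, and the seven-class output are assumptions rather than the published configuration.

```python
# Minimal sketch of Transformer-based feature-level fusion (assumed sizes).
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, text_dim=771, audio_dim=120, image_dim=512,
                 d_model=256, num_heads=4, num_layers=2, num_classes=7):
        super().__init__()
        # Project every modality to the same d_model so they can interact as tokens.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, d_model),
            "audio": nn.Linear(audio_dim, d_model),
            "image": nn.Linear(image_dim, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text, audio, image):
        tokens = torch.stack([self.proj["text"](text),
                              self.proj["audio"](audio),
                              self.proj["image"](image)], dim=1)  # (B, 3, d_model)
        fused = self.encoder(tokens).mean(dim=1)                  # pool over modalities
        return self.classifier(fused)  # logits; softmax applied in the loss / at inference
```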
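Evaluation follows stratified k-fold cross-validation with k = 10, which preserves the class distribution in every fold; the per-fold accuracy reported in the abstract (95% in the ninth fold) comes from such a loop. In the sketch below, train_fn and the feature and label arrays are placeholders for the pipeline outlined above.

```python
# Hypothetical sketch of the Stratified K-Fold evaluation protocol (k = 10).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_stratified_kfold(X, y, train_fn, k=10, seed=42):
    """train_fn(X_train, y_train) must return an object with a .predict() method."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    accuracies = []
    for fold, (tr, te) in enumerate(skf.split(X, y), start=1):
        model = train_fn(X[tr], y[tr])
        acc = float(np.mean(model.predict(X[te]) == y[te]))
        accuracies.append(acc)
        print(f"Fold {fold}: accuracy = {acc:.4f}")
    return float(np.mean(accuracies))
```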
License
Copyright (c) 2025 Fisan Syafa Nayoma, Kusnawi

This work is licensed under a Creative Commons Attribution 4.0 International License.