Improving Infant Cry Recognition Using MFCC and CNN-Based Audio Augmentation

Authors

  • Nuk Ghurroh Setyoningrum, Department of Informatics, Universitas Amikom Yogyakarta, Indonesia
  • Ema Utami, Department of Informatics, Universitas Amikom Yogyakarta, Indonesia
  • Kusrini, Department of Informatics, Universitas Amikom Yogyakarta, Indonesia
  • Ferry Wahyu Wibowo, Department of Informatics, Universitas Amikom Yogyakarta, Indonesia

DOI:

https://doi.org/10.52436/1.jutif.2025.6.2.4373

Keywords:

Audio Data Augmentation, Cry Pattern Classification, Infant Cry Recognition, MFCC Feature Extraction, Speech Signal Processing

Abstract

Recognizing infant cries is essential for understanding a baby's needs; however, previous research has struggled with imbalanced datasets and limited feature extraction techniques. Conventional CNN-based methods trained without data augmentation often failed to classify minority classes such as belly pain, burping, and discomfort, producing biased models that predominantly recognized majority classes. This study proposes an MFCC-based data augmentation pipeline that incorporates time stretching, pitch scaling, noise addition, polarity inversion, and random gain adjustment to increase dataset diversity and improve model generalization. Applying this pipeline expanded the dataset from 457 to 8,683 samples, on which a CNN with three convolutional layers, ReLU activations, and max pooling was trained for cry pattern classification. Accuracy improved substantially, from 78% to 98%, and F1-scores for the minority classes rose from 0.00 to above 0.90, confirming that augmentation effectively addresses dataset imbalance. By demonstrating the role of data augmentation in improving cry classification performance, this work contributes to audio signal processing and deep learning for healthcare applications. Future directions include integrating multimodal data (visual and physiological signals), exploring advanced deep learning architectures, and developing real-time applications for smart baby monitoring systems.
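The abstract names five waveform-level augmentations but no implementation details. As a concrete illustration, the following is a minimal sketch of those five operations using librosa and NumPy; the parameter ranges are illustrative assumptions, not the settings reported in the paper.

```python
# Hypothetical sketch of the five augmentations named in the abstract.
# Parameter ranges are assumptions, not the paper's reported settings.
import numpy as np
import librosa

rng = np.random.default_rng(0)

def time_stretch(y, rate=None):
    # Speed the cry up or slow it down without changing its pitch.
    if rate is None:
        rate = rng.uniform(0.8, 1.2)
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_scale(y, sr, n_steps=None):
    # Shift the pitch by a few semitones while preserving duration.
    if n_steps is None:
        n_steps = rng.uniform(-2.0, 2.0)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def add_noise(y, noise_factor=0.005):
    # Mix in low-level white noise.
    return y + noise_factor * rng.standard_normal(len(y))

def invert_polarity(y):
    # Flip the waveform's sign; inaudible, but the raw samples differ.
    return -y

def random_gain(y, low=0.5, high=1.5):
    # Scale the overall loudness by a random factor.
    return y * rng.uniform(low, high)
```

Each raw recording can be passed through one or more of these functions before feature extraction, which is how a corpus of 457 clips can be grown into thousands of samples. The classifier itself, per the abstract, is a CNN with three convolutional layers, ReLU activation, and max pooling trained on MFCC features; the Keras sketch below assumes the filter counts, kernel sizes, MFCC dimensions, and a five-class output, since the page does not list the exact hyperparameters.

```python
# Hypothetical MFCC front end and CNN matching the abstract's outline
# (three conv layers, ReLU, max pooling). Layer widths, kernel sizes,
# MFCC settings, and num_classes are assumptions.
import numpy as np
import librosa
import tensorflow as tf

def mfcc_features(y, sr, n_mfcc=40, max_frames=128):
    # Compute an MFCC matrix and pad/trim it to a fixed number of
    # frames so every clip yields the same input shape.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = mfcc[:, :max_frames]
    pad = max_frames - mfcc.shape[1]
    if pad > 0:
        mfcc = np.pad(mfcc, ((0, 0), (0, pad)))
    return mfcc[..., np.newaxis]  # shape: (n_mfcc, max_frames, 1)

def build_cnn(input_shape=(40, 128, 1), num_classes=5):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer class labels are assumed, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```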

Published

2025-05-17

How to Cite

[1] N. G. Setyoningrum, E. Utami, K. Kusrini, and F. W. Wibowo, “Improving Infant Cry Recognition Using MFCC and CNN-Based Audio Augmentation,” J. Tek. Inform. (JUTIF), vol. 6, no. 2, pp. 995–1016, May 2025.