Evaluation Study for Worthwhile Research in Artificial Intelligence Techniques for Tongue Movement Estimation

Authors

DOI:

https://doi.org/10.24237/

Keywords:

Tongue Movements, Deep Learning, Real-Time Video, Speech Processing, Tongue Contour

Abstract

The introduction of deep learning has brought about significant changes in the field of speech processing. By stacking many processing layers, models have been developed that can estimate tongue motions and extract complex information from speech data. This review surveys the main deep learning models and their applications to tongue movement estimation from real-time video sequences. To assess the relevant literature, a review was performed of papers published between 2017 and 2023 that applied deep learning techniques pertinent to this topic. Relevant articles were retrieved through searches in Google Scholar, IEEE Xplore, and Scopus; after each article was screened in detail, 25 of those retrieved met the inclusion criteria. The findings highlight a significant obstacle to improving deep learning network performance: the lack of a dataset of real-time video sequences of tongue movements. Such a dataset is essential for developing automatic speech processing and high-accuracy estimation of tongue movements.

Published

2025-07-31

Issue

Vol. 3 No. 3 (2025)

Section

Articles

How to Cite

Evaluation Study for Worthwhile Research in Artificial Intelligence Techniques for Tongue Movement Estimation. (2025). Academic Science Journal, 3(3), 106-135. https://doi.org/10.24237/