Voice Cloning Using Transfer Learning with Audio Samples
Abstract
Voice cloning refers to the artificial replication of a specific human voice. Several deep learning approaches to voice cloning were studied and, based on this review, a cloning system is proposed that generates natural-sounding audio from only a few seconds of reference speech from the target speaker. The current study applied transfer learning from a speaker verification task to multi-speaker text-to-speech synthesis. In zero-shot mode, the system synthesizes speech in the voices of different speakers, including speakers not seen during training. The study used latent embeddings to encode speaker-specific information, allowing the remaining model parameters to be shared across all speakers. The speaker modelling stage was decoupled from speech synthesis by training a separate speaker-discriminative encoder network; because the two networks require different types of input, this decoupling allows each to be trained on an independent dataset. When used for zero-shot adaptation to unknown speakers, the embedding-based approach improves speaker similarity. Furthermore, it reduces computational resource requirements, which may be advantageous for use cases that demand low-resource deployment.
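As a minimal sketch of the embedding stage described above, the following PyTorch code shows how a speaker-discriminative encoder of this kind can map a few seconds of reference speech (as a mel spectrogram) to a fixed-size speaker embedding. All class names, layer sizes, and the dummy input are illustrative assumptions rather than the authors' implementation; the LSTM-plus-pooling design mirrors the speaker-verification encoders commonly used in such transfer-learning pipelines (e.g., Jia et al., 2018).

```python
# Hypothetical sketch of a speaker-discriminative encoder (SV2TTS-style).
# Names, shapes, and sizes are assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a mel spectrogram of reference speech to a fixed-size embedding."""
    def __init__(self, n_mels: int = 40, hidden: int = 256, emb_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels)
        frames, _ = self.lstm(mels)
        emb = self.proj(frames.mean(dim=1))  # pool over time
        return F.normalize(emb, dim=1)       # unit-norm speaker embedding

# A few seconds of reference audio yield one embedding vector; the
# synthesizer is then conditioned on it, so unseen speakers need no
# retraining (the zero-shot behaviour described in the abstract).
ref_mels = torch.randn(1, 300, 40)   # dummy input: ~3 s at a 10 ms hop
embedding = SpeakerEncoder()(ref_mels)
print(embedding.shape)               # torch.Size([1, 256])
```

Because the encoder is trained separately, on a speaker verification dataset rather than on transcribed speech, the synthesis network can consume its embeddings without the two models ever sharing training data, which is the decoupling the abstract highlights.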