Voice Cloning Using Transfer Learning with Audio Samples

  • Usman Nawaz, Department of Engineering, University of Palermo, Palermo, Italy
  • Usman Ahmed Raza, Department of Computer Science, University of Engineering and Technology, Lahore, Pakistan
  • Amjad Farooq, Department of Computer Science, University of Engineering and Technology, Lahore, Pakistan
  • Muhammad Junaid Iqbal, Department of Data Science, University of Rome, Tor Vergata, Rome, Italy
  • Ammara Tariq, Department of Biochemistry and Biotechnology, University of Gujrat, Gujrat, Pakistan
Keywords: artificial intelligence, audio cloning, machine learning, natural language processing, text to speech, voice recognition



Voice cloning refers to the artificial replication of a specific human voice. The current study reviewed several deep learning approaches to voice cloning and, on that basis, proposed a cloning system that generates natural-sounding audio from only a few seconds of source speech from the target speaker. The system applies transfer learning from a speaker verification task to multi-speaker text-to-speech synthesis. In zero-shot mode, it synthesizes speech in the voices of different speakers, including speakers who were not seen during training. Latent embeddings encode speaker-specific information, which allows the remaining model parameters to be shared across all speakers. Speaker modelling is separated from voice synthesis by training a separate speaker-discriminative encoder network. Because the two networks require different types of input, decoupling them allows each to be trained on a separate dataset. When used for zero-shot adaptation to unseen speakers, the embedding-based technique improves speaker similarity. It also reduces computational resource requirements, which may benefit use cases that call for low-resource deployment.
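The separation between speaker modelling and synthesis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the averaging of per-frame vectors as a stand-in for a trained speaker encoder, and the choice to condition the synthesizer by appending the embedding to every text-encoder state are all assumptions made here for illustration. The sketch only shows how a fixed-length speaker embedding computed from a short reference utterance could be compared across speakers and passed to a synthesizer that was never trained on that speaker.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def speaker_embedding(frame_vectors: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical stand-in for a trained speaker encoder.

    Averages the per-frame vectors of one reference utterance and
    L2-normalizes the result, giving one fixed-length vector per utterance.
    """
    if not frame_vectors:
        raise ValueError("need at least one frame")
    dim = len(frame_vectors[0])
    count = len(frame_vectors)
    mean = [sum(frame[i] for frame in frame_vectors) / count for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity score of the kind used to compare two speaker embeddings."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def condition_on_speaker(
    text_states: Sequence[Sequence[float]], embedding: Sequence[float]
) -> List[Vector]:
    """Assumed conditioning scheme: append the same speaker embedding to
    every text-encoder state before it reaches the synthesizer decoder."""
    return [list(state) + list(embedding) for state in text_states]


if __name__ == "__main__":
    # Toy per-frame features for a few seconds of reference speech.
    reference_frames = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]]
    embedding = speaker_embedding(reference_frames)

    # Toy text-encoder states, one per input character or phoneme.
    text_states = [[0.5, 0.2], [0.7, 0.4]]
    conditioned = condition_on_speaker(text_states, embedding)
    print(len(conditioned), len(conditioned[0]))
```

Because the embedding is computed independently of the synthesizer, the encoder in such a design can be trained on speaker verification data without transcripts, while the synthesizer is trained on transcribed speech.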





How to Cite
Nawaz, U., Raza, U. A., Farooq, A., Iqbal, M. J., & Tariq, A. (2023). Voice Cloning Using Transfer Learning with Audio Samples. UMT Artificial Intelligence Review, 3(2). https://doi.org/10.32350/umt-air.32.04