Voice Cloning Using Transfer Learning with Audio Samples
Abstract
Voice cloning refers to the artificial replication of a specific human voice. Several deep learning approaches to voice cloning were studied and, based on this review, a cloning system is proposed that generates natural-sounding audio from only a few seconds of reference speech from the target speaker. The current study applied transfer learning from a speaker verification task to multi-speaker text-to-speech synthesis. In zero-shot mode, the system synthesizes speech in the voices of different speakers, including speakers not seen during training. The study used latent embeddings to encode speaker-specific information, allowing the remaining model parameters to be shared across all speakers. The speaker modelling stage was decoupled from speech synthesis by training a separate speaker-discriminative encoder network; because the two networks require different types of input, this decoupling allows each to be trained on an independent dataset. When used for zero-shot adaptation to unknown speakers, the embedding-based approach improves speaker similarity. Furthermore, it reduces computational resource requirements, which may be advantageous for use cases that demand low-resource deployment.
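As a minimal sketch of the embedding stage described above, the following PyTorch code shows how a speaker-discriminative encoder of this kind can map a few seconds of reference speech (as a mel spectrogram) to a fixed-size speaker embedding. All class names, layer sizes, and the dummy input are illustrative assumptions rather than the authors' implementation; the LSTM-plus-pooling design mirrors the speaker-verification encoders commonly used in such transfer-learning pipelines (e.g., Jia et al., 2018).

```python
# Hypothetical sketch of a speaker-discriminative encoder (SV2TTS-style).
# Names, shapes, and sizes are assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a mel spectrogram of reference speech to a fixed-size embedding."""
    def __init__(self, n_mels: int = 40, hidden: int = 256, emb_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels)
        frames, _ = self.lstm(mels)
        emb = self.proj(frames.mean(dim=1))  # pool over time
        return F.normalize(emb, dim=1)       # unit-norm speaker embedding

# A few seconds of reference audio yield one embedding vector; the
# synthesizer is then conditioned on it, so unseen speakers need no
# retraining (the zero-shot behaviour described in the abstract).
ref_mels = torch.randn(1, 300, 40)   # dummy input: ~3 s at a 10 ms hop
embedding = SpeakerEncoder()(ref_mels)
print(embedding.shape)               # torch.Size([1, 256])
```

Because the encoder is trained separately, on a speaker verification dataset rather than on transcribed speech, the synthesis network can consume its embeddings without the two models ever sharing training data, which is the decoupling the abstract highlights.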