Voice Cloning Using Transfer Learning with Audio Samples

  • Usman Nawaz Department of Engineering, University of Palermo, Palermo, Italy
  • Usman Ahmed Raza Department of Computer Science, University of Engineering and Technology, Lahore, Pakistan
  • Amjad Farooq Department of Computer Science, University of Engineering and Technology, Lahore, Pakistan
  • Muhammad Junaid Iqbal Department of Data Science, University of Rome, Tor Vergata, Rome, Italy
  • Ammara Tariq Department of Biochemistry and Biotechnology, University of Gujrat, Gujrat, Pakistan
Keywords: artificial intelligence, audio cloning, machine learning, natural language processing, text to speech, voice recognition


Voice cloning refers to the artificial replication of a specific human voice. This study reviewed several deep learning approaches to voice cloning and, based on that review, proposes a cloning system that generates natural-sounding audio from only a few seconds of reference speech from the target speaker. The system relies on transfer learning: knowledge learned on a speaker verification task is transferred to multi-speaker text-to-speech synthesis. In zero-shot mode, the system synthesizes speech in the voices of different speakers, including speakers who were not seen during training. Latent embeddings encode speaker-specific information, which allows the remaining model parameters to be shared across all speakers. Speaker modelling is decoupled from voice synthesis by training a separate speaker-discriminative encoder network. Because the two networks require different types of input data, this decoupling allows each to be trained on its own dataset. When used for zero-shot adaptation to unseen speakers, the embedding-based approach improves speaker similarity. It also reduces computational resource requirements, which may be advantageous for use cases that call for low-resource deployment.
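
The embedding-based conditioning described above can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the system evaluated in this study: the speaker encoder is replaced by a fixed random linear projection, and the names `speaker_embedding`, `condition_on_speaker`, `N_MELS`, `EMBED_DIM`, and `TEXT_DIM` are hypothetical. It shows only the data flow: a few seconds of reference audio are reduced to one fixed-length speaker vector, and that vector is attached to every time step of the text representation that a synthesizer would consume.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS = 40      # assumed number of features per audio frame
EMBED_DIM = 16   # assumed speaker-embedding size
TEXT_DIM = 8     # assumed text-encoder state size

# Stand-in for a trained speaker encoder: one fixed linear projection.
# A real encoder would be a network trained on speaker verification.
W_speaker = rng.standard_normal((N_MELS, EMBED_DIM))


def speaker_embedding(frames):
    """Map (n_frames, N_MELS) features to a unit-length speaker vector."""
    pooled = frames.mean(axis=0) @ W_speaker   # average over time, then project
    return pooled / np.linalg.norm(pooled)     # L2-normalize


def condition_on_speaker(text_states, embedding):
    """Append the same speaker vector to every text-encoder time step."""
    tiled = np.tile(embedding, (text_states.shape[0], 1))
    return np.concatenate([text_states, tiled], axis=1)


# Placeholder inputs: random arrays standing in for real features.
reference_audio = rng.standard_normal((200, N_MELS))  # short reference clip
text_states = rng.standard_normal((12, TEXT_DIM))     # 12 text time steps

emb = speaker_embedding(reference_audio)
conditioned = condition_on_speaker(text_states, emb)
print(conditioned.shape)
```

Because the speaker vector is computed from audio alone, the same synthesizer weights can in principle be reused for a speaker who never appeared in training, which is the sense in which parameters are shared across speakers.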


