Deepfake speech is created with a text-to-speech model that generates speech from text. Once such a model is trained, it can generate speech in any voice. These models are usually split into a speaker encoder, a synthesizer and a vocoder. The speaker encoder learns to create a fixed-dimensional latent embedding (vector) that captures the characteristic features of a particular human voice. The synthesizer learns to create a mel-spectrogram from a text transcript, conditioned on that voice embedding. The vocoder then generates an audio waveform from the mel-spectrogram.
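For intuition, here is a minimal Python sketch of how the three stages fit together. The function names, shapes and constants are placeholders of my own (not the API of any particular library); real systems use a trained neural network for each stage.

import numpy as np

def encode_voice(reference_wav: np.ndarray) -> np.ndarray:
    # Speaker encoder: map a reference recording to a fixed-dimensional
    # embedding that captures the characteristics of the voice.
    return np.zeros(256)  # placeholder for a learned 256-d voice embedding

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    # Synthesizer: predict a mel-spectrogram (mel bins x time frames)
    # for the text, conditioned on the speaker embedding.
    n_frames = 20 * len(text)        # rough placeholder for the output length
    return np.zeros((80, n_frames))  # 80 mel bins is a common choice

def vocode(mel_spectrogram: np.ndarray) -> np.ndarray:
    # Vocoder: turn the mel-spectrogram into an audio waveform.
    hop_length = 256  # samples per spectrogram frame (typical value)
    return np.zeros(mel_spectrogram.shape[1] * hop_length)

# Cloning a voice: one short reference clip is enough to get an embedding,
# after which any text can be spoken with that voice.
reference_wav = np.zeros(16000 * 5)                        # placeholder: 5 s of audio at 16 kHz
embedding = encode_voice(reference_wav)                    # stage 1: voice -> embedding
mel = synthesize("Hello from a cloned voice.", embedding)  # stage 2: text + embedding -> mel-spectrogram
waveform = vocode(mel)                                     # stage 3: mel-spectrogram -> waveform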
In this video, I introduce you to the theoretical background of text-to-speech synthesis and show you how you can create speech yourself with any voice you have access to.
My Medium Article for This Video:
https://medium.com/p/db046e009a70
00:00:00 Intro
00:01:25 Single-Speaker vs. Multi-Speaker
00:02:14 Multi-Speaker Approach
00:02:31 Speaker Encoder
00:03:55 Synthesizer
00:04:25 Mel Spectrogram
00:05:31 Vocoder
00:06:26 Model Summary
00:07:29 Hands-On Voice Cloning
00:09:36 Speech Generation
00:15:03 Outro
https://patreon.com/MartinThissen (financial support is of course completely voluntary, but some of you asked for this)
I'm happy about any feedback I can get, so feel free to share it with me in the comment section. Thanks! :)