
Demystifying AI dubbing - Part 1: Intro to Dubbing
As I’m currently working on building dubbing pipelines for Dubify.it, I realized it’d be a good opportunity to dissect and explain the main components involved.
This will be part one in a series of blog posts diving into the various components, explaining in detail how they work.
We’ll start with an overview of AI-based dubbing.
What goes into dubbing?
Roughly, we can separate the dubbing process into the following components:
- Speech To Text - Convert an audio source into text
- Translation - Translate text from the source language to the target language
- Text To Speech - Produce speech from the translated text
- Lip sync - Synchronize lip movements of characters in a video with an audio speech source
Each of these steps runs in series, with the output of one step feeding into the next.
The result is a new video in which the speaker’s voice has been naturally translated into another language, with the lip movements synced to the new audio.
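Put in code, the whole flow is just a chain of function calls. The sketch below is purely illustrative: every function in it is a hypothetical placeholder for a component we’ll cover later in the series, not a real API.

```python
# A minimal sketch of the serial dubbing pipeline. Every function here is a
# hypothetical placeholder, not a real API; later parts cover the real components.

def transcribe(audio_path: str, language: str) -> str:
    """Speech To Text: convert the source audio into a transcript."""
    raise NotImplementedError  # e.g. a Whisper-style model (part 2)

def translate(text: str, target_lang: str) -> str:
    """Translate the transcript into the target language."""
    raise NotImplementedError  # e.g. an LLM prompted to translate

def synthesize_speech(text: str, reference_audio: str) -> str:
    """Text To Speech: generate speech in the target language, cloning the original voice."""
    raise NotImplementedError  # e.g. a latent-diffusion TTS model (part 3)

def lip_sync(video_path: str, dubbed_audio_path: str) -> str:
    """Regenerate the video so the speaker's lips match the new audio."""
    raise NotImplementedError

def dub(video_path: str, audio_path: str, source_lang: str, target_lang: str) -> str:
    # Each step runs in series, feeding its output into the next.
    transcript = transcribe(audio_path, language=source_lang)
    translated = translate(transcript, target_lang=target_lang)
    dubbed_audio = synthesize_speech(translated, reference_audio=audio_path)
    return lip_sync(video_path, dubbed_audio)
```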
In a modern system, each of these components has recent AI technology at its core, mostly based on diffusion models, variational autoencoders, and transformers.
Speech To Text
Modern Speech To Text pipelines are mostly based on the transformer model architecture. The hugely popular Whisper (Robust Speech Recognition via Large-Scale Weak Supervision) is a good example of this.
While speech to text used to be a hard problem, freely available, robust, multilingual open-source models have largely solved it.
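As a rough illustration, transcribing an audio file with the open-source openai-whisper package takes only a few lines. The model size and file name below are placeholders:

```python
# Requires: pip install openai-whisper (and ffmpeg on the system path)
import whisper

# "base" is one of several model sizes; larger models are more accurate but slower.
model = whisper.load_model("base")

# transcribe() handles loading, resampling and chunking the audio internally.
result = model.transcribe("source_audio.mp3")

print(result["text"])                 # full transcript
for segment in result["segments"]:    # per-segment timestamps
    print(segment["start"], segment["end"], segment["text"])
```

The per-segment timestamps are particularly useful for dubbing, since each translated segment eventually has to be aligned back to the video.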
Translation
Translation similarly used to be a hard problem, requiring massive resources to build your own models or forcing you to lean on the Google Translate API (or one of the few reasonable alternatives). Recent advances in LLMs have made this a non-issue, as current LLMs (such as ChatGPT or Llama) are arguably better at translation than the dedicated models specifically trained for the task.
These models are based on the transformer architecture (specifically transformer decoders), originally proposed in Attention Is All You Need, with some adjustments since, but the core idea and architecture remain the same to this day.
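For illustration, here is what translation can look like behind a chat-style LLM API, using the openai Python client. The model name is an assumption; any sufficiently capable LLM (hosted or local) would do:

```python
# Requires: pip install openai, and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def translate(text: str, source_lang: str, target_lang: str) -> str:
    # The LLM is simply prompted to translate; no task-specific model is needed.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system",
             "content": f"Translate the user's text from {source_lang} to {target_lang}. "
                        "Return only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(translate("Hello, how are you?", "English", "French"))
```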
Text To Speech
The area of text to speech has seen a lot of recent advancements and publications. Fortunately, many of the recent publications include open source code and models.
Text To Speech is an integral part of any dubbing system. The easiest and most robust techniques for generating speaker audio rely on an input text combined with the original audio source. This combination allows the system to take into account the various characteristics of the speaker’s voice, tone, prosody, etc. while generating controlled audio from the given text input.
Most recent model architectures in this field rely on Latent Diffusion Models. These models comprise a rather complex combination of audio embedding models (such as Whisper), Variational Autoencoders (to reduce the dimension of the input, which in audio is generally quite large), and Diffusion Models to regenerate the modified audio part by part.
Using the audio embedding as part of the input makes it possible to clone the speaker’s voice and generate natural-sounding speech in many different languages.
This is a complex and interesting pipeline. We will cover this in depth in part 3 of the series.
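In the meantime, as a taste of the interface, here is a sketch using the open-source Coqui TTS library and its XTTS v2 model, which clones a voice from a short reference clip. XTTS is not itself a latent diffusion model, but the interface is the one described above: translated text plus the original speaker’s audio in, cloned speech out. The file names and text are placeholders:

```python
# Requires: pip install TTS (the Coqui TTS package)
from TTS.api import TTS

# XTTS v2 is a multilingual model that clones a voice from a short reference clip.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# speaker_wav provides the original speaker's voice characteristics;
# the text is the translated transcript from the previous step.
tts.tts_to_file(
    text="Bonjour, bienvenue dans cette vidéo.",   # translated text (placeholder)
    speaker_wav="original_speaker.wav",            # reference clip of the original voice
    language="fr",                                 # target language
    file_path="dubbed_segment.wav",
)
```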
Lip Sync
This is the final part required to merge the generated speech with the video.
In order for the video to look natural, we need the audio to be synced with the face and lips of the speaker.
This is done by a specialized pipeline that synchronizes the speaker’s face motion with the audio, frame by frame, and regenerates the video by concatenating the modified frames and re-encoding the result.
The leading model architectures used in this step are Latent Diffusion Models leveraging audio encoding models (such as Whisper) for latent inputs, using Variational Autoencoders (to reduce the dimension of input features) and Diffusion Models to regenerate the video, frame by frame.
The pipeline is quite similar to the Text To Speech pipeline explained above, because audio and video are usually modeled in similar ways. Audio is modeled as a mel spectrogram, which is essentially a (scaled) 2D representation of the audio wave in the frequency domain (after a Fourier transform), while a video is naturally a sequence of images (each a 2D grid of pixels, usually in 3 (RGB) channels).
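As a concrete example of the audio side, here is how a mel spectrogram is typically computed with the librosa library (the file name is a placeholder; 80 mel bands and a hop length of 256 are common choices):

```python
# Requires: pip install librosa
import librosa
import numpy as np

# Load the audio as a 1D waveform at its native sample rate.
y, sr = librosa.load("speech.wav", sr=None)

# Short-time Fourier transform + mel filter bank -> 2D array (mel bands x time frames).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# The "scaled" part: convert power to decibels (log scale), as models usually expect.
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # e.g. (80, number_of_frames) -- an image-like 2D array
```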
There are a lot of added nuances that differ from the Text To Speech pipeline, such as identifying the speaker’s face, masking part of the image and generating only the part of the face we want to change (i.e. the lip area), and so on.
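To give a feel for the frame-by-frame structure, here is a heavily simplified, hypothetical sketch of the lip-sync loop. None of the callables correspond to a real library; each stands in for one of the components mentioned above:

```python
# Hypothetical frame-by-frame lip-sync loop. Each callable argument is a stand-in
# for a real component (face detector, masking, audio-conditioned diffusion model).

def lip_sync_frames(frames, audio_features, detect_face, mask_mouth, generate, paste_back):
    output = []
    for frame, audio_feat in zip(frames, audio_features):
        box = detect_face(frame)                    # identify the speaker's face
        masked = mask_mouth(frame, box)             # mask only the lip region we want to change
        new_region = generate(masked, audio_feat)   # diffusion model conditioned on the audio
        output.append(paste_back(frame, new_region, box))
    return output  # these frames are then concatenated and re-encoded into the final video
```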
Up Next
In part 2 of this series, we’ll dive deep into how Speech-To-Text (STT) works and dissect the different parts of the STT pipeline, building a good understanding of what happens “under the hood”.