Google’s WaveNetEQ fills in speech gaps during Duo calls
Google today detailed an AI system — WaveNetEQ — it recently deployed to Duo, its cross-platform voice and video chat app, that can realistically synthesize short snippets of speech to replace garbled audio caused by an unstable internet connection. It’s fast enough to run on a smartphone while delivering state-of-the-art, natural-sounding audio quality, laying the groundwork for future chat apps optimized for bandwidth-constrained environments.
As Google explains, to ensure reliable real-time communication, it’s necessary to deal with packets (i.e., formatted units of data) that are missing when the receiver needs them. (The company says that 99% of Duo calls need to deal with network issues, and that 10% of calls lose more than 8% of the total audio duration due to network issues.) If new audio isn’t delivered continuously, audible glitches and gaps will occur, but repeating the same audio isn’t ideal because it produces artifacts and reduces overall call quality.
Google’s solution, WaveNetEQ, is what’s called a packet loss concealment (PLC) module, which is responsible for creating data to fill in the gaps created by packet losses, excessive jitter, and other mishaps.
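To make the idea of filling gaps concrete, here is a minimal Python sketch of a playout loop that notices a missing packet and asks a pluggable routine for a replacement frame. It is an illustration only, not Duo's code: the names `play_out` and `repeat_last`, the use of sequence numbers as dictionary keys, and the frame representation are all assumptions. The default routine simply replays the previous frame, which is the kind of repetition that produces the artifacts described above; a model-based module would be passed in through the same `conceal` parameter.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]


def repeat_last(history: List[Frame], frame_len: int) -> Frame:
    """Naive concealment (hypothetical helper): replay the latest frame, or silence."""
    return list(history[-1]) if history else [0.0] * frame_len


def play_out(
    packets: Dict[int, Frame],
    n_frames: int,
    frame_len: int,
    conceal: Callable[[List[Frame], int], Frame] = repeat_last,
) -> Tuple[List[Frame], List[int]]:
    """Walk sequence numbers 0..n_frames-1 and fill any missing frame.

    Returns the played frames and the sequence numbers that had to be concealed.
    """
    played: List[Frame] = []
    concealed: List[int] = []
    for seq in range(n_frames):
        if seq in packets:
            played.append(list(packets[seq]))
        else:
            # The packet did not arrive in time: synthesize a stand-in frame.
            played.append(conceal(played, frame_len))
            concealed.append(seq)
    return played, concealed


if __name__ == "__main__":
    received = {0: [0.1, 0.2], 1: [0.3, 0.4], 3: [0.7, 0.8]}  # packet 2 was lost
    frames, lost = play_out(received, n_frames=4, frame_len=2)
    print("concealed:", lost, "loss fraction:", len(lost) / 4)
```

Keeping the concealment strategy behind a single function argument means the surrounding buffer logic stays the same whichever strategy is plugged in.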
Architecturally, WaveNetEQ is a modified version of DeepMind’s WaveRNN, a machine learning model for speech synthesis consisting of autoregressive and conditioning networks. The autoregressive network provides short- and mid-term speech structure by having each generated sample depend on the network’s previous outputs, while the conditioning network influences the autoregressive network to produce audio consistent with the more slowly-moving input features.
WaveNetEQ uses the autoregressive network to provide the audio continuation and the conditioning network to model long-term features, like voice characteristics. The spectrogram — i.e., the visual representation of the spectrum of frequencies — of the past audio signal is used as input for the conditioning network, which extracts information about the prosody and textual content. This condensed information is fed to the autoregressive network, which combines it with the audio of the recent past to predict the next sample in the waveform domain.
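The data flow described above can be sketched in a few lines of Python. This is a toy with hand-picked weights, not the real networks: `conditioning_features` stands in for the conditioning network by computing a crude magnitude spectrum of the past audio with a naive DFT, and `next_sample` stands in for one autoregressive step by mixing the most recent samples with that condensed vector. All function names, weights, and sizes here are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def conditioning_features(past: Sequence[float], n_bands: int = 4) -> List[float]:
    """Stand-in for the conditioning network: DFT magnitudes for bins 1..n_bands."""
    n = len(past)
    feats = []
    for k in range(1, n_bands + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(past))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(past))
        feats.append(math.hypot(re, im) / n)
    return feats


def next_sample(
    recent: Sequence[float],
    cond: Sequence[float],
    w_recent: Sequence[float],
    w_cond: Sequence[float],
) -> float:
    """Stand-in for one autoregressive step: recent samples plus conditioning."""
    z = sum(w * x for w, x in zip(w_recent, recent))
    z += sum(w * c for w, c in zip(w_cond, cond))
    return math.tanh(z)  # keep the toy output inside (-1, 1)


def continue_audio(
    past: Sequence[float],
    n_new: int,
    w_recent: Sequence[float],
    w_cond: Sequence[float],
) -> List[float]:
    """Generate n_new samples; each one depends on the previously generated ones."""
    cond = conditioning_features(past, len(w_cond))  # computed once from past audio
    buf = list(past)
    k = len(w_recent)
    for _ in range(n_new):
        buf.append(next_sample(buf[-k:], cond, w_recent, w_cond))
    return buf[len(past):]


if __name__ == "__main__":
    past_audio = [math.sin(2 * math.pi * i / 16) for i in range(32)]
    print(continue_audio(past_audio, 8, w_recent=[0.5, 0.3], w_cond=[0.1] * 4))
```

The point of the sketch is the split of responsibilities: the slow-moving summary is computed once from the past signal, while the sample-by-sample loop only looks at the last few values plus that summary.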
To train the WaveNetEQ model, Google fed the autoregressive network samples from a training data set as input for the next step, rather than the last sample the model produced, a technique known as teacher forcing. This was to ensure WaveNetEQ learned valuable speech information even at an early stage of training, when its predictions were still low-quality. The training corpus contained voice recordings from 100 speakers in 48 different languages, as well as a wide variety of background noises to ensure that the model could deal with noisy environments.
Once WaveNetEQ was fully trained and put to use in Duo audio and video calls, real audio was used only to “warm up” the model for the first sample; after that, WaveNetEQ’s own output is passed back as input for the next step.
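The difference between the training regime and production use comes down to what is fed in at each step. The Python sketch below shows that switch with a deliberately imperfect toy predictor; the `rollout` function, its `warmup` parameter, and the one-line "model" are hypothetical and stand in for the real network. Setting `warmup` to the full sequence length feeds real samples at every step, as in training, while `warmup=1` feeds a real sample once and then loops the model's own output back in.

```python
from typing import Callable, List, Sequence


def rollout(
    step: Callable[[float], float],
    truth: Sequence[float],
    n_steps: int,
    warmup: int,
) -> List[float]:
    """Predict n_steps samples.

    For the first `warmup` steps the input is the real sample truth[t];
    afterwards the model's own previous output is fed back as input.
    """
    if not 1 <= warmup <= len(truth):
        raise ValueError("warmup must be between 1 and len(truth)")
    outputs: List[float] = []
    for t in range(n_steps):
        prev = truth[t] if t < warmup else outputs[-1]
        outputs.append(step(prev))
    return outputs


if __name__ == "__main__":
    toy_model = lambda x: 0.5 * x + 1.0  # imperfect stand-in predictor
    real = [2.0, 4.0, 6.0]
    print("training-style:", rollout(toy_model, real, n_steps=3, warmup=3))
    print("production-style:", rollout(toy_model, real, n_steps=3, warmup=1))
```

With real inputs at every step, each prediction starts from a correct sample, so an early mistake cannot compound. In the free-running case the outputs depend entirely on what the model produced before.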
WaveNetEQ is applied to the audio data in Duo’s jitter buffer so that once the real audio continues after packet loss, the synthetic and real audio streams merge seamlessly. To find the best alignment between the two signals, the model generates slightly more output than is required and then cross-fades from one to the other, avoiding noticeable noise.
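A rough Python sketch of that merge step follows. It assumes one plausible reading of "best alignment": slide the start of the resumed real audio along the over-generated synthetic audio, pick the offset with the smallest squared error, and blend linearly across the overlap. The function names, the error measure, and the linear fade are assumptions, and a production jitter buffer would also have to manage playout timing, which this toy ignores.

```python
from typing import List, Sequence


def best_offset(synth: Sequence[float], real: Sequence[float], overlap: int) -> int:
    """Offset into synth where the first `overlap` real samples fit best."""
    best, best_err = 0, float("inf")
    for off in range(len(synth) - overlap + 1):
        err = sum((synth[off + i] - real[i]) ** 2 for i in range(overlap))
        if err < best_err:
            best, best_err = off, err
    return best


def crossfade_merge(
    synth: Sequence[float], real: Sequence[float], overlap: int
) -> List[float]:
    """Join synthetic audio to resumed real audio with a linear cross-fade."""
    if overlap < 1 or overlap > min(len(synth), len(real)):
        raise ValueError("overlap must fit inside both signals")
    off = best_offset(synth, real, overlap)
    out = list(synth[:off])  # pure synthetic audio before the overlap
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the real audio rises linearly
        s, r = synth[off + i], real[i]
        out.append(s + w * (r - s))
    out.extend(real[overlap:])  # pure real audio after the overlap
    return out


if __name__ == "__main__":
    synthetic = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # slightly more than the gap needs
    resumed = [3.0, 4.0, 5.0, 6.0]
    print(crossfade_merge(synthetic, resumed, overlap=2))
```

Generating extra synthetic samples is what gives the search room to work: without the surplus there would be only one possible place to join the two signals.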
Google says that in practice, WaveNetEQ can plausibly finish syllables up to 120 milliseconds in length.
WaveNetEQ is already available in Duo on the Pixel 4 and Pixel 4 XL (it arrived as part of the December feature drop), and Google says it’s in the process of rolling out the system to additional devices. It’s unclear which devices, however; we’ve reached out to Google for clarification and we’ll update this post once we hear back.