
LL Technology Office: Editing Speech Waveforms with Neural Codec Language Models

Tue May 14, 2024 10:30–11:30 AM

Description

An exciting recent development in the speech research community is the exploration of neural codec language models (NCLMs) as generative models of speech. NCLMs have been enabled on the one hand by the invention of high-fidelity speech codecs that use neural networks to learn a discrete, tokenized representation of the speech signal, and on the other hand by powerful language models based on transformer neural networks. In this talk, I will present my group’s work on VoiceCraft, an NCLM capable of performing targeted edits of speech recordings in which words can be arbitrarily inserted, deleted, or substituted in the waveform itself. These edits preserve the speaker’s voice, prosody, and speaking style, while leaving the unedited regions of the waveform completely intact. Subjective human evaluations indicate that the naturalness of the edited speech is approximately on par with that of the unedited speech. Furthermore, I will show how voice-cloning text-to-speech synthesis can be cast as a speech editing task, and demonstrate that VoiceCraft achieves state-of-the-art performance in this task, outperforming models such as VALL-E and XTTS-v2.
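The abstract's core idea — a codec maps the waveform to discrete tokens, and edits happen in token space while untouched regions are preserved — can be illustrated with a toy sketch. This is not VoiceCraft or any real neural codec: the "codebook" below is random rather than learned, and the frame size, codebook size, and function names are all illustrative assumptions; it only shows the tokenize-then-splice structure of the approach.

```python
import numpy as np

# Toy illustration of codec-style tokenization (NOT a trained model):
# a neural codec maps fixed-length waveform frames to discrete token IDs
# via nearest-neighbor lookup in a learned codebook. Here the codebook
# is random, purely to show the data flow.

rng = np.random.default_rng(0)

FRAME = 160            # samples per frame (10 ms at 16 kHz) -- illustrative
CODEBOOK_SIZE = 1024   # number of discrete codec tokens -- illustrative

codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME))

def tokenize(waveform: np.ndarray) -> np.ndarray:
    """Quantize each frame to the ID of its nearest codebook vector."""
    n_frames = len(waveform) // FRAME
    frames = waveform[: n_frames * FRAME].reshape(n_frames, FRAME)
    # distance from every frame to every codebook entry (broadcasting)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def edit_tokens(tokens: np.ndarray, start: int, end: int,
                new_tokens: np.ndarray) -> np.ndarray:
    """A 'speech edit' in token space: splice new tokens over [start, end),
    leaving every token outside the edited span untouched."""
    return np.concatenate([tokens[:start], new_tokens, tokens[end:]])

speech = rng.normal(size=16000)                 # 1 s of stand-in audio
tokens = tokenize(speech)                       # 100 discrete tokens
replacement = tokenize(rng.normal(size=3200))   # 20 tokens of "new speech"
edited = edit_tokens(tokens, 40, 60, replacement)
print(len(tokens), len(edited))                 # 100 100
```

In a real NCLM, the replacement tokens would come from a transformer language model conditioned on the target transcript and the surrounding context tokens — that conditioning is what preserves the speaker's voice and prosody — and a codec decoder would then synthesize the edited waveform from the token sequence.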