SpeechX – Microsoft Research
“SpeechX is a versatile speech generation model leveraging audio and text prompts, which can deal with both clean and noisy speech inputs and perform zero-shot TTS and various tasks involving transforming the input speech. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. This enables unified treatment of various tasks in an extensible manner, providing a consistent way of leveraging text input for speech enhancement and transformation. The current model, trained on 60K hours of speech audio, can perform zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing, where the spoken content can be altered while preserving the speaker and background sounds…”
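The task-dependent prompting mentioned in the quote can be pictured with a minimal sketch. The Python below is an illustration only: the task token strings, the `build_prompt` helper, and the ordering of text tokens, task token, and acoustic tokens are all assumptions made here, not details given in the description above. It shows how one language model could receive different tasks simply as differently composed input sequences.

```python
from typing import Dict, List, Optional

# Hypothetical task markers. The names are illustrative assumptions,
# not tokens confirmed by the quoted description.
TASK_TOKENS: Dict[str, Optional[str]] = {
    "zero_shot_tts": None,  # assumed: no marker, the text prompt alone drives generation
    "noise_suppression": "<ns>",
    "speech_removal": "<sr>",
    "target_speaker_extraction": "<tse>",
    "speech_editing": "<edit>",
}


def build_prompt(task: str, text_tokens: List[str], acoustic_tokens: List[str]) -> List[str]:
    """Compose one input sequence: text tokens, optional task marker, acoustic tokens.

    The ordering is an assumption chosen for this sketch.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = list(text_tokens)       # text prompt (may be empty)
    marker = TASK_TOKENS[task]
    if marker is not None:
        prompt.append(marker)        # task-dependent marker
    prompt.extend(acoustic_tokens)   # codec tokens of the input or enrollment audio
    return prompt


# Same helper, two tasks: only the composition of the sequence changes.
ns_prompt = build_prompt("noise_suppression", ["hello"], ["a1", "a2"])
tts_prompt = build_prompt("zero_shot_tts", ["hello"], ["e1", "e2"])
```

Under these assumptions, adding a new task means adding one entry to the marker table rather than changing the model interface, which is one way to read the "extensible manner" the quote refers to.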