VoiceCraft is a token-filling based neural encoder-decoder language model that achieves leading performance in voice editing and zero-shot text-to-speech (TTS). For unseen voices, VoiceCraft only needs a few seconds of voice samples to clone the voice or edit the recording. The model is suitable for wild data such as audiobooks, online videos, and podcasts.