Translated data: Google's research team has released E3TTS, a high-quality end-to-end text-to-speech model. E3TTS utilizes BERT and diffusion UNet models to generate audio waveforms directly from text, supporting multilingual and zero-shot tasks. Experiments have shown that its performance is close to the state-of-the-art neural TTS systems, bringing innovation to the field of speech synthesis, enhancing quality and efficiency, and offering new opportunities for AI voice applications.