A team at Alibaba has released EMO, a portrait video generation framework that produces audio-driven portrait videos with rich facial expressions and varied head poses. EMO uses a reference network to extract features from reference images and motion frames, encodes the input audio with a pre-trained audio encoder, and combines multi-frame noise with facial region masks to generate the video. The reported experiments show EMO outperforming existing methods in expressiveness and realism. The model could advance digital media and virtual content generation, but it could also be misused for criminal purposes.
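
As a rough illustration of the "multi-frame noise plus facial region mask" step described above, the toy sketch below builds a binary face-box mask and applies it to several frames of Gaussian noise. It is not EMO's code: the function names (`make_face_mask`, `multi_frame_noise`, `combine`), the tiny frame size, and the choice to combine by elementwise addition are all assumptions made for illustration, since the summary only says the two inputs are combined.

```python
import random

def make_face_mask(height, width, box):
    """Binary mask: 1.0 inside the (top, left, bottom, right) face box, 0.0 elsewhere."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0 for c in range(width)]
        for r in range(height)
    ]

def multi_frame_noise(num_frames, height, width, seed=0):
    """One independent Gaussian noise map per output frame."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(num_frames)
    ]

def combine(noise, mask):
    """Add the same mask to every noise frame (hypothetical combination rule)."""
    return [
        [[n + m for n, m in zip(noise_row, mask_row)]
         for noise_row, mask_row in zip(frame, mask)]
        for frame in noise
    ]

# Toy sizes: 2 frames of 4x4 noise, face box covering the central 2x2 region.
mask = make_face_mask(4, 4, (1, 1, 3, 3))
noise = multi_frame_noise(2, 4, 4)
latents = combine(noise, mask)
```

In a real system, the masked noise would be passed to a denoising network together with the reference-image features and audio embeddings; those components are left out here because the summary gives no detail about them.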