OmniHuman-1 is an end-to-end framework for multimodality-conditioned human video generation. Given a single portrait and one or more motion signals (audio, video, or a combination of both), it generates a video of that person. It addresses the scarcity of high-quality training data through a mixed training strategy and accepts input images of arbitrary aspect ratio, producing realistic human videos. It performs particularly well with weak driving signals, especially audio, which makes it suitable for applications such as virtual streaming and video production.