ViewDiff is a method for generating multi-view-consistent images of real-world objects by leveraging a pretrained text-to-image model as a prior. It integrates 3D volume-rendering and cross-frame-attention layers into the U-Net, enabling the generation of 3D-consistent images in a single denoising process. Compared to existing methods, ViewDiff produces results with better visual quality and stronger 3D consistency.
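The cross-frame-attention idea can be sketched in a few lines: instead of each frame attending only to its own tokens, queries from every frame attend to keys and values pooled from all frames, which couples the views during denoising. This is a minimal NumPy illustration, not ViewDiff's actual implementation; the single shared projection-free attention and the shapes used here are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frames):
    """Toy cross-frame attention.

    frames: array of shape (n_frames, n_tokens, d) -- per-frame token features.
    Each frame's queries attend to the keys/values of *all* frames,
    so information is shared across views (hypothetical simplification:
    no learned Q/K/V projections, single head).
    """
    n, t, d = frames.shape
    q = frames                          # (n, t, d) queries per frame
    kv = frames.reshape(1, n * t, d)    # (1, n*t, d) keys/values from all frames
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(d)  # (n, t, n*t)
    attn = softmax(scores, axis=-1)     # each query distributes over all views
    return attn @ kv                    # (n, t, d) cross-view-mixed features

# Example: 4 views, 16 tokens each, 8-dim features.
rng = np.random.default_rng(0)
out = cross_frame_attention(rng.standard_normal((4, 16, 8)))
```

In the full method this mixing happens inside every U-Net block, alongside a volume-rendering layer that enforces geometric (not just feature-level) consistency.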