Denoising Vision Transformers
Provides clean visual features
CommonProductImageImage ProcessingDeep Learning
Denoising Vision Transformers (DVT) is a novel noise model for Vision Transformers (ViTs). By dissecting the ViT output and introducing a learnable denoiser, DVT can extract noise-free features, significantly improving the performance of Transformer-based models in both offline and online applications. DVT does not require retraining existing pre-trained ViTs and can be applied immediately to any Transformer-based architecture. Through extensive evaluations on multiple datasets, we found that DVT consistently and significantly improves existing state-of-the-art general models (e.g., +3.84 mIoU) in both semantic and geometric tasks. We hope our research encourages a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
Denoising Vision Transformers Visit Over Time
Monthly Visits
20899836
Bounce Rate
46.04%
Page per Visit
5.2
Visit Duration
00:04:57