PixelLLM
Pixel-Aligned Language Model
PixelLLM is a vision-language model for image localization tasks. It can generate descriptive text conditioned on an input location, and it can generate pixel coordinates for dense localization conditioned on input text. Pre-trained on the Localized Narratives dataset, the model learns an alignment between words and image pixels. PixelLLM can be applied to a variety of image localization tasks, including instruction-following localization, location-conditioned captioning, and dense object description, and has achieved state-of-the-art performance on benchmarks such as RefCOCO and Visual Genome.
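The core idea of pixel alignment is that each word in a generated caption is paired with an image coordinate. The minimal sketch below illustrates that output shape with a toy data structure; the `AlignedWord` type, the example coordinates, and the formatting helper are illustrative assumptions, not PixelLLM's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedWord:
    """One caption token paired with a pixel location (illustrative only)."""
    token: str
    pixel: Tuple[float, float]  # (x, y) in image coordinates

def format_aligned_caption(words: List[AlignedWord]) -> str:
    """Render a pixel-aligned caption, one coordinate per word."""
    return " ".join(f"{w.token}@({w.pixel[0]:.0f},{w.pixel[1]:.0f})" for w in words)

# Hypothetical model output for a caption describing an object in the image.
caption = [
    AlignedWord("a", (120.0, 80.0)),
    AlignedWord("red", (130.0, 85.0)),
    AlignedWord("bicycle", (150.0, 110.0)),
]
print(format_aligned_caption(caption))
# → a@(120,80) red@(130,85) bicycle@(150,110)
```

The same word-to-coordinate pairing is what lets the model be run "in reverse": given text, emit the coordinates instead of the tokens.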
PixelLLM Visits Over Time
- Monthly Visits: 1,134
- Bounce Rate: 44.04%
- Pages per Visit: 1.8
- Visit Duration: 00:01:00