PixelLLM
Pixel-Aligned Language Model
PixelLLM is a vision-language model for image localization tasks. It can generate descriptive text conditioned on an input location, and it can generate pixel coordinates for dense localization from input text. Pre-trained on the Localized Narratives dataset, the model learns an alignment between words and image pixels. PixelLLM can be applied to a variety of image localization tasks, including referring localization, location-conditioned captioning, and dense object captioning, and has achieved state-of-the-art performance on benchmarks such as RefCOCO and Visual Genome.
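The two directions described above (location in, text out; text in, location out) can be sketched as a minimal interface. This is a hypothetical illustration only: the class and function names, the dummy outputs, and the data layout are assumptions for exposition, not PixelLLM's released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of PixelLLM's two usage modes; all names here
# are illustrative placeholders, not the actual released interface.

@dataclass
class LocalizedCaption:
    words: List[str]                    # generated caption, one token per entry
    points: List[Tuple[float, float]]   # one (x, y) pixel coordinate per word

def caption_from_point(point: Tuple[float, float]) -> LocalizedCaption:
    """Location-conditioned captioning: describe the region around `point`.
    The dummy word list below stands in for real model inference."""
    words = ["a", "red", "ball"]
    # Pixel-aligned output: a coordinate is emitted alongside each word,
    # reflecting the word-pixel alignment learned during pre-training.
    points = [point] * len(words)
    return LocalizedCaption(words, points)

def locate_from_text(query: str) -> Tuple[float, float]:
    """Referring localization: map a text query to a pixel coordinate.
    The fixed coordinate below stands in for real model inference."""
    return (120.0, 80.0)

if __name__ == "__main__":
    pt = locate_from_text("the red ball")
    cap = caption_from_point(pt)
    print(" ".join(cap.words), cap.points[0])
```

The key property the sketch highlights is that captioning output is pixel-aligned: every generated word carries its own coordinate, rather than the model producing a single box for the whole caption.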
PixelLLM Website Visits Over Time
Monthly Visits: 1,462
Bounce Rate: 37.07%
Pages per Visit: 2.3
Visit Duration: 00:00:59