ml-ferret is an end-to-end multimodal large language model (MLLM) that can accept any form of referring input and respond with precise grounding (localization) in multimodal settings. It combines a hybrid region representation with a spatial-aware visual sampler, supporting fine-grained and open-vocabulary referring and grounding. Additionally, ml-ferret includes the GRIT dataset (approximately 1.1 million samples) and the Ferret-Bench evaluation benchmark.
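
The sketch below illustrates the general idea behind a hybrid region representation: a referred region is encoded both as discrete coordinates placed in the text prompt and as a continuous feature pooled from image features inside the region, so free-form shapes can be handled as well as boxes. This is a minimal, illustrative approximation, not the repository's actual API; all function and variable names here are assumptions, and the random point sampling stands in for the spatial-aware visual sampler.

```python
# Illustrative sketch only -- names and the pooling strategy are assumptions,
# not ml-ferret's real implementation.
import torch


def hybrid_region_representation(image_features: torch.Tensor,
                                 region_mask: torch.Tensor,
                                 box_xyxy: tuple[int, int, int, int],
                                 num_samples: int = 512) -> tuple[str, torch.Tensor]:
    """image_features: (H, W, C) dense features from an image encoder.
    region_mask: (H, W) boolean mask of the referred region (point/box/free-form).
    Returns a coordinate string for the prompt and a pooled region feature."""
    # Discrete part: box coordinates written into the text prompt.
    x1, y1, x2, y2 = box_xyxy
    coord_tokens = f"[{x1}, {y1}, {x2}, {y2}]"

    # Continuous part: sample feature points inside the mask (a stand-in for
    # the spatial-aware visual sampler) and pool them into one region feature.
    ys, xs = torch.nonzero(region_mask, as_tuple=True)
    idx = torch.randint(0, ys.numel(), (min(num_samples, ys.numel()),))
    sampled = image_features[ys[idx], xs[idx]]   # (num_samples, C)
    region_feature = sampled.mean(dim=0)         # (C,)

    return coord_tokens, region_feature


if __name__ == "__main__":
    feats = torch.randn(32, 32, 256)             # toy feature map
    mask = torch.zeros(32, 32, dtype=torch.bool)
    mask[8:20, 10:25] = True                     # toy region
    tokens, feat = hybrid_region_representation(feats, mask, (10, 8, 25, 20))
    print(tokens, feat.shape)
```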