Microsoft's latest visual foundation model, Florence-2, can now run 100% locally in browsers that support WebGPU, thanks to Transformers.js. This makes it possible to build visual recognition directly into a web page, with inference running in the user's browser rather than on remote servers.

Florence-2-base-ft is a visual foundation model with 230 million parameters that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Supported tasks include:

  1. Image caption generation
  2. Optical Character Recognition (OCR)
  3. Object detection
  4. Image segmentation
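
To make the prompt-based approach concrete, here is a minimal sketch of how the four tasks above might map to prompt tokens. The token strings are assumed from the Florence-2 model card and should be verified there; the `Task` type and `buildPrompt` helper are hypothetical names introduced only for illustration.

```typescript
// Task prompt tokens, assumed from the Florence-2 model card (verify before use).
type Task = "caption" | "ocr" | "detection" | "segmentation";

const TASK_PROMPTS: Record<Task, string> = {
  caption: "<CAPTION>",
  ocr: "<OCR>",
  detection: "<OD>",
  segmentation: "<REFERRING_EXPRESSION_SEGMENTATION>",
};

// Build the text prompt for a task. Segmentation is assumed to need extra
// text naming the object to segment; the other tasks use the bare token.
function buildPrompt(task: Task, text: string = ""): string {
  const token = TASK_PROMPTS[task];
  if (task === "segmentation") {
    const phrase = text.trim();
    if (phrase === "") {
      throw new Error("segmentation requires a phrase describing the target");
    }
    return token + phrase;
  }
  return token;
}
```

For example, `buildPrompt("ocr")` yields the OCR token alone, while `buildPrompt("segmentation", "a green car")` appends the phrase to the segmentation token. The same model weights serve every task; only the prompt changes.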


The model takes up only 340MB of storage. Once downloaded, it is cached in the browser, so users who revisit the page can run it without downloading it again. The entire process runs locally in the user's browser, with no API calls to a server. As a result, once the model is loaded, every feature keeps working even if the user disconnects from the internet.
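
The download-once behavior follows a cache-first pattern. The sketch below illustrates that pattern only; it is not the library's actual caching code. `CacheLike`, `loadCacheFirst`, and `MemoryCache` are hypothetical names, and the in-memory class stands in for the browser's cache storage so the example can run anywhere.

```typescript
// Minimal cache interface, loosely modeled on the match()/put() methods
// of the browser Cache API (simplified: real caches store Response objects).
interface CacheLike<T> {
  match(key: string): Promise<T | undefined>;
  put(key: string, value: T): Promise<void>;
}

// Return the cached value if present; otherwise download, store, and return it.
async function loadCacheFirst<T>(
  cache: CacheLike<T>,
  url: string,
  download: (url: string) => Promise<T>,
): Promise<T> {
  const hit = await cache.match(url);
  if (hit !== undefined) return hit; // served locally, no network needed
  const fresh = await download(url);
  await cache.put(url, fresh);
  return fresh;
}

// In-memory stand-in for browser cache storage, for illustration only.
class MemoryCache<T> implements CacheLike<T> {
  private store = new Map<string, T>();
  async match(key: string): Promise<T | undefined> {
    return this.store.get(key);
  }
  async put(key: string, value: T): Promise<void> {
    this.store.set(key, value);
  }
}
```

On the first call the `download` callback runs and its result is stored; on later calls with the same URL the stored copy is returned and `download` is never invoked, which is why a cached model keeps working offline.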

Running Florence-2 locally is made possible by 🤗 Transformers.js and ONNX Runtime Web. Because images never leave the device, this approach improves user privacy, and because no server-side inference is needed, it also lowers usage costs, paving the way for wider adoption of AI vision technology.
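
Since local execution depends on WebGPU, a page would typically check for it before loading the model. The following is a minimal sketch of such a check, assuming a fallback to a WASM backend is acceptable; `NavigatorLike` and `pickDevice` are hypothetical names, and the structural type is used so the code compiles without WebGPU type definitions.

```typescript
// Structural type for the part of `navigator` inspected here.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type Device = "webgpu" | "wasm";

// Prefer WebGPU when an adapter is available; otherwise fall back to WASM.
async function pickDevice(nav: NavigatorLike): Promise<Device> {
  if (!nav.gpu) return "wasm"; // browser does not expose WebGPU at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no suitable GPU
  } catch (_e) {
    return "wasm"; // treat any failure as "WebGPU unavailable"
  }
}
```

In a browser, this could be called as `pickDevice(navigator as NavigatorLike)`, and the result passed as the `device` option when loading the model with Transformers.js v3. Whether a WASM fallback is fast enough for a model of this size is a separate question that the sketch does not address.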

For developers and tech enthusiasts, the ONNX version of Florence-2 is already available on Hugging Face; see https://huggingface.co/models?library=transformers.js&other=florence2 for details. The demo's source code is public on GitHub at https://github.com/xenova/transformers.js/tree/v3/examples/florence2-webgpu for further exploration and development.

Running Florence-2 in the browser is likely to accelerate the development and adoption of AI vision applications. We can look forward to more browser-based visual tools changing how we live and work in the near future.