Researchers from Apple and the Swiss Federal Institute of Technology in Lausanne (EPFL) have jointly open-sourced 4M-21, a large-scale multimodal vision model. Unlike models optimized for specific tasks or data types, 4M-21 offers broad versatility and flexibility. Despite having only 3 billion parameters, it covers a wide range of capabilities, including image classification, object detection, semantic segmentation, instance segmentation, depth estimation, surface normal estimation, and more.

The model's core technology is a "discrete tokens" conversion scheme that maps data from every modality into a unified sequence of discrete tokens. Whether the input is image data, neural network feature maps, vectors, structured data, or text, it is converted into a single format the model can understand. This conversion not only simplifies training but also lays the foundation for multimodal learning and processing.
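To make the idea concrete, here is a minimal sketch of unified tokenization, using a toy quantizer in place of the learned, modality-specific tokenizers that the actual ml-4m pipeline uses; `quantize_to_tokens` and `VOCAB_SIZE` are illustrative assumptions, not the real API.

```python
# Sketch only: each modality is mapped into the same vocabulary space,
# so the model always sees one flat sequence of discrete token ids.
import numpy as np

VOCAB_SIZE = 8192  # hypothetical shared vocabulary size


def quantize_to_tokens(features: np.ndarray, num_tokens: int = 16) -> list[int]:
    """Toy stand-in for a learned tokenizer (e.g. a codebook lookup):
    bucket continuous values into discrete ids in a shared vocabulary."""
    flat = features.flatten()[:num_tokens]
    lo, hi = flat.min(), flat.max()
    norm = (flat - lo) / (hi - lo + 1e-8)  # normalize to [0, 1)
    return [int(v * (VOCAB_SIZE - 1)) for v in norm]


# Different modalities, same output format: a flat list of token ids.
rgb_patch = np.random.rand(16, 16, 3)  # image data
depth_map = np.random.rand(16, 16)     # depth estimation output
feat_vec = np.random.rand(256)         # neural network feature map

sequence = (
    quantize_to_tokens(rgb_patch)
    + quantize_to_tokens(depth_map)
    + quantize_to_tokens(feat_vec)
)
print(len(sequence), sequence[:8])  # one unified token sequence
```

The design point is that downstream components never need to know which modality a token came from: everything is an id in one shared vocabulary.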

Project repository: https://github.com/apple/ml-4m/

During training, 4M-21 learns across modalities through masked modeling: it randomly masks a portion of the input token sequence, then predicts the masked tokens from the remaining unmasked ones. This objective forces the model to learn the statistical structure and latent relationships in the input data, capturing the commonalities and interactions between modalities. Masked modeling improves both the model's generalization ability and its accuracy on generative tasks.
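The masking step itself is easy to illustrate. Below is a minimal, self-contained sketch of corrupting a token sequence and collecting the targets a model would be trained to predict; the `MASK_TOKEN` sentinel and `mask_sequence` helper are hypothetical stand-ins, not the actual ml-4m training code.

```python
import random

MASK_TOKEN = -1  # hypothetical sentinel id marking a masked position


def mask_sequence(tokens: list[int], mask_ratio: float = 0.5):
    """Randomly mask a fraction of the tokens; return the corrupted
    sequence plus a map of (position -> original token) to predict."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = random.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK_TOKEN
    return corrupted, targets


tokens = [5, 812, 43, 907, 12, 3301, 77, 256]
corrupted, targets = mask_sequence(tokens)
print(corrupted)  # e.g. [5, -1, 43, -1, 12, -1, 77, 256]
print(targets)    # the model is trained to recover these from the rest
```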

Researchers evaluated 4M-21 comprehensively on image classification, object detection, semantic segmentation, instance segmentation, depth estimation, surface normal estimation, and 3D human pose estimation. The results show that 4M-21's multimodal capabilities match those of current state-of-the-art models, with strong performance across all of these tasks.

Key Points:

- Apple and the Swiss Federal Institute of Technology in Lausanne (EPFL) have jointly open-sourced 4M-21, a large-scale multimodal vision model with broad versatility and flexibility.

- 4M-21 offers a range of functionalities including image classification, object detection, semantic segmentation, instance segmentation, depth estimation, surface normal estimation, and more.

- The key technology of 4M-21 is its "discrete tokens" conversion scheme, which maps data from various modalities into a unified sequence of discrete tokens.