AIM
Scalable Pre-training of Large Autoregressive Image Models
This paper introduces AIM, a family of vision models pre-trained with an autoregressive objective. Inspired by their textual counterparts, large language models (LLMs), these models exhibit similar scaling properties. Specifically, the authors highlight two key findings: (1) the performance of the visual features improves with both model capacity and the quantity of data, and (2) the value of the objective function correlates with model performance on downstream tasks. By pre-training a 7-billion parameter AIM on 2 billion images, they achieve 84.0% accuracy on ImageNet-1k with a frozen backbone. Notably, even at this scale there are no signs of performance saturation, suggesting that AIM may represent a new frontier for training large-scale vision models. AIM's pre-training mirrors that of LLMs and requires no image-specific strategies to stabilize training at scale.
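The core idea is that an image can be treated like a sentence: it is split into an ordered sequence of patches, and the model is trained to predict each patch from the ones before it, with a regression loss on raw pixel values. The sketch below illustrates that objective with plain NumPy; the `patchify` helper, the copy-the-previous-patch "model", and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an image of shape (H, W, C) into a raster-order
    sequence of flattened patches, shape (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .swapaxes(1, 2)
            .reshape(rows * cols, patch * patch * C))

def autoregressive_loss(pred, patches):
    """Mean-squared error between the prediction made at step t
    and the true patch at step t+1 (the last prediction has no target)."""
    return float(np.mean((pred[:-1] - patches[1:]) ** 2))

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
seq = patchify(img)            # (64, 48): 64 patches of 4*4*3 values
# Stand-in "model": predict the next patch as a copy of the current one.
# A real AIM would produce pred with a causally masked transformer.
pred = seq.copy()
loss = autoregressive_loss(pred, seq)
```

In the actual model, causal attention masking plays the same role as in an LLM: the prediction for patch *t+1* can only attend to patches 1 through *t*, so the single forward pass yields a valid training signal at every position.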