ODIN Model

A single model for 2D and 3D perception

Common Product, Image, computer vision, instance segmentation

ODIN (Omni-Dimensional INstance segmentation) is a model that segments and labels both 2D RGB images and 3D point clouds. It uses a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion, and it distinguishes 2D from 3D feature operations through the positional encodings of the tokens involved: pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D, and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS, and COCO. It outperforms all previous works when the sensed 3D point cloud is used in place of a point cloud sampled from the 3D mesh. As the 3D perception engine in an instructable embodied agent architecture, it sets a new state of the art on the TEACh action-from-dialogue benchmark. The code and checkpoints are available on the project website.
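The alternation between 2D and 3D fusion described above can be pictured as two attention steps applied in turn to the same tokens. The NumPy sketch below is only an illustration under stated assumptions, not ODIN's actual implementation: the function names (`attention`, `within_view_2d`, `cross_view_3d`, `fuse`), the toy sinusoidal positional encoding, the random projection matrices, and the use of dense single-head attention are all hypothetical simplifications. It assumes each 2D patch token already has a pixel coordinate and a 3D coordinate (for example, obtained from depth and camera pose).

```python
import numpy as np

def attention(tokens, pos):
    """Single-head self-attention with a residual connection.
    tokens: (N, C) features; pos: (N, C) positional encodings added to queries/keys."""
    q = k = tokens + pos
    scores = q @ k.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ tokens

def within_view_2d(feats, pixel_xy, proj_2d):
    """2D step: each view attends only to its own tokens, encoded by pixel coordinates.
    feats: (V, N, C); pixel_xy: (V, N, 2); proj_2d: (2, C) toy encoding matrix."""
    return np.stack([
        attention(feats[v], np.sin(pixel_xy[v] @ proj_2d))
        for v in range(feats.shape[0])
    ])

def cross_view_3d(feats, points_xyz, proj_3d):
    """3D step: tokens from all views attend to each other, encoded by 3D coordinates.
    feats: (V, N, C); points_xyz: (V, N, 3); proj_3d: (3, C) toy encoding matrix."""
    V, N, C = feats.shape
    flat = feats.reshape(V * N, C)
    pos = np.sin(points_xyz.reshape(V * N, 3) @ proj_3d)
    return attention(flat, pos).reshape(V, N, C)

def fuse(feats, pixel_xy, points_xyz, proj_2d, proj_3d, num_rounds=2):
    """Alternate the 2D within-view and 3D cross-view steps for a few rounds."""
    for _ in range(num_rounds):
        feats = within_view_2d(feats, pixel_xy, proj_2d)
        feats = cross_view_3d(feats, points_xyz, proj_3d)
    return feats

# Toy usage with random data: 3 views, 16 patch tokens per view, 8 channels.
rng = np.random.default_rng(0)
V, N, C = 3, 16, 8
feats = rng.normal(size=(V, N, C))
pixel_xy = rng.uniform(size=(V, N, 2))
points_xyz = rng.uniform(size=(V, N, 3))
out = fuse(feats, pixel_xy, points_xyz,
           rng.normal(size=(2, C)), rng.normal(size=(3, C)))
print(out.shape)  # (3, 16, 8)
```

In this sketch the two steps share the same attention routine and differ only in which tokens are grouped together and which coordinates feed the positional encoding. A 2D-only input would simply skip the 3D step.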

ODIN Model Alternatives

YOLOv8 — YOLOv8 Object Detection and Tracking Model

LHM — High-fidelity, animatable 3D human reconstruction model, quickly generating animated characters.

Thera — An aliasing-free arbitrary-scale super-resolution method.

MIDI — Generates high-fidelity 3D scenes from a single image using a multi-instance diffusion model.

GaussianCity — An efficient boundless 3D city generation framework that uses 3D Gaussian rendering technology for fast generation.

MLGym — MLGym is a novel framework and benchmark for advancing AI research agents.

Pippo — Pippo is a generative model that creates high-resolution, multi-view videos from a single photograph.

VideoWorld — VideoWorld is a deep generative model that explores knowledge acquisition from unlabelled video data.

Video Depth Anything — Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

ViTPose — A collection of ViTPose models implemented based on the Transformer architecture.

Diffusion as Shader — A unified architectural model supporting various video generation control tasks.

TryOffAnyone — Generates flat fabric models from images of dressed individuals.

FlagAI — A comprehensive open-source project for large model algorithms, models, and optimization tools.

video-analyzer — A video analysis tool that combines Llama's visual model and OpenAI Whisper to generate local video descriptions.

MegaSaM — Quickly and accurately estimate camera and dense structure from everyday dynamic videos.

NVIDIA Jetson Orin Nano Super Developer Kit — NVIDIA's most affordable generative AI supercomputer

Diffusion-Vas — Advanced Research on Non-Visible Object Segmentation and Content Completion in Videos

StableAnimator — A high-quality portrait animation synthesis tool with identity preservation.

CHOIS — Human-Object Interaction Synthesis technology based on Conditional Diffusion Models

PSHuman — Reconstruct realistic 3D human models from a single image.

text-to-pose — A model for generating poses from text and further generating images.

Phantomy AI — Gesture recognition technology for future presentation control

DINO-X — Unified visual model for open-world detection and understanding

Data Annotation Platform — A data annotation platform that empowers efficient management of data annotation projects for AI initiatives.

AutoSeg-SAM2 — An automatic full video segmentation tool based on Segment Anything 2 and Segment Anything 1.

TurboLens — A one-stop OCR solution for rapidly generating insights from images.

LLaMA-Mesh — Unified 3D Mesh Generation with Language Models

CountAnything — An application that uses advanced computer vision algorithms for automated and accurate counting.

NVIDIA AI Blueprint — Utilize NVIDIA AI to build video search and summarization agents.