Researchers from Illinois Institute of Technology, Zhejiang University, University of Central Florida, and University of Illinois at Chicago have unveiled Robin3D, a new large language model for 3D scenes.

The model was trained on one million instruction-following samples and achieved the best performance on five widely used 3D multimodal learning benchmarks, marking a significant step toward general-purpose 3D agents.

The success of Robin3D is attributed to its innovative data engine, RIG (Robust Instruction Generation). The RIG engine is designed to generate two critical types of instruction data: adversarial instruction-following data and diverse instruction-following data.

Adversarial instruction-following data enhances the model's discriminative understanding by mixing positive and negative samples, while diverse instruction-following data includes various instruction styles to improve the model's generalization capabilities.
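
To make the idea of mixing positive and negative samples concrete, here is a minimal Python sketch. It is not the RIG engine itself: the function name `build_mixed_samples`, the object ID format, and the yes/no question template are all hypothetical, chosen only to illustrate how questions about objects that are present can be paired with questions about objects that are absent.

```python
import random


def build_mixed_samples(scene_objects, vocabulary, n_each=2, seed=0):
    """Build yes/no existence questions for one scene.

    Half of the questions ask about categories that are present (positive
    samples) and half about categories that are absent (negative samples).
    All names and the question template are illustrative assumptions.
    """
    rng = random.Random(seed)
    present = sorted(set(scene_objects.values()))
    absent = sorted(set(vocabulary) - set(present))

    positives = rng.sample(present, min(n_each, len(present)))
    negatives = rng.sample(absent, min(n_each, len(absent)))

    samples = [
        {"instruction": f"Is there a {c} in the scene?", "answer": "Yes", "category": c}
        for c in positives
    ]
    samples += [
        {"instruction": f"Is there a {c} in the scene?", "answer": "No", "category": c}
        for c in negatives
    ]
    rng.shuffle(samples)  # interleave positives and negatives
    return samples


# Hypothetical scene: object IDs mapped to category names.
scene = {"<OBJ001>": "chair", "<OBJ002>": "table", "<OBJ003>": "lamp"}
vocab = ["chair", "table", "lamp", "sofa", "bed", "sink"]
samples = build_mixed_samples(scene, vocab)
for s in samples:
    print(s["instruction"], "->", s["answer"])
```

A model trained only on the "Yes" half could learn to answer affirmatively by default; the "No" half forces it to check the scene.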

The researchers note that existing 3D large language models are trained primarily on positive 3D vision-language pairs and template-based instructions, which limits generalization and raises the risk of overfitting. Robin3D addresses these limitations by introducing adversarial and diverse instruction data.

Robin3D also integrates a Relation-Augmented Projector (RAP) and ID-Feature Bonding (IFB) to strengthen its referring and grounding capabilities. The RAP module enriches object-centric features with scene-level context and positional information, while the IFB module strengthens the connection between each object ID and its corresponding feature by binding the two together.
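
One way to picture this ID-to-feature binding is as a step that keeps each object's ID token next to its feature when the input sequence is assembled. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `bond_ids_to_features`, the choice to wrap each feature with the same ID on both sides, and the plain-list representation of features are all hypothetical.

```python
def bond_ids_to_features(object_ids, object_features):
    """Interleave ID tokens with object features in one sequence.

    Each feature is wrapped by its own ID on both sides (an assumed layout),
    so the ID and the feature it refers to are always adjacent.
    """
    if len(object_ids) != len(object_features):
        raise ValueError("each object ID needs exactly one feature")

    sequence = []
    for obj_id, feature in zip(object_ids, object_features):
        sequence.append(("id", obj_id))
        sequence.append(("feature", feature))
        sequence.append(("id", obj_id))
    return sequence


# Hypothetical IDs and two-dimensional toy features.
ids = ["<OBJ001>", "<OBJ002>"]
features = [[0.1, 0.2], [0.3, 0.4]]
bonded = bond_ids_to_features(ids, features)
for kind, value in bonded:
    print(kind, value)
```

In a real model the entries would be embedding vectors rather than tuples, but the ordering is the point: a response that mentions an ID can be traced back to the feature sitting beside that ID in the input.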

Experimental results show that Robin3D surpasses previous best methods across five benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without the need for task-specific fine-tuning.

On Multi3DRefer in particular, an evaluation that includes zero-target cases, Robin3D improved on the F1@0.25 and F1@0.5 metrics by 7.8% and 7.3%, respectively.
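
For readers unfamiliar with these metrics, the following Python sketch shows how an F1 score at an IoU threshold can be computed for a single query, including the zero-target case where the correct output is no box at all. It is a simplified illustration under stated assumptions: boxes are axis-aligned, predictions are matched to ground truth greedily, and the function names `box_iou` and `f1_at_iou` are hypothetical. The benchmark's official evaluation script may match boxes differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def f1_at_iou(pred_boxes, gt_boxes, threshold):
    """F1 for one query; a prediction counts if its IoU is at least `threshold`."""
    if not gt_boxes:
        # Zero-target case: full credit only when nothing is predicted.
        return 1.0 if not pred_boxes else 0.0
    if not pred_boxes:
        return 0.0

    unmatched = list(gt_boxes)
    true_positives = 0
    for pred in pred_boxes:
        # Greedy matching (a simplification): take the best remaining box.
        best = max(unmatched, key=lambda gt: box_iou(pred, gt), default=None)
        if best is not None and box_iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1

    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_boxes)
    recall = true_positives / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)]
pred = [(0.0, 0.0, 0.0, 1.0, 1.0, 0.4)]  # covers 40% of the ground-truth box
print(f1_at_iou(pred, gt, 0.25), f1_at_iou(pred, gt, 0.5))
print(f1_at_iou([], [], 0.25), f1_at_iou(pred, [], 0.25))
```

The zero-target branch is what makes this setting demanding: a model that always outputs some box scores zero on every query whose description matches nothing in the scene.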

The release of Robin3D signifies a major advancement in spatial intelligence for 3D large language models, laying a solid foundation for building more universal and powerful 3D agents in the future.

Paper link: https://arxiv.org/pdf/2410.00255