Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

¹University of California, Merced   ²Skywork AI

Reason3D is a novel LLM-based framework for dense 3D point cloud searching and reasoning that outputs dense segmentation masks from textual information. Reason3D handles tasks involving 1) 3D reasoning segmentation, 2) 3D hierarchical searching, 3) 3D express referring, and 4) 3D question answering with corresponding dense segmentation masks.

Abstract

Recent advancements in multimodal large language models (LLMs) have shown their potential in various domains, especially concept reasoning. Despite these developments, their application to understanding 3D environments remains limited: existing methods primarily offer textual or numerical outputs without the capability to generate dense, informative segmentation masks.

This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D takes point cloud data and text prompts as input to produce textual responses and segmentation masks, facilitating advanced tasks like 3D reasoning segmentation, hierarchical searching, express referring, and question answering with detailed mask outputs.

Specifically, we propose a hierarchical mask decoder to locate small objects within expansive scenes. The decoder first generates a coarse estimate of the location that likely covers the object's general area. This coarse estimate then anchors a coarse-to-fine segmentation strategy that significantly enhances the precision of object identification and segmentation.
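To make the coarse-to-fine idea concrete, below is a minimal PyTorch sketch under our own assumptions (the element-wise gating, the single-linear heads, and all tensor shapes are illustrative, not the paper's exact formulation): a coarse head conditioned on a location embedding first scores each superpoint, and a fine head then predicts mask logits over the gated features.

import torch
import torch.nn as nn

class CoarseToFineDecoder(nn.Module):
    """Illustrative coarse-to-fine mask decoder: a coarse head scores how
    likely each superpoint lies in the object's general area, and a fine
    head predicts the final mask over the coarsely located region.
    (Module design is an assumption, not the released implementation.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.coarse_head = nn.Linear(dim, 1)  # per-superpoint location score
        self.fine_head = nn.Linear(dim, 1)    # per-superpoint mask logit

    def forward(self, sp_feats, loc_emb, seg_emb):
        # sp_feats: (N, D) superpoint features; loc_emb, seg_emb: (D,) token embeddings
        coarse = torch.sigmoid(self.coarse_head(sp_feats * loc_emb)).squeeze(-1)  # (N,)
        gated = sp_feats * coarse.unsqueeze(-1)  # down-weight features outside the coarse region
        mask_logits = self.fine_head(gated * seg_emb).squeeze(-1)                 # (N,)
        return coarse, mask_logits

# toy usage: 50 superpoints with 64-dim features
decoder = CoarseToFineDecoder(64)
coarse, mask_logits = decoder(torch.randn(50, 64), torch.randn(64), torch.randn(64))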

Reason3D Data


Examples of 3D reasoning segmentation, a task that requires in-depth world knowledge and reasoning ability.

Method: Reason3D


Initially, we utilize a point encoder to extract dense features from the input scene, which are simplified by a superpoint pooling layer to reduce complexity. An interactor then merges the superpoint features with a learnable query, and the result is fed into a frozen LLM together with the instruction to generate an output containing two special tokens, [LOC] and [SEG]. A hierarchical decoder uses the [LOC] embedding to estimate a coarse location that likely covers the object. Finally, this estimated location is integrated with the [SEG] embedding to predict the final segmentation masks.
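As a rough illustration of two pieces of this pipeline, here is a self-contained PyTorch sketch of superpoint average pooling and of pulling the [LOC]/[SEG] embeddings out of the LLM's hidden states. All shapes, token ids, and the choice of mean pooling are assumptions for demonstration, not the released implementation.

import torch

def superpoint_pool(point_feats, superpoint_ids):
    """Average dense point features within each superpoint to reduce complexity.
    point_feats: (P, D) float features; superpoint_ids: (P,) int64 labels in [0, N)."""
    num_sp = int(superpoint_ids.max()) + 1
    pooled = torch.zeros(num_sp, point_feats.size(1))
    pooled.index_add_(0, superpoint_ids, point_feats)
    counts = torch.bincount(superpoint_ids, minlength=num_sp).clamp(min=1)
    return pooled / counts.unsqueeze(1)  # (N, D)

def extract_token_embeddings(output_ids, hidden_states, loc_id, seg_id):
    """Collect the hidden states at the [LOC] and [SEG] token positions.
    output_ids: (T,) generated token ids; hidden_states: (T, D)."""
    loc_emb = hidden_states[output_ids == loc_id].mean(dim=0)  # (D,)
    seg_emb = hidden_states[output_ids == seg_id].mean(dim=0)  # (D,)
    return loc_emb, seg_emb

# toy usage with made-up sizes and token ids
point_feats = torch.randn(1000, 64)             # from the point encoder
superpoint_ids = torch.randint(0, 50, (1000,))  # precomputed superpoint assignment
sp_feats = superpoint_pool(point_feats, superpoint_ids)  # (50, 64)

output_ids = torch.tensor([5, 9, 32000, 7, 32001])  # pretend 32000=[LOC], 32001=[SEG]
hidden_states = torch.randn(5, 64)
loc_emb, seg_emb = extract_token_embeddings(output_ids, hidden_states, 32000, 32001)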

Visualization


BibTeX

@article{reason3d,
  title={Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model},
  author={Kuan-Chih Huang and Xiangtai Li and Lu Qi and Shuicheng Yan and Ming-Hsuan Yang},
  journal={arXiv},
  year={2024}
}