Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

1University of California, Merced 2Skywork AI

Reason3D is a novel LLM-based framework for dense 3D point cloud searching and reasoning that outputs dense segmentation masks from textual queries. Reason3D handles tasks involving 1) 3D reasoning segmentation, 2) 3D hierarchical searching, 3) 3D express referring, and 4) 3D question answering, each with corresponding dense segmentation masks.


Recent advances in multimodal large language models (LLMs) have shown their potential across various domains, especially in concept reasoning. Despite these developments, applications to understanding 3D environments remain limited: existing models primarily offer textual or numerical outputs and cannot generate dense, informative segmentation masks.

This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D takes point cloud data and text prompts as input to produce textual responses and segmentation masks, facilitating advanced tasks like 3D reasoning segmentation, hierarchical searching, express referring, and question answering with detailed mask outputs.

Specifically, we propose a hierarchical mask decoder to locate small objects within expansive scenes. The decoder first produces a coarse estimate of the region that likely contains the object; this estimate then guides a coarse-to-fine segmentation strategy that significantly improves the precision of object identification and segmentation.
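The coarse-to-fine idea can be illustrated with a minimal numpy sketch. The function name, the dot-product scoring, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual decoder:

```python
import numpy as np

def coarse_to_fine_mask(point_feats, loc_query, seg_query, keep_ratio=0.3):
    """Illustrative sketch of coarse-to-fine mask prediction.

    point_feats: (N, C) per-point (or per-superpoint) features
    loc_query:   (C,) embedding used to score a coarse location
    seg_query:   (C,) embedding used to predict the fine mask
    """
    # Stage 1: coarse location -- score every point against the location
    # query and keep the top-scoring region that likely covers the object.
    loc_scores = point_feats @ loc_query                 # (N,)
    k = max(1, int(keep_ratio * len(loc_scores)))
    region = np.argsort(loc_scores)[-k:]                 # candidate point indices

    # Stage 2: fine mask -- score the segmentation query only inside the
    # coarse region, so a small object is not drowned out by the full scene.
    seg_scores = point_feats[region] @ seg_query         # (k,)
    mask = np.zeros(len(point_feats), dtype=bool)
    mask[region] = seg_scores > 0.0                      # binary mask over all points
    return mask

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))
mask = coarse_to_fine_mask(feats, rng.normal(size=16), rng.normal(size=16))
```

Restricting the fine stage to the coarse region is what makes the two-stage design helpful for small objects in large scenes: the second scoring pass only competes among candidate points, not the entire point cloud.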

Reason3D Data


Examples of 3D reasoning segmentation, which requires in-depth world knowledge and reasoning.

Method: Reason3D


We first use a point encoder to extract dense features from the input scene, which a superpoint pooling layer simplifies to reduce complexity. An interactor then fuses the superpoint features with learnable queries, and the result is fed into a frozen LLM along with the instruction to generate a response containing two special tokens, [LOC] and [SEG]. A hierarchical decoder uses the [LOC] embedding to estimate a coarse location that likely covers the target object; this estimate is then combined with the [SEG] embedding to predict the final segmentation mask.




Citation

  title={Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model},
  author={Kuan-Chih Huang and Xiangtai Li and Lu Qi and Shuicheng Yan and Ming-Hsuan Yang},