Paper Information
- Document type
- Academic journal
- Author information
- Publication year
- 2026.5
- Pages
- 805-819 (15 pages)
- DOI
- 10.9717/kmms.2026.29.5.805
Abstract and Keywords
3D Visual Grounding (3DVG) aims to localize objects in a 3D scene that correspond to a given natural language query and plays a critical role in applications such as robotics and autonomous systems. With recent advances in Vision-Language Models (VLMs), zero-shot approaches to 3DVG that leverage pre-trained VLMs without task-specific 3D supervision have gained increasing attention. However, such approaches rely heavily on pre-trained knowledge and are sensitive to the configuration of textual and visual inputs. While fully supervised 3DVG methods have been extensively studied, systematic analysis of zero-shot VLM-based 3DVG remains limited. In this work, we conduct a comprehensive analysis of VLM-based zero-shot 3D Visual Grounding by varying natural language query formulations and visual input configurations, with a particular focus on modality contribution. Our analysis reveals that current VLM-based zero-shot approaches exhibit limited capability in relational reasoning and tend to rely on textual cues rather than visual evidence. These findings highlight inherent structural limitations of existing zero-shot VLM-based 3DVG pipelines. Based on our observations, we further discuss the necessity of incorporating structured 3D representations or explicit mechanisms for modeling spatial relationships to enable more reliable reasoning in future zero-shot 3D Visual Grounding systems.
Table of Contents
- ABSTRACT
- 1. Introduction
- 2. Related Work
- 3. Analysis Method
- 4. Experimental Results and Analysis
- 5. Conclusion
- REFERENCE