Basic Paper Information
- Document type: Academic journal
- Author information
- Publication date: 2026.6
- Pages: 182-193 (12 pages)
Abstract · Keywords
Visual Question Answering (VQA) effectively addresses semantic queries about objects in an image but remains limited in inferring physical properties, such as inter-object distances, from a monocular image. This study proposes DistVQA (Distance Visual Question Answering), a novel task that integrates VQA with metric distance estimation between objects within a single image. We then present DIF (Distance Inference Framework), a framework that accomplishes this task without additional training by leveraging frozen pre-trained models. Key contributions include: 1) defining the new task of answering questions about physical distances between objects; 2) creating the DistVQA-VK2 benchmark dataset, derived from Virtual KITTI 2, which contains image-question pairs querying inter-object distances with ground-truth answers; 3) proposing the DIF framework, which comprises a Metric Depth Estimator, a Focal Length Estimator, a Small Language Model, an Object Segmentation Module, and a Distance Calculation Module that supports both center-to-center and minimum surface-to-surface distances; and 4) introducing the evaluation metrics QLA (Queried-object Localization Accuracy), DVA (DistVQA Accuracy), and CDA (Conditional Distance Accuracy) for jointly assessing localization and distance reasoning. Experimental results show that QLA reaches up to 87%, DVA achieves 87%, and CDA remains above 82% across all conditions. These findings establish the first baseline for the DistVQA task and demonstrate the feasibility of a modular, training-free distance reasoning framework that operates on a monocular image.
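The abstract names two distance modes, center-to-center and minimum surface-to-surface, without giving their formulas. The sketch below is one plausible reading, not the paper's implementation. It assumes a pinhole camera with the principal point at the image center, a per-pixel metric depth map, a focal length in pixels, and one non-empty binary mask per object. The function names `back_project`, `center_distance`, and `min_surface_distance` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def back_project(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    focal_px: float,
) -> List[Point3D]:
    """Lift the masked pixels to 3D camera coordinates (pinhole model)."""
    height, width = len(depth), len(depth[0])
    # Assumption: the principal point sits at the image center.
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = []
    for v in range(height):
        for u in range(width):
            if mask[v][u]:
                z = depth[v][u]  # metric depth along the optical axis
                points.append(((u - cx) * z / focal_px, (v - cy) * z / focal_px, z))
    return points


def center_distance(a: Sequence[Point3D], b: Sequence[Point3D]) -> float:
    """Euclidean distance between the centroids of two point sets."""
    def centroid(pts: Sequence[Point3D]) -> Tuple[float, ...]:
        return tuple(sum(axis) / len(pts) for axis in zip(*pts))
    return math.dist(centroid(a), centroid(b))


def min_surface_distance(a: Sequence[Point3D], b: Sequence[Point3D]) -> float:
    """Smallest distance between any visible point of a and any of b."""
    return min(math.dist(p, q) for p in a for q in b)


# Toy example: a 1x5 depth map at a constant 10 m, focal length 10 px.
depth = [[10.0] * 5]
obj_a = back_project(depth, [[1, 1, 0, 0, 0]], focal_px=10.0)
obj_b = back_project(depth, [[0, 0, 0, 1, 1]], focal_px=10.0)
print(center_distance(obj_a, obj_b))       # 3.0
print(min_surface_distance(obj_a, obj_b))  # 2.0
```

The brute-force pairwise minimum is quadratic in the number of masked pixels. A real pipeline would likely subsample the masks or use a spatial index, and only visible surfaces contribute because the points come from a single depth map.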
Table of Contents
- Abstract
- 1. Introduction
- 2. Related Work
- 3. Dataset
- 4. Framework
- 5. Experiments and Results
- 6. Conclusion
- References