DH-Bench: Probing Depth and Height Perception of Large Visual-Language Models

1 Center for Research in Computer Vision, University of Central Florida; 2 Microsoft Research; 3 Indian Institute of Technology, Kharagpur.

What is DH-Bench?

DH-Bench is a benchmark of programmatically generated synthetic and real-world data for depth and height perception tasks that humans solve easily but that pose significant challenges for current visual-language models (VLMs).


Example images in DH-Bench. Each sample is shown with randomly chosen query attributes: color and numeric label for Synthetic 2D, color and material for Synthetic 3D, and numeric label for the Real-World dataset.


An example image in DH-Bench, along with its prompt, the different question types, and the corresponding answers.
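For illustration only, the sketch below shows one way a DH-Bench-style multiple-choice query could be assembled and scored with exact-match accuracy. The sample schema (question, options, image_path, answer), the ask_vlm helper, and the file name dh_bench_samples.json are assumptions made for this sketch, not the benchmark's actual data format or API.

# Hypothetical sketch of a DH-Bench-style evaluation loop.
# The sample fields and the ask_vlm() helper are illustrative assumptions.
import json

def build_prompt(sample):
    # Compose a multiple-choice question about depth or height,
    # referring to objects by their query attributes (e.g., color or numeric label).
    options = ", ".join(sample["options"])
    return (
        f"{sample['question']}\n"
        f"Choose one of: {options}. Answer with the option text only."
    )

def evaluate(samples, ask_vlm):
    # ask_vlm(image_path, prompt) -> the model's text answer (user-supplied).
    correct = 0
    for sample in samples:
        prediction = ask_vlm(sample["image_path"], build_prompt(sample))
        correct += int(prediction.strip().lower() == sample["answer"].lower())
    return correct / len(samples)

if __name__ == "__main__":
    with open("dh_bench_samples.json") as f:  # hypothetical file name
        samples = json.load(f)
    # Plug in any VLM client as ask_vlm; accuracy here is simple exact match.
    # print(evaluate(samples, my_vlm_client))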

This website is under construction. Thank you for your patience.

Abstract

Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios requires a comparable geometric understanding in visual perception. In this work, we focus on the geometric comprehension of these models, specifically targeting the depth and height of objects within a scene. Our observations reveal that, although VLMs excel at perceiving basic geometric properties such as shape and size, they encounter significant challenges in reasoning about the depth and height of objects. To address this, we introduce a suite of benchmark datasets, encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios, to rigorously evaluate these aspects. We benchmark 17 state-of-the-art VLMs on these datasets and find that they consistently struggle with both depth and height perception. Our key insights include detailed analyses of the shortcomings in VLMs' depth and height reasoning capabilities and the inherent biases present in these models. This study aims to pave the way for the development of VLMs with enhanced geometric understanding, which is crucial for real-world applications.

BibTeX

@misc{azad2024dhbenchprobingdepthheight,
  title={DH-Bench: Probing Depth and Height Perception of Large Visual-Language Models},
  author={Shehreen Azad and Yash Jain and Rishit Garg and Yogesh S Rawat and Vibhav Vineet},
  year={2024},
  eprint={2408.11748},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.11748},
}