Understanding Depth and Height Perception of Large Visual-Language Models

1 Center for Research in Computer Vision, University of Central Florida;
2 Microsoft Research; 3 Indian Institute of Technology, Kharagpur.

CVPRW 2025

What is GeoMeter?

GeoMeter is a benchmark of programmatically generated synthetic data for depth and height perception tasks that humans solve easily but that pose significant challenges for current vision language models (VLMs).


Failure cases of GPT-4V on depth and height perception tasks from GeoMeter, our proposed suite of benchmark datasets.

Abstract

Geometric understanding - including depth and height perception - is fundamental to intelligence and crucial for navigating our environment. Despite the impressive capabilities of large Vision Language Models (VLMs), it remains unclear how well they possess the geometric understanding required for practical applications in visual perception. In this work, we focus on evaluating the geometric understanding of these models, specifically targeting their ability to perceive the depth and height of objects in an image. To address this, we introduce GeoMeter, a suite of benchmark datasets - encompassing 2D and 3D scenarios - to rigorously evaluate these aspects. By benchmarking 18 state-of-the-art VLMs, we found that although they excel in perceiving basic geometric properties like shape and size, they consistently struggle with depth and height perception. Our analysis reveals that these challenges stem from shortcomings in their depth and height reasoning capabilities and inherent biases. This study aims to pave the way for developing VLMs with enhanced geometric understanding by emphasizing depth and height perception as critical components necessary for real-world applications.

GeoMeter -- Characteristics and Statistics

  • GeoMeter specifically probes depth and height perception, whereas previous benchmarks cover general-purpose, recognition-based spatial reasoning tasks.
  • GeoMeter provides insight into current VLMs' limitations in complex visual perception.
  • GeoMeter contains 11.4k programmatically generated image-text pairs across depth and height categories, covering multiple unique query attributes and varying scene densities; questions are MCQ and True/False types.
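The programmatic generation idea can be illustrated with a minimal sketch, assuming a draw-order convention where later shapes occlude earlier ones; the function and field names below are hypothetical and not the authors' actual generation code.

```python
import random

def make_depth_question(num_shapes=3, seed=0):
    """Build one hypothetical GeoMeter-2D-style depth MCQ sample."""
    rng = random.Random(seed)
    palette = ["red", "green", "blue", "yellow", "purple"]
    colors = rng.sample(palette, num_shapes)
    # Shapes are drawn in list order, so each later shape partially occludes
    # the earlier ones: index 0 is farthest from the viewer, the last is closest.
    question = (
        "The image shows overlapping rectangles. Which rectangle is "
        f"closest to the viewer? Choices: {', '.join(colors)}."
    )
    return {"draw_order": colors, "question": question, "answer": colors[-1]}

sample = make_depth_question(seed=42)
print(sample["question"])
```

Because the draw order is the ground-truth depth order, the correct answer falls out of the generation procedure itself, with no manual annotation needed.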


Samples from the proposed suite of benchmark datasets. Each sample is shown with random query attributes: color and numeric label for GeoMeter-2D, and color and material for GeoMeter-3D.


Sample image-text pair. Here, the prompt template shows the basic template for each image-text pair in our datasets, and the prompt example is the actual prompt for the image. Each prompt example is appended with either an MCQ or a True/False question.
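The template-plus-question assembly described above might be sketched as follows; the exact wording and the `build_prompt` helper are illustrative assumptions, not the paper's verbatim template.

```python
def build_prompt(scene_description, question, qtype, choices=None):
    """Assemble a prompt: shared base template + MCQ or True/False question."""
    base = f"{scene_description} Answer based only on the image.\n"
    if qtype == "mcq":
        # Number the choices so the model can answer with an index.
        opts = " ".join(f"({i}) {c}" for i, c in enumerate(choices, 1))
        return base + f"{question}\nChoices: {opts}\nAnswer with the choice number."
    return base + f"{question}\nAnswer True or False."

p = build_prompt(
    "The image contains several overlapping rectangles.",
    "Which rectangle is closest to the viewer?",
    "mcq",
    choices=["red", "blue", "green"],
)
print(p)
```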

Quantitative Results

Performance comparison of the studied models on the proposed datasets. The reported results are averaged across depth and height categories, query attributes, and scene densities, with top scores in bold.

Analysis


1. Models show basic visual reasoning capability but struggle in advanced perception tasks.

We developed a specialized dataset called GeoMeter-2D-Basic to evaluate the fundamental visual reasoning capabilities of VLMs. This dataset focuses on basic geometric tasks like line understanding, shape recognition, shape counting, and assessing spatial relationships between shapes. While models perform well on these simpler tasks, they show significant difficulty with depth and height perception, revealing limitations in handling complex spatial reasoning. This highlights GeoMeter's usefulness in pinpointing gaps in VLM capabilities.

Performance of selected models on basic visual reasoning tasks (samples shown in left). Here, LU, SI, SC and SR respectively denote line understanding, shape identification, shape counting and spatial reasoning.


2. Height perception poses greater challenges than depth perception, especially in stacked object arrangements.

Models perform better on depth perception tasks than height perception, likely because training data contain simpler depth cues like occlusion and perspective. Height perception is more complex, involving vertical positioning and size relationships in stacked objects. Analysis from the GeoMeter-3D dataset shows a minor performance gap for single objects but a significant drop in height task accuracy with stacked objects, indicating that vertical spatial reasoning is particularly challenging for VLMs.

Here, ∆ denotes the performance gap between depth and height perception, which grows even larger with stacked arrangements of objects, as opposed to single objects. This suggests that while models struggle with height perception in general, stacked objects further degrade their performance.

3. Models' limitations stem from inherent reasoning capability, not insufficient prompt detail.

To enhance reasoning, chain-of-thought prompting was applied. Despite providing detailed intermediate reasoning steps, top-performing models showed only slight performance gains, suggesting they already perform some internal reasoning. These results indicate that the models' limited depth and height perception stems from inherent spatial understanding limitations, highlighting the need for architectural improvements over prompt-based solutions.

Example of prompt engineering using chain of thought prompting.

Performance gain with chain of thought prompting over standard prompting.
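The chain-of-thought variant can be pictured as a thin wrapper around the standard prompt; the step wording below is an assumed paraphrase for illustration, not the paper's exact prompt.

```python
# Assumed intermediate reasoning steps prepended to the standard prompt.
COT_STEPS = (
    "First, identify each object and its position in the image. "
    "Next, reason step by step about which objects occlude others or are "
    "stacked on top of others. Finally, answer the question."
)

def with_chain_of_thought(standard_prompt):
    """Prepend chain-of-thought instructions to an existing prompt."""
    return COT_STEPS + "\n" + standard_prompt

cot_prompt = with_chain_of_thought(
    "Which rectangle is closest to the viewer? Answer with the choice number."
)
print(cot_prompt)
```

The finding above is that even with these explicit intermediate steps, the gain over the standard prompt is small, pointing to the limitation being in the models rather than the prompt.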

4. Some open-source models are more biased towards picking True over False than others.

Some open-source models perform near chance (around 50% accuracy) on True/False questions, indicating a tendency to guess—often biased toward "True." This bias likely stems from training data imbalances with more affirmative statements and fewer false examples. Experiments confirm this behavior, with performance dropping when all answers are "False." Rather than true reasoning, models rely on heuristics, exposing their struggle with logical consistency and uncertainty in complex situations. This evaluation reveals a key weakness in context-driven reasoning and highlights the need for improved model training and design.

Effect of ground truth value in True/False questions. GT-R denotes ground truth set randomly to true or false, whereas GT-T/F denotes ground truth that is always true or always false.
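The ground-truth manipulation probe can be sketched with a toy "always-True-leaning" model standing in for a biased VLM; the numbers are illustrative, not the paper's results.

```python
def accuracy(predictions, ground_truths):
    """Fraction of predictions matching the ground truth."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# A hypothetical model biased toward answering "True" (8 of 10 answers).
preds = ["True"] * 8 + ["False"] * 2

gt_true = ["True"] * 10    # GT-T: the bias inflates accuracy
gt_false = ["False"] * 10  # GT-F: the same bias collapses accuracy
print(accuracy(preds, gt_true), accuracy(preds, gt_false))  # 0.8 0.2
```

A model that genuinely reasons about each question would score similarly under GT-T and GT-F; a large gap between the two settings is the signature of guessing.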

5. Some open-source models are more biased towards picking the first choice in MCQs.

Experiments show that open-source models are biased toward selecting the first MCQ option, especially when it's correct, likely due to training data patterns. Their performance drops when the correct answer is absent, revealing difficulty with “None of the above” choices and a reliance on heuristics over reasoning. In contrast, closed-source models remain consistent across answer placements. These findings suggest open-source models often use pattern recognition rather than true understanding, highlighting limitations in their decision-making and reasoning capabilities.

Effect of ground truth ordering in MCQ choices. GT-C1 and GT-Ab denote the ground truth being choice 1 and absent from the choices, respectively.
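The answer-ordering probe can be illustrated with a toy first-choice-biased model; `first_choice_model` is a stand-in for a biased VLM, not any real model's behavior.

```python
def first_choice_model(choices):
    """Toy model exhibiting position bias: always picks the first option."""
    return choices[0]

def score(correct, choices):
    """Return whether the biased model happens to answer correctly."""
    return first_choice_model(choices) == correct

# GT-C1: correct answer placed first -> the biased model looks accurate.
print(score("red", ["red", "blue", "green"]))   # True
# Correct answer moved to a later position -> the same model fails.
print(score("red", ["blue", "red", "green"]))   # False
```

Shuffling the position of the correct answer (or removing it entirely, as in GT-Ab) separates genuine reasoning from positional heuristics.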

BibTeX

@misc{azad2025understandingdepthheightperception,
  title={Understanding Depth and Height Perception in Large Visual-Language Models},
  author={Shehreen Azad and Yash Jain and Rishit Garg and Yogesh S Rawat and Vibhav Vineet},
  year={2025},
  eprint={2408.11748},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.11748},
}