Stanford University
1 Computer Science 🤖 2 Psychology 🧠
Humans excel at visual social inference: inferring hidden elements of a scene from subtle behavioral cues such as gaze, pose, and orientation. This ability drives everyday social reasoning and is critical for embodied AI. We introduce Spot the Ball, a benchmark for evaluating visual social inference in vision–language models (VLMs) using sports as a test domain. The task is to localize a missing sports ball in soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) and three prompting strategies, finding that humans are consistently two to three times more accurate than models (20–34% vs. 17%) across all sports. Error analyses reveal that models exhibit systematic, unproductive biases and fail to leverage the social cues, such as gaze and body pose, that humans rely on. Our results highlight a persistent human–model gap in visual social inference and underscore the need for architectures that explicitly integrate structured behavioral cues for robust multimodal reasoning.
Our Spot The Ball benchmark is built using a scalable generation pipeline that processes raw broadcast sports videos into structured inference tasks with human annotations. The pipeline consists of four stages:
We retrieve YouTube clips using sport-specific queries to bias toward high-action moments with clear player interactions.
Frames are filtered using multimodal similarity to text prompts to retain only visually informative scenes with a visible ball and players.
An object detector identifies ball and player locations. The ball is removed and realistically filled using diffusion-based inpainting.
A 6×10 grid is applied, ground-truth locations are recorded, and scenes are prepared for model and human evaluation.
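As a concrete illustration of the final stage, the sketch below maps a detected ball's pixel coordinates to a grid-cell label. It assumes rows A–F and columns 1–10 over the full frame; the benchmark's exact labeling convention may differ.

# Minimal sketch of the grid-labeling step, assuming a 6-row (A-F) by
# 10-column (1-10) grid over the full image; the exact row/column
# convention used in the benchmark may differ.

def pixel_to_cell(x: float, y: float, width: int, height: int,
                  n_rows: int = 6, n_cols: int = 10) -> str:
    """Map a ball-center pixel coordinate to a grid-cell label such as 'F4'."""
    col = min(int(x / width * n_cols), n_cols - 1)   # 0-based column index
    row = min(int(y / height * n_rows), n_rows - 1)  # 0-based row index
    return f"{chr(ord('A') + row)}{col + 1}"

# Example: a ball detected at (1530, 980) in a 1920x1080 frame.
print(pixel_to_cell(1530, 980, 1920, 1080))  # -> 'F8'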
We evaluate human participants and four state-of-the-art VLMs across three sports and multiple prompting strategies. Models consistently underperform humans not only in accuracy but also in spatial precision and distributional similarity.
Gemini 2.0-flash-001
Google's latest multimodal model with enhanced vision capabilities and reasoning abilities.
GPT-4.1-mini
OpenAI's efficient multimodal model optimized for speed and cost-effectiveness.
Llama 3.2-11B-Vision
Meta's open-source vision-language model with 11B parameters and instruction tuning.
Qwen 2.5-VL-7B
Alibaba's vision-language model with strong multimodal understanding capabilities.
(Level 0)
The ball has been removed from this {sport} image. Your task is to infer the most likely location of the ball.
Respond in the following format:
Reasoning: <Explain where the ball is likely located and why.>
Cell: <What grid cell is the ball most likely located in? Respond with a label like F4.>
(Level 1)
The ball has been removed from this {sport} image. Your task is to infer the most likely location of the ball.
The location of the players, where they are looking and their positions can help you infer the location of the ball.
Respond in the following format:
Reasoning: <Explain where the ball is likely located and why.>
Cell: <What grid cell is the ball most likely located in? Respond with a label like F4.>
(Level 2)
Prompt 1 (Context Collection):
• Where are the players located?
• Where are the players looking?
• How are the players positioned?
Prompt 2 (Final Prediction):
The ball has been removed from this {sport} image. Here are some observations:
{context}
The above information could help you infer the ball's location.
Respond in the following format:
Reasoning: <Explain where the ball is likely located and why.>
Cell: <What grid cell is the ball most likely located in? Respond with a label like F4.>
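For reference, the sketch below shows one way the required response format can be parsed when scoring model outputs. The query_vlm call is a hypothetical wrapper around whichever VLM API is being evaluated; only the regex-based parsing is concrete.

import re

# Hedged sketch of parsing a model response in the required format.
# `query_vlm` is a hypothetical wrapper around the VLM API under test.

LEVEL0_PROMPT = (
    "The ball has been removed from this {sport} image. "
    "Your task is to infer the most likely location of the ball.\n"
    "Respond in the following format:\n"
    "Reasoning: <Explain where the ball is likely located and why.>\n"
    "Cell: <What grid cell is the ball most likely located in? "
    "Respond with a label like F4.>"
)

CELL_PATTERN = re.compile(r"Cell:\s*([A-F](?:10|[1-9]))", re.IGNORECASE)

def parse_cell(response_text: str) -> str | None:
    """Extract the predicted grid cell (e.g. 'F4') from a model response."""
    match = CELL_PATTERN.search(response_text)
    return match.group(1).upper() if match else None

# Example usage with a hypothetical model call:
# response = query_vlm(image_path, LEVEL0_PROMPT.format(sport="soccer"))
# predicted_cell = parse_cell(response)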
Models are less accurate than humans across all sports and prompting levels, as shown in the accuracy plot below, and their Euclidean distances to the ground-truth ball location are also very large.
Figure: Accuracy across models vs. humans.
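The sketch below illustrates how these two headline metrics can be computed from cell labels, assuming the 6×10 grid convention above (rows A–F, columns 1–10); the benchmark's exact scoring, e.g. pixel-level rather than cell-level distance, may differ.

import math

# Minimal sketch of cell accuracy and Euclidean error, assuming labels like
# 'F4' on a 6x10 grid (rows A-F, columns 1-10).

def cell_to_rc(cell: str) -> tuple[int, int]:
    """'F4' -> (row, col) as 0-based indices."""
    return ord(cell[0].upper()) - ord("A"), int(cell[1:]) - 1

def cell_accuracy(preds: list[str], truths: list[str]) -> float:
    """Fraction of predictions landing in the exact ground-truth cell."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def mean_grid_distance(preds: list[str], truths: list[str]) -> float:
    """Mean Euclidean distance between predicted and true cells, in cell units."""
    dists = []
    for p, t in zip(preds, truths):
        (pr, pc), (tr, tc) = cell_to_rc(p), cell_to_rc(t)
        dists.append(math.hypot(pr - tr, pc - tc))
    return sum(dists) / len(dists)

preds, truths = ["F4", "C7", "A1"], ["F4", "D9", "E2"]
print(cell_accuracy(preds, truths))       # fraction of exact cell matches
print(mean_grid_distance(preds, truths))  # average cell-level error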
Volleyball is the most difficult sport for models, whereas humans find it only moderately difficult; this suggests that models and humans reason about the task differently. These differences can be traced to the following two heuristics.
Figure: Proximity to players across models vs. humans.
The most commonly used heuristics involve player body pose and gaze direction.
Chain-of-thought prompting makes models more likely to reason like humans, attending to both pose and gaze more evenly.
Figure: Pose and gaze counts by model and level.
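One simple way to produce such counts is keyword matching over the models' reasoning traces, as sketched below. The keyword lists are illustrative assumptions; the benchmark's own analysis may use a different coding scheme.

from collections import Counter

# Hedged sketch: tally how often reasoning traces mention pose- or
# gaze-related cues via keyword matching. Keyword lists are assumptions.

POSE_TERMS = ("pose", "posture", "body position", "stance", "arms", "legs")
GAZE_TERMS = ("gaze", "looking", "facing", "eyes", "head direction")

def count_cues(reasoning_texts: list[str]) -> Counter:
    """Count how many reasoning traces mention pose-related or gaze-related cues."""
    counts = Counter()
    for text in reasoning_texts:
        lowered = text.lower()
        if any(term in lowered for term in POSE_TERMS):
            counts["pose"] += 1
        if any(term in lowered for term in GAZE_TERMS):
            counts["gaze"] += 1
    return counts

traces = [
    "Reasoning: The players are all looking toward the left corner ...",
    "Reasoning: The striker's stance suggests a shot toward the goal ...",
]
print(count_cues(traces))  # Counter({'gaze': 1, 'pose': 1})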
To better understand failure modes, we visualize model predictions alongside human guesses. Despite accurately describing scene elements in chain-of-thought prompts, models frequently disregard gaze direction, confuse player roles, or default to spatial heuristics such as center bias.
Model predictions cluster near the players' feet despite all players' gaze converging to the left.
Model misattributes ball possession, leading to a guess far from the true location.
Model defaults to the center of the frame (the net area), a low-probability ball location in volleyball.
Our results highlight fundamental limitations of current vision–language models in socially grounded object inference. Prompting strategies are insufficient to close the gap with humans, pointing to deeper issues in how models perceive and reason about social cues such as pose and gaze.
Several future directions emerge from this work.
It is worth noting that training a neural network to localize balls in images is not, by itself, a difficult task. What makes this benchmark distinctive is that success requires social reasoning under uncertainty: inferring hidden states from cues that humans naturally exploit. Addressing this challenge may require not just additional supervision but fundamentally different training objectives or architectures that embed social priors.
The goal of this work is therefore not to provide a final solution, but to expose systematic blind spots in current models and to motivate progress on this problem. By releasing both our dataset and evaluation code, we hope to establish a foundation that the community can build upon in developing models with more human-like capacities for socially grounded inference.
If you find this benchmark or dataset useful in your research, please cite:
@misc{balamurugan2025spottheball,
title = {Spot The Ball: A Benchmark for Visual Social Inference in Sports Scenes},
author = {Balamurugan, Neha and Wu, Sarah and Eyzaguirre, Chris and Chun, Adam and Gaw, Gabe and Gerstenberg, Tobias},
booktitle = {Fill in later},
year = {2025}
}