We introduce a grounded benchmark for physical video understanding that evaluates what, when, and where reasoning across multiple video sources, physics domains, prompt families, and perturbation settings. The results show that spatial grounding remains the weakest part of current video question-answering systems and motivate physically grounded, prompt-aware diagnostics alongside aggregate accuracy.
@article{osmanli2026grounding,
  title={Grounding Video Reasoning in Physical Signals},
  author={Osmanli, Alibay and Cheng, Zixu and Gong, Shaogang},
  journal={arXiv preprint arXiv:2604.21873},
  year={2026},
  doi={10.48550/arXiv.2604.21873},
}