ARGUS: Hallucination and Omission Evaluation in Video-LLMs


University of Maryland, College Park
  *Equal Contribution

We present ARGUS, a novel framework to measure hallucinations (ArgusCost-H; lower is better) and omissions (ArgusCost-O; lower is better) in free-form video caption generation.

Current evaluation strategies rely on a question-answering (QA) paradigm. ARGUS instead lets the model generate a free-form caption first and then quantifies its degree of hallucination and omission.

We introduce dual metrics ArgusCost-H (hallucination) and ArgusCost-O (omission) to quantify how well Video-LLMs stay faithful and complete when generating free-form dense captions.
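For intuition, here is a minimal sketch of how sentence-level dual costs can be computed, assuming a simple entailment-based formulation with a pluggable `entails` judge; the exact ArgusCost-H/ArgusCost-O definitions (including weighting and temporal ordering) are specified in the paper.

```python
# Minimal sketch of dual-cost scoring; assumes a simple entailment-based
# formulation, not the exact ArgusCost definitions from the paper.

def dual_costs(caption_sents, reference_sents, entails):
    """entails(premise, hypothesis) -> bool, e.g. backed by an NLI model."""
    # A generated sentence counts as hallucinated if no reference
    # sentence entails it.
    hallucinated = sum(
        not any(entails(ref, hyp) for ref in reference_sents)
        for hyp in caption_sents
    )
    # A reference sentence counts as omitted if no generated sentence
    # entails it.
    omitted = sum(
        not any(entails(hyp, ref) for hyp in caption_sents)
        for ref in reference_sents
    )
    h_cost = hallucinated / max(len(caption_sents), 1)   # ArgusCost-H (sketch)
    o_cost = omitted / max(len(reference_sents), 1)      # ArgusCost-O (sketch)
    return h_cost, o_cost
```

Lower is better on both costs: a terse model can score a low ArgusCost-H simply by saying little, at the price of a high ArgusCost-O; that trade-off is what Figure 1 visualizes.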

Figure 1 – Relationship between hallucination and omission: most models omit more than they hallucinate; Gemini-2.0-Flash leads with the lowest combined cost.

Key Contributions

  • 📊 ARGUS benchmark – 500 diverse videos with rich 477-word human captions and ≈19 evaluation points per clip.
  • ⚖️ Dual-metric evaluation capturing both hallucination (ArgusCost-H) and coverage (ArgusCost-O) at the sentence level.
  • 🏆 Comprehensive study of 23 Video-LLMs (18 open-source + 5 proprietary) revealing that even state-of-the-art models still hallucinate ~40% of the time.
  • 📈 Check out the paper for analyses of scale, frame sampling, and prompt sensitivity, providing actionable diagnostics for future model design.

ArgusBench: What's inside?

  • 🎥 ArgusBench comprises 500 videos (15–90 s each).
  • 🗿 Human annotators provide dense captions averaging 477 words (24.4 words per sentence), an order of magnitude richer than prior datasets.
  • 🔖 Every sentence is type-tagged (summary, visual description, dynamic action), enabling fine-grained cost accounting; see the schema sketch below.
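Because every sentence carries a type tag, per-type cost breakdowns fall out naturally. Below is a hypothetical record layout; field names are illustrative, not the released ArgusBench format.

```python
from dataclasses import dataclass, field
from typing import Callable, Literal

# Hypothetical schema for a type-tagged clip annotation; names are
# illustrative and not the released ArgusBench format.
SentenceType = Literal["summary", "visual description", "dynamic action"]

@dataclass
class CaptionSentence:
    text: str
    stype: SentenceType

@dataclass
class BenchClip:
    video_id: str
    duration_s: float                      # 15-90 s per the benchmark stats
    sentences: list[CaptionSentence] = field(default_factory=list)

    def cost_by_type(self, per_sentence_cost: Callable[[CaptionSentence], float]):
        """Aggregate an arbitrary per-sentence cost into per-type buckets."""
        buckets: dict[str, float] = {}
        for s in self.sentences:
            buckets[s.stype] = buckets.get(s.stype, 0.0) + per_sentence_cost(s)
        return buckets
```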

Figure 5 – Length, word-count, and sentence-type statistics (clip characteristics) for ArgusBench.

Cite Us