ARGUS: Hallucination and Omission Evaluation in Video-LLMs


University of Maryland, College Park
  *Equal Contribution

We present ARGUS, a novel framework to measure hallucinations (ArgusCost-H; lower is better) and omissions (ArgusCost-O; lower is better) in free-form video caption generation.

Current evaluation strategies rely on a question-answering (QA) paradigm. ARGUS instead lets the model generate a free-form caption first and then quantifies its degree of hallucination and omission.

We introduce dual metrics ArgusCost-H (hallucination) and ArgusCost-O (omission) to quantify how well Video-LLMs stay faithful and complete when generating free-form dense captions.
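For intuition, here is a minimal sketch of how sentence-level dual costs can be computed, assuming a simple entailment-based formulation with a pluggable `entails` judge; the exact ArgusCost-H/ArgusCost-O definitions (including weighting and temporal ordering) are specified in the paper.

```python
# Minimal sketch of dual-cost scoring; assumes a simple entailment-based
# formulation, not the exact ArgusCost definitions from the paper.

def dual_costs(caption_sents, reference_sents, entails):
    """entails(premise, hypothesis) -> bool, e.g. backed by an NLI model."""
    # A generated sentence counts as hallucinated if no reference
    # sentence entails it.
    hallucinated = sum(
        not any(entails(ref, hyp) for ref in reference_sents)
        for hyp in caption_sents
    )
    # A reference sentence counts as omitted if no generated sentence
    # entails it.
    omitted = sum(
        not any(entails(hyp, ref) for hyp in caption_sents)
        for ref in reference_sents
    )
    h_cost = hallucinated / max(len(caption_sents), 1)   # ArgusCost-H (sketch)
    o_cost = omitted / max(len(reference_sents), 1)      # ArgusCost-O (sketch)
    return h_cost, o_cost
```

Lower is better on both costs: a terse model can score a low ArgusCost-H simply by saying little, at the price of a high ArgusCost-O; that trade-off is what Figure 1 visualizes.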

Figure 1 – Relationship between hallucination and omission: most models omit more than they hallucinate; Gemini-2.0-Flash leads with the lowest combined cost.

Key Contributions

  • 📊 ARGUS benchmark – 500 diverse videos with rich 477-word human captions and ≈19 evaluation points per clip.
  • ⚖️ Dual-metric evaluation capturing both hallucination (ArgusCost-H) and coverage (ArgusCost-O) at the sentence level.
  • 🏆 Comprehensive study of 23 Video-LLMs (18 open-source + 5 proprietary) revealing that even state-of-the-art models still hallucinate ~40% of the time.
  • 📈 Check out the paper for analyses of scale, frame sampling, and prompt sensitivity, providing actionable diagnostics for future model design.

ArgusBench: What's inside?

  • 🎥 ArgusBench comprises 500 videos (15–90 s each).
  • 🗿 Human annotators provide dense captions averaging 477 words (24.4 words per sentence), an order of magnitude richer than prior datasets.
  • 🔖 Every sentence is type-tagged (summary, visual description, dynamic action), enabling fine-grained cost accounting; see the schema sketch below.
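Because every sentence carries a type tag, per-type cost breakdowns fall out naturally. Below is a hypothetical record layout; field names are illustrative, not the released ArgusBench format.

```python
from dataclasses import dataclass, field
from typing import Callable, Literal

# Hypothetical schema for a type-tagged clip annotation; names are
# illustrative and not the released ArgusBench format.
SentenceType = Literal["summary", "visual description", "dynamic action"]

@dataclass
class CaptionSentence:
    text: str
    stype: SentenceType

@dataclass
class BenchClip:
    video_id: str
    duration_s: float                      # 15-90 s per the benchmark stats
    sentences: list[CaptionSentence] = field(default_factory=list)

    def cost_by_type(self, per_sentence_cost: Callable[[CaptionSentence], float]):
        """Aggregate an arbitrary per-sentence cost into per-type buckets."""
        buckets: dict[str, float] = {}
        for s in self.sentences:
            buckets[s.stype] = buckets.get(s.stype, 0.0) + per_sentence_cost(s)
        return buckets
```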

Figure 5 – Length, word-count, and sentence-type statistics (clip characteristics) for ArgusBench.

Cite Us