Collecting video clips
We divided 9,396 movie clips sourced from the Movieclips YouTube channel into training and test splits containing 9,248 and 148 videos, respectively. Running our question-answer generation and filtering pipeline over these clips produced 298,888 training questions and 4,940 test questions, an average of about 32 questions per video scene.
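As a quick sanity check, the per-split counts and the questions-per-video average can be recomputed from the QA annotations. The sketch below assumes each split is a JSON list of records carrying a video_id field; the file names and field name are illustrative, not the released format.

```python
import json
from collections import Counter

def split_stats(path: str) -> None:
    """Report video count, question count, and questions per video for one split.
    Assumes a JSON list of QA records, each with a `video_id` field (illustrative schema)."""
    with open(path) as f:
        qa_pairs = json.load(f)
    per_video = Counter(item["video_id"] for item in qa_pairs)
    n_videos, n_questions = len(per_video), len(qa_pairs)
    print(f"{path}: {n_videos} videos, {n_questions} questions, "
          f"{n_questions / n_videos:.1f} questions/video")

for split_file in ("train_qa.json", "test_qa.json"):  # assumed file names
    split_stats(split_file)
```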
Dataset statistics
For each question, we provide additional metadata, such as whether the question is particularly challenging to answer (hardness) and whether it requires visual information (visual reliance) to answer correctly.
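To make the per-question metadata concrete, a single benchmark record might look like the following sketch. The field names and example values are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkQuestion:
    """Illustrative per-question record; field names are assumed, not the released schema."""
    video_id: str          # identifier of the source movie clip
    question: str          # question text
    choices: List[str]     # five options: one correct answer and four distractors
    answer_key: int        # index of the correct option
    hard: bool             # flagged as particularly challenging to answer (hardness)
    visual_reliance: bool  # requires visual information for an accurate response

example = BenchmarkQuestion(
    video_id="clip_00042",
    question="Why does the character leave the room?",
    choices=[
        "She hears a noise outside",
        "She is late for work",
        "She wants to avoid an argument",
        "She forgot her keys",
        "She is looking for her dog",
    ],
    answer_key=2,
    hard=True,
    visual_reliance=True,
)
```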
Model evaluation strategy
Since our dataset consists of multiple-choice questions (MCQs), we evaluate a model on our benchmark by measuring how accurately it selects the right answer from a set of five options: one correct answer and four distractors. Our evaluation uses a two-stage process: first, reliably extract the selected choice from the model's response; then, check whether the extracted choice matches the correct answer key.
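The two stages can be sketched as follows. The regular expressions are an assumption about typical response formats ("The answer is (C)", "C) Because ..."), not the benchmark's official parser, and the function names are illustrative.

```python
import re
from typing import List, Optional

def extract_choice(response: str) -> Optional[str]:
    """Stage 1: pull the selected option letter (A-E) out of a free-form model response.
    The patterns below are assumed response formats, not an official specification."""
    patterns = [
        r"answer is\s*\(?([A-E])\)?",  # e.g. "The answer is (C)"
        r"^\(?([A-E])\)?[).:\s]",      # e.g. "C) Because ..." or "(C) ..."
    ]
    for pat in patterns:
        match = re.search(pat, response.strip(), flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return None  # no option letter could be extracted

def score(responses: List[str], answer_keys: List[str]) -> float:
    """Stage 2: compare extracted choices against the answer keys and return accuracy."""
    correct = sum(
        extract_choice(resp) == key
        for resp, key in zip(responses, answer_keys)
    )
    return correct / len(answer_keys)

# Toy usage: one correct extraction, one wrong answer -> accuracy 0.5
print(score(["The answer is (C).", "(A) She forgot her keys."], ["C", "B"]))
```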
Evaluation results
Among the commercial VLMs analyzed, GPT-4o, GPT-4 Vision, and Gemini 1.5 Pro (Video) emerge as the top performers, each achieving around 60% average accuracy. We note that all models exhibit a 15%-20% drop when evaluated on the hard split.