A New Benchmark for Comprehensive Understanding of Multi-shot Videos

*Equal contribution
1ReLER, AAII, University of Technology Sydney, 2Bytedance Inc., 3Data61, CSIRO, 4Department of Computer Vision, MBZUAI
🌟 5 June Release: Shot2Story-QA benchmark
Explore 11K question-answering pairs for benchmarking video-language models on multi-shot understanding. The quality is ensured by human annotation and verification.
🚀 Demo Update: ChatBot Demo and SumBot Demo
In line with our most recent data release, the demo has been updated. Please take a moment to explore our powerful video summarization model.
🌟 24 April Release: 134K version (QA on the way): 43K manual + 90K GPT-V
Explore 134K videos with detailed text summaries, comprising 43K human-annotated and 90K GPT-V-generated entries. Moreover, we release 188K video shots with human-annotated visual captions, and 95K shots with narration captions.
🌟 Data instruction: Please find our instructions for using and downloading data here.
🚀 Code Release: Video Summarization & Shot Captioning Code
Dive into our code for video summarization and shot captioning, supporting joint visual-audio video analysis. Stay tuned for more tasks and code coming soon!
More news
🚀 Demo Release: SUM-shot model
Detailed and grounded video summaries are powerful! Please check our demo of the SUM-shot model. The Chat-SUM-shot model is on the way!
🌟 New Release: 20K version of Shot2Story
Explore 20K videos with detailed human-annotated summaries, 80K video shots with visual captions, and 40K shots with narration captions for comprehensive audio-visual analysis.

In the era of large models, we introduce the Shot2Story benchmark: a multi-shot video resource with detailed textual annotations, ideal for training and validating diverse, temporally grounded video tasks. Check out the ChatBot and SumBot powered by Shot2Story.


A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.

In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering.

Preliminary experiments reveal the challenges of generating long and comprehensive summaries for multi-shot videos. Nevertheless, the generated, still imperfect summaries already achieve competitive performance on existing video understanding tasks such as video question-answering, promoting an under-explored setting of video understanding with detailed summaries.


We provide 43K videos with diverse topics and contents. Each video is annotated with shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations.

Dataset glance

The dataset includes 43K multi-shot videos, with an average of 4.4 shots per video, resulting in a total of 188K video shots, each with detailed video caption and narration caption annotations. The average length of our video summaries is 218.3, while the average length of a video is 17.1s. In addition, we annotate another 90K multi-shot videos with GPT-V, to facilitate better video-language model training.

For more comprehensive details, please refer to the plots below.

Comparison to datasets

High-level comparison of our dataset to previous ones. The summary length of ActivityNet and YouCook2 is the combined length of the captions in one video. M and G stand for manual and generated, respectively.

Baselines and Tasks

We experiment on tasks including single-shot video captioning, video summarization, and video question-answering. The paper and code are released. We also provide online demos of ChatBot and SumBot. Please have a look!

Single-shot video captioning - visual and audio

This task involves generating descriptions for individual video shots, where the target description is a concatenation of the visual-only caption and the narration caption for a video shot. This task requires a joint understanding of visual and speech information. The model structure is shown below. Visual tokens from the CLIP visual backbone and Q-Former (together with a linear layer from MiniGPT-4), along with text prompts, form the input to Vicuna.
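The visual-token pathway described above can be sketched as follows. This is a minimal PyTorch illustration, not the released implementation: all dimensions, the number of query tokens, and the single cross-attention layer standing in for the Q-Former are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class ShotCaptionerSketch(nn.Module):
    """Illustrative pipeline: frozen visual features -> Q-Former-style
    query cross-attention -> linear projection into the LLM embedding
    space, where visual tokens are prepended to the text prompt."""

    def __init__(self, vis_dim=1024, qformer_dim=768, llm_dim=4096, num_query=32):
        super().__init__()
        # Learned query tokens, as in a Q-Former (dimensions are illustrative).
        self.query_tokens = nn.Parameter(torch.zeros(1, num_query, qformer_dim))
        # Single cross-attention layer standing in for the full Q-Former.
        self.cross_attn = nn.MultiheadAttention(
            qformer_dim, num_heads=8, kdim=vis_dim, vdim=vis_dim, batch_first=True
        )
        # Linear layer mapping into the LLM embedding space (as in MiniGPT-4).
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, frame_feats, prompt_embeds):
        # frame_feats: (B, num_patches, vis_dim) from a frozen CLIP backbone.
        queries = self.query_tokens.expand(frame_feats.size(0), -1, -1)
        vis_tokens, _ = self.cross_attn(queries, frame_feats, frame_feats)
        vis_tokens = self.proj(vis_tokens)  # (B, num_query, llm_dim)
        # Prepend visual tokens to the prompt embeddings fed to the LLM.
        return torch.cat([vis_tokens, prompt_embeds], dim=1)

feats = torch.randn(2, 257, 1024)   # dummy CLIP features for 2 shots
prompt = torch.randn(2, 16, 4096)   # dummy prompt embeddings
out = ShotCaptionerSketch()(feats, prompt)
print(out.shape)                    # torch.Size([2, 48, 4096])
```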

We experiment with VAST under two settings (V+A+S and V+S, where V, A, and S stand for vision, audio, and subtitle, respectively), as well as MiniGPT4-C and VideoChat2-C (C stands for captioning). The models are trained on our Shot2Story single-shot video caption data. See the results below.

Video summarization - visual and audio

Multi-shot video summarization is a new task that is distinct from existing video description tasks. It requires the model to understand the shot structure of the given video and to produce a coherent paragraph describing the progression of events across the different shots.

We propose the SUM-shot model as a powerful baseline for multi-shot video analysis. We sample 4 frames from each video shot and prompt the LLM with frame tokens from the different shots, as shown below.
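The per-shot sampling and shot-aware prompting can be sketched as below. The uniform sampling is standard practice; the placeholder-token prompt template is an assumption for illustration, not the released prompt.

```python
def sample_frame_indices(shot_start, shot_end, num_frames=4):
    """Uniformly pick `num_frames` frame indices within [shot_start, shot_end)."""
    step = (shot_end - shot_start) / num_frames
    return [shot_start + int(step * (i + 0.5)) for i in range(num_frames)]

def build_shot_prompt(num_shots, num_frames=4):
    """Interleave per-shot frame placeholders so the LLM sees the shot
    structure explicitly. The template below is hypothetical."""
    parts = []
    for s in range(num_shots):
        frames = " ".join(f"<frame_{s}_{f}>" for f in range(num_frames))
        parts.append(f"Shot {s + 1}: {frames}")
    return "\n".join(parts) + "\nSummarize the video shot by shot."

idx = sample_frame_indices(0, 120)
print(idx)  # [15, 45, 75, 105]
prompt = build_shot_prompt(2)
```

Keeping the frame tokens grouped by shot, rather than flattening all frames together, is what lets the model ground each sentence of the summary in a specific shot.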

We experiment with different models, such as Video-ChatGPT, MiniGPT4-SUM-shot, MiniGPT4-holistic (which doesn't have shot information), SUM-shot w/o ASR (which doesn't take ASR text as input), and VideoChat2-SUM-shot (equipped with an advanced vision backbone and video pretraining). Experiments show that shot information and ASR information are crucial to multi-shot video summarization. VideoChat2-SUM-shot further confirms the importance of advanced visual representations.

Zero-shot video question-answering

Since the generated summaries are long and complex, traditional captioning metrics (BLEU, METEOR, ROUGE, CIDEr) may not reflect their true quality. We thus adopt another video understanding task, zero-shot video question-answering (QA), to further evaluate the quality of our generated summaries. Specifically, we directly apply our video summarization model to the video QA benchmarks MSRVTT-QA and ActivityNet-QA by splitting the testing videos into video shots and feeding them into the SUM-shot model. The generated summaries and the associated questions are then fed into a Vicuna model to derive the answers.
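The pipeline just described amounts to two stages: summarize, then answer from the summary. A minimal sketch of the data flow, where `summarizer` and `llm` are placeholders for the SUM-shot model and Vicuna (their interfaces here are assumptions, as is the prompt wording):

```python
def zero_shot_video_qa(video_shots, question, summarizer, llm):
    """Two-stage zero-shot QA: generate one summary for the whole
    multi-shot video, then answer the question from that summary alone."""
    summary = summarizer(video_shots)
    prompt = (
        "Answer the question based only on the video summary.\n"
        f"Summary: {summary}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)

# Toy stand-ins just to show the wiring; not real models.
fake_summarizer = lambda shots: f"A video with {len(shots)} shots about cooking."
fake_llm = lambda p: "cooking" if "cooking" in p else "unknown"

answer = zero_shot_video_qa(
    ["shot1", "shot2"], "What is the video about?", fake_summarizer, fake_llm
)
print(answer)  # cooking
```

Because the LLM never sees the video, QA accuracy becomes an indirect measure of how much usable information the summary retained.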


@article{han2023shot2story,
          title={Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos}, 
          author={Mingfei Han and Linjie Yang and Xiaojun Chang and Heng Wang},
          journal={arXiv preprint arXiv:2311.17043},
          year={2023}
}