Shot2Story

A New Benchmark for Comprehensive Understanding of Multi-shot Videos

*Equal contribution
1ReLER, AAII, University of Technology Sydney, 2Bytedance Inc., 3Data61, CSIRO, 4Department of Computer Vision, MBZUAI
🌟 Online demo is back in service: HF Space
Please feel free to raise an issue on GitHub or HF Space if the demo fails again.
🌟 Multi-shot videos download: HF storage
In addition to the OneDrive storage, we provide additional Hugging Face storage for the multi-shot videos. Please follow the release license.
🌟 5 June Release: Shot2Story-QA benchmark
Explore 11K question-answering pairs for benchmarking video-language models on multi-shot understanding. The quality is ensured by human annotation and verification.
πŸš€ Demo Update: ChatBot Demo and SumBot Demo
In line with our most recent data release, the demo has been updated. Please take a moment to explore our powerful video summarization model.
🌟 24 April Release: 134K version (QA on the way): 43K manual + 90K GPT-V
Explore 134K videos with detailed text summaries, consisting of 43K human-annotated and 90K GPT-V-generated summaries. Moreover, we release 188K video shots with human-annotated visual captions and 95K shots with narration captions.
🌟 Data instruction: Please find our instructions for using and downloading data here.
πŸš€ Code Release: Video Summarization & Shot Captioning Code
Dive into our code for video summarization and shot captioning, enhancing visual-audio video analysis. Stay tuned for more tasks and code coming soon!
More news
πŸš€ Demo Release: SUM-shot model
Detailed and grounded video summaries are powerful! Please check our demo of the SUM-shot model. The Chat-SUM-shot model is on the way!
🌟 New Release: 20K version of Shot2Story
Explore 20K videos with detailed human-annotated summaries, 80K video shots with visual captions, and 40K shots with narration captions for comprehensive audio-visual analysis.

In the era of large models, we introduce the Shot2Story benchmark: a multi-shot video resource with detailed textual annotations, ideal for training and validating diverse, temporally distinct video tasks. Check out the ChatBot and SumBot powered by Shot2Story.

Abstract

A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the shots to understand the story behind the video.

In this work, we present a new multi-shot video understanding benchmark, Shot2Story, with detailed shot-level captions, comprehensive video summaries and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering.

Preliminary experiments show some challenges in generating long and comprehensive summaries for multi-shot videos. Nevertheless, the generated imperfect summaries can already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.




Shot2Story


We provide 43K videos with diverse topics and content. Each video is annotated with shot-level captions and a comprehensive video summary. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations.


Dataset at a glance

The dataset includes 43K multi-shot videos, with an average of 4.4 shots per video, resulting in a total of 188K video shots, each with detailed visual caption and narration caption annotations. The average length of our video summaries is 218.3 words, while the average length of a video is 17.1s. In addition, we annotate another 90K multi-shot videos with GPT-V to facilitate better video-language model training.
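
For reference, the headline statistics above could be recomputed from the released annotations with a short script like the one below, assuming a hypothetical schema (a JSON list of per-video records with per-shot 'captions' and a 'whole_caption' summary); please check the data instructions for the actual format:

    import json

    def dataset_stats(annotation_path):
        # Hypothetical schema: a JSON list of per-video records.
        with open(annotation_path) as f:
            videos = json.load(f)

        num_videos = len(videos)
        num_shots = sum(len(v["captions"]) for v in videos)   # one entry per shot (assumed)
        avg_shots = num_shots / num_videos
        avg_summary_words = sum(len(v["whole_caption"].split()) for v in videos) / num_videos
        return num_videos, num_shots, avg_shots, avg_summary_words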

For more comprehensive details, please refer to the plots below.


Comparison to datasets

A high-level comparison of our dataset to previous ones. The summary lengths of ActivityNet and YouCook2 are the combined lengths of the captions in one video. M and G stand for manual and generated, respectively.


Baselines and Tasks

We experiment on tasks including single-shot video captioning, video summarization and video question-answering. The paper and code are released. We also provide online demos of ChatBot and SumBot. Please have a look!


Single-shot video captioning - visual and audio

This task involves generating descriptions for individual video shots, where the target description is a concatenation of the visual-only caption and the narration caption of a video shot. This task requires a joint understanding of visual and speech information. The model structure is shown below. Visual tokens from the CLIP visual backbone and Q-Former (together with a linear layer from MiniGPT4), along with text prompts, form the input to Vicuna.
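
For illustration, here is a minimal PyTorch-style sketch of that flow, assuming placeholder modules (clip_visual, q_former, linear_proj, vicuna) rather than the exact interfaces of the released code:

    import torch

    def caption_single_shot(frames, prompt_ids, clip_visual, q_former, linear_proj, vicuna, tokenizer):
        # Sketch only: all modules are placeholders; see the released code for the real implementation.
        # frames: (num_frames, 3, H, W); prompt_ids: (1, seq_len) tokenized text prompt (incl. ASR text).
        with torch.no_grad():
            patch_feats = clip_visual(frames)           # frozen CLIP visual backbone -> (num_frames, num_patches, dim)
        query_tokens = q_former(patch_feats)            # Q-Former compresses patch features into query tokens
        visual_tokens = linear_proj(query_tokens)       # linear layer maps tokens into the LLM embedding space
        visual_tokens = visual_tokens.flatten(0, 1).unsqueeze(0)   # concatenate tokens over frames, add batch dim

        prompt_embeds = vicuna.get_input_embeddings()(prompt_ids)
        inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
        output_ids = vicuna.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)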

We experiment with VAST under two settings (V+A+S and V+S, where V, A and S stand for vision, audio and subtitle, respectively), as well as with MiniGPT4-C and VideoChat2-C (C stands for captioning). The models are trained on our Shot2Story single-shot video caption data. See the results below.


Video summarization - visual and audio

Multi-shot video summarization is a new task that is distinct from existing video description tasks. It requires the model to understand the shot structure of the given video and to provide a coherent paragraph describing the progression of events across the different shots.

We propose the SUM-shot model as a powerful baseline for multi-shot video analysis. We sample 4 frames from each video shot and prompt the LLM with frame tokens from the different shots, as shown below.
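
As a rough sketch, the shot-structured prompt could be assembled as below; the token markers and wording are illustrative and are not the exact prompt used in the released code:

    def build_sum_shot_prompt(num_shots, asr_texts, frames_per_shot=4):
        # Illustrative only: <Img><FrameTokens></Img> marks where the projected frame tokens
        # of one sampled frame are spliced into the LLM input.
        parts = []
        for shot_idx in range(num_shots):
            frame_slots = " ".join("<Img><FrameTokens></Img>" for _ in range(frames_per_shot))
            asr = asr_texts[shot_idx] or "No speech."
            parts.append(f"Shot {shot_idx + 1}: {frame_slots} Speech: {asr}")
        parts.append("Please describe the whole video, covering the progression of events across the shots.")
        return "\n".join(parts)

    # Example: a 3-shot video with ASR only in the second shot.
    print(build_sum_shot_prompt(3, ["", "Welcome back to the channel.", ""]))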

We experiment with different models: Video-ChatGPT, MiniGPT4-SUM-shot, MiniGPT4-holistic (which doesn't have shot information), SUM-shot w/o ASR (which doesn't take ASR text as input), and VideoChat2-SUM-shot (equipped with an advanced vision backbone and video pretraining). Experiments show that shot information and ASR information are crucial to multi-shot video summarization. VideoChat2-SUM-shot further confirms the importance of advanced visual representations.


Multi-shot video question-answering

Generated video summaries are supposed to be grounded and detailed, covering rich elements such as event progression, holistic topics and audio elements, making them suitable for other vision tasks such as video question-answering. Existing works use image or video frame captions as input to an LLM to generate question responses. However, little work has explored the capacity of video summaries for this purpose. We directly apply our video summarization model to video QA benchmarks, i.e., MSRVTT-QA, ActivityNet-QA and our Shot2Story-QA.

Specifically, we first split the testing videos into video shots, and then feed them into our SUM-shot model. The generated summaries and the associated questions are then fed into a Vicuna model to derive the answers. We leverage the gpt-3.5-turbo model to generate a binary decision on whether each answer is correct.
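
A minimal sketch of this evaluation loop is given below; generate_summary and answer_with_vicuna stand in for the SUM-shot and Vicuna components, the sample schema is hypothetical, and the judging prompt is illustrative rather than the exact one we use:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge_answer(question, prediction, ground_truth):
        # Ask gpt-3.5-turbo for a binary correctness decision (illustrative prompt).
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    "You are evaluating a video question-answering prediction.\n"
                    f"Question: {question}\n"
                    f"Correct answer: {ground_truth}\n"
                    f"Predicted answer: {prediction}\n"
                    "Reply 'yes' if the prediction matches the correct answer, otherwise reply 'no'."
                ),
            }],
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    def evaluate(samples, generate_summary, answer_with_vicuna):
        # samples: iterable of dicts with 'shots', 'question' and 'answer' keys (hypothetical schema).
        correct = 0
        for sample in samples:
            summary = generate_summary(sample["shots"])                   # SUM-shot model (placeholder)
            prediction = answer_with_vicuna(summary, sample["question"])  # Vicuna answers from the summary
            correct += judge_answer(sample["question"], prediction, sample["answer"])
        return correct / len(samples)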

We benchmark existing models and our proposed video summary models on Shot2Story-QA. Specifically, four popular video-VLMs are compared, i.e., Video-ChatGPT, LLaMA-VID, VideoChat2 and Video-LLaVA. The predicted summaries of MiniGPT4-SUM-shot and VideoChat2-SUM-shot are used to tackle the QA task with the same configuration as zero-shot VQA. Accuracies on temporal-related, holistic multi-shot understanding and audio-related questions are reported, with the "overall" metric showing the average score over these three sub-tasks. The current video-VLMs present unsatisfactory results, potentially due to two factors: (1) the current models do not take audio or ASR as input, limiting their capacity for audio-related understanding; (2) current models lack training data with detailed descriptions of multi-shot videos, weakening their performance on holistic understanding and temporal modeling. For our models, VideoChat2-SUM-shot achieves an overall score of 40.5, surpassing the compared models and MiniGPT4-SUM-shot on all three sub-tasks. This performance underscores the benefits of the video pretraining and the advanced visual backbone of VideoChat2. These baseline results highlight the complexities and demanding nature of our Shot2Story-QA task.


Zero-shot video question-answering

As shown in the table below, our results with VideoChat2-SUM-shot surpass 5 out of 6 existing video-VLMs on MSRVTT-QA and 4 out of 6 existing models on ActivityNet-QA. Furthermore, our results are comparable to the SOTA performance of Video-LLaVA on MSRVTT-QA.

Note that these models require extensive instruction-tuning data to learn to generate answers directly from visual features and the text prompt, whereas our model bypasses instruction tuning by distilling the video information into a summary. Our model also follows the zero-shot QA setting, since it only uses Shot2Story as training data. Note that MSRVTT contains a large portion of videos on out-of-domain topics such as TV shows and food, while ActivityNet has much longer videos than our training videos. This validates the robustness and transferability of our model across different topics and longer videos. This surprisingly good result indicates that a comprehensive and detailed video summary is a high-quality abstraction of the video, facilitating a wide range of tasks including video QA and video-based conversation.

BibTeX


        @article{han2023shot2story20k,
          title={Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos}, 
          author={Mingfei Han and Linjie Yang and Xiaojun Chang and Heng Wang},
          journal={arXiv preprint arXiv:2311.17043},
          year={2023}
        }